PDF stage 4.12: non-embedded font substitution + standard-14 AFM widths#581
PDF stage 4.12: non-embedded font substitution + standard-14 AFM widths#581andiwand wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4582fbffeb
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if (const std::optional<double> width = | ||
| afm_code_width(metrics, static_cast<std::uint8_t>(code))) { | ||
| return *width / 1000.0; |
There was a problem hiding this comment.
Avoid built-in widths after explicit encodings
When a standard-14 font has an explicit /Encoding and the encoded glyph name is not present in the AFM table (for example an AGL suffix such as /A.alt, which this codebase already strips for Unicode mapping), this falls through to afm_code_width and uses the width of the font's built-in glyph at the same numeric code. That code slot can be unrelated to the /Differences glyph, so the advance is wrong and subsequent text placement drifts; after an explicit encoding miss, prefer stripping the suffix for the AFM lookup or falling back to /MissingWidth rather than the built-in code table.
Useful? React with 👍 / 👎.
A font with no `/FontFile*` was rendered in the browser's default family with
`advance_width` returning `/MissingWidth` (usually 0), so base-14 text (the
common non-embedded case) placed and spaced wrong.
- `pdf_afm_data.{hpp,cpp}`: the Adobe Core-14 AFM glyph metrics (name->width,
the built-in code->width table, header ascent/descent/cap-height), generated
by `tools/pdf/generate_afm_data.py` from a pinned Apache PDFBox tag (no
build-time dependency; regenerate on demand like the AGL/CMap generators).
- `pdf_afm.{hpp,cpp}`: `resolve_font_substitute` maps a `/BaseFont` name +
`/FontDescriptor` hints (`/Flags`, `/FontWeight`, `/ItalicAngle`) to a CSS
`font-family` stack (serif/sans/mono + bold/italic) and the standard-14 font
whose AFM widths back placement; `afm_width`/`afm_code_width` look them up.
- `Font::substitute` resolved at parse time for non-embedded simple fonts;
`advance_width` falls back to the AFM (via the `/Encoding` glyph name, or the
built-in encoding for Symbol/ZapfDingbats) when `/Widths` does not cover a
code. A substitute also supplies the baseline shift when the descriptor
declares no `/Ascent`.
- HTML: fallback (`font == 0`) runs carry an interned `ff` font-family class in
both the dual-layer visual layer and the single-layer path (the substitute is
folded into the single-layer flow key so two families never share a block).
Verified end-to-end: a non-embedded serif form renders `.ff1{font-family:'Times
New Roman',Times,serif}` in both text modes. Unit tests cover the classifier,
the AFM lookups and the `advance_width` fallback.
Note: corpus reference output changes (added `ff` classes, corrected widths) and
must be regenerated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_018g7RqxjUq8rKTBTMwMfWMy
4582fbf to
9b74560
Compare
🤖 Generated with Claude Code
First of the stage-4 close-out stack (stacked on
pdf-text-selection).What
A non-embedded font (no
/FontFile*) was rendered in the browser's default family, andadvance_widthreturned/MissingWidth(usually 0) for it — so the common base-14 case (Helvetica/Times/Courier without an embedded program) placed and spaced text wrong and had no sensible family.pdf_afm_data.{hpp,cpp}— Adobe Core-14 AFM glyph metrics (name→width, the built-in code→width table, ascent/descent/cap-height), generated bytools/pdf/generate_afm_data.pyfrom a pinned Apache PDFBox tag. No build-time dependency; regenerate on demand, exactly like the AGL / CMap generators.pdf_afm.{hpp,cpp}—resolve_font_substitutemaps a/BaseFontname +/FontDescriptorhints (/Flags,/FontWeight,/ItalicAngle) to a CSSfont-familystack (serif/sans/mono + bold/italic) and the standard-14 font whose AFM widths back placement;afm_width/afm_code_widthlook them up.Font::substituteresolved at parse time for non-embedded simple fonts.advance_widthfalls back to the AFM — via the/Encodingglyph name, or the built-in encoding for Symbol/ZapfDingbats — when/Widthsdoesn't cover a code. It also supplies the baseline shift when the descriptor has no/Ascent.font == 0) runs carry an internedfffont-family class in both the dual-layer visual layer and the single-layer path (the substitute is part of the single-layer flow key so two families never share a line block).Verification
advance_widthfallback (encoding, built-in, and/Widthsprecedence). All 206 PDF tests pass..ff1{font-family:'Times New Roman',Times,serif}in both text modes.Note
Corpus reference output changes (added
ffclasses, corrected widths) and must be regenerated in the reference-output submodules — left manual per the plan.