Skip to content

PDF stage 4.12: non-embedded font substitution + standard-14 AFM widths#581

Open
andiwand wants to merge 1 commit into
mainfrom
pdf-stage-4.12-nonembedded-fonts
Open

PDF stage 4.12: non-embedded font substitution + standard-14 AFM widths#581
andiwand wants to merge 1 commit into
mainfrom
pdf-stage-4.12-nonembedded-fonts

Conversation

@andiwand

@andiwand andiwand commented Jul 4, 2026

Copy link
Copy Markdown
Member

🤖 Generated with Claude Code

First of the stage-4 close-out stack (stacked on pdf-text-selection).

What

A non-embedded font (no /FontFile*) was rendered in the browser's default family, and advance_width returned /MissingWidth (usually 0) for it — so the common base-14 case (Helvetica/Times/Courier without an embedded program) placed and spaced text wrong and had no sensible family.

  • pdf_afm_data.{hpp,cpp} — Adobe Core-14 AFM glyph metrics (name→width, the built-in code→width table, ascent/descent/cap-height), generated by tools/pdf/generate_afm_data.py from a pinned Apache PDFBox tag. No build-time dependency; regenerate on demand, exactly like the AGL / CMap generators.
  • pdf_afm.{hpp,cpp}resolve_font_substitute maps a /BaseFont name + /FontDescriptor hints (/Flags, /FontWeight, /ItalicAngle) to a CSS font-family stack (serif/sans/mono + bold/italic) and the standard-14 font whose AFM widths back placement; afm_width / afm_code_width look them up.
  • Font::substitute resolved at parse time for non-embedded simple fonts. advance_width falls back to the AFM — via the /Encoding glyph name, or the built-in encoding for Symbol/ZapfDingbats — when /Widths doesn't cover a code. It also supplies the baseline shift when the descriptor has no /Ascent.
  • HTML — fallback (font == 0) runs carry an interned ff font-family class in both the dual-layer visual layer and the single-layer path (the substitute is part of the single-layer flow key so two families never share a line block).

Verification

  • Unit tests: classifier (name + flags + weight/angle), AFM lookups, and the advance_width fallback (encoding, built-in, and /Widths precedence). All 206 PDF tests pass.
  • End-to-end on a non-embedded serif form: renders .ff1{font-family:'Times New Roman',Times,serif} in both text modes.

Note

Corpus reference output changes (added ff classes, corrected widths) and must be regenerated in the reference-output submodules — left manual per the plan.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4582fbffeb

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +63 to +65
if (const std::optional<double> width =
afm_code_width(metrics, static_cast<std::uint8_t>(code))) {
return *width / 1000.0;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid built-in widths after explicit encodings

When a standard-14 font has an explicit /Encoding and the encoded glyph name is not present in the AFM table (for example an AGL suffix such as /A.alt, which this codebase already strips for Unicode mapping), this falls through to afm_code_width and uses the width of the font's built-in glyph at the same numeric code. That code slot can be unrelated to the /Differences glyph, so the advance is wrong and subsequent text placement drifts; after an explicit encoding miss, prefer stripping the suffix for the AFM lookup or falling back to /MissingWidth rather than the built-in code table.

Useful? React with 👍 / 👎.

Base automatically changed from pdf-text-selection to main July 4, 2026 09:00
A font with no `/FontFile*` was rendered in the browser's default family with
`advance_width` returning `/MissingWidth` (usually 0), so base-14 text (the
common non-embedded case) placed and spaced wrong.

- `pdf_afm_data.{hpp,cpp}`: the Adobe Core-14 AFM glyph metrics (name->width,
  the built-in code->width table, header ascent/descent/cap-height), generated
  by `tools/pdf/generate_afm_data.py` from a pinned Apache PDFBox tag (no
  build-time dependency; regenerate on demand like the AGL/CMap generators).
- `pdf_afm.{hpp,cpp}`: `resolve_font_substitute` maps a `/BaseFont` name +
  `/FontDescriptor` hints (`/Flags`, `/FontWeight`, `/ItalicAngle`) to a CSS
  `font-family` stack (serif/sans/mono + bold/italic) and the standard-14 font
  whose AFM widths back placement; `afm_width`/`afm_code_width` look them up.
- `Font::substitute` resolved at parse time for non-embedded simple fonts;
  `advance_width` falls back to the AFM (via the `/Encoding` glyph name, or the
  built-in encoding for Symbol/ZapfDingbats) when `/Widths` does not cover a
  code. A substitute also supplies the baseline shift when the descriptor
  declares no `/Ascent`.
- HTML: fallback (`font == 0`) runs carry an interned `ff` font-family class in
  both the dual-layer visual layer and the single-layer path (the substitute is
  folded into the single-layer flow key so two families never share a block).

Verified end-to-end: a non-embedded serif form renders `.ff1{font-family:'Times
New Roman',Times,serif}` in both text modes. Unit tests cover the classifier,
the AFM lookups and the `advance_width` fallback.

Note: corpus reference output changes (added `ff` classes, corrected widths) and
must be regenerated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_018g7RqxjUq8rKTBTMwMfWMy
@andiwand andiwand force-pushed the pdf-stage-4.12-nonembedded-fonts branch from 4582fbf to 9b74560 Compare July 4, 2026 09:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant