PDF text: dual-layer + single-layer rendering with PdfTextMode option #579

Open: andiwand wants to merge 28 commits into main from pdf-text-selection


andiwand (Member) commented Jul 1, 2026

🤖 Generated with Claude Code

Summary

Combines the prototypes from #577 and #578 into a single implementation with a user-selectable mode.

  • Adds PdfTextMode enum to HtmlConfig (dual_layer default, single_layer opt-in)
  • Both modes use line blocks (position:absolute on the line <div>, margin-left on inline run <span>s) rather than per-glyph absolute positioning — forward-compatible with future paragraph grouping
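
The option described above could look roughly like the following. This is a minimal sketch, not the PR's actual declaration: the names `PdfTextMode`, `dual_layer`, `single_layer`, `HtmlConfig` and `pdf_text_mode` come from the PR text, while the use of `enum class`, the reduced `HtmlConfig` with a single field, and the `describe` helper are assumptions made for illustration.

```cpp
#include <string_view>

// Sketch only: whether the real enum is scoped is an assumption.
enum class PdfTextMode {
  dual_layer,   // visual PUA layer + transparent Unicode selection layer
  single_layer, // one combined layer, pdf2htmlEX-style
};

// Sketch only: the real HtmlConfig carries many more fields.
struct HtmlConfig {
  PdfTextMode pdf_text_mode{PdfTextMode::dual_layer}; // dual_layer is the default
};

// Hypothetical helper (not part of the PR): a name for logging or CLI output.
constexpr std::string_view describe(PdfTextMode mode) {
  switch (mode) {
  case PdfTextMode::dual_layer:
    return "dual_layer";
  case PdfTextMode::single_layer:
    return "single_layer";
  }
  return "unknown";
}
```

A caller would opt in to the single-layer path by setting `config.pdf_text_mode = PdfTextMode::single_layer;` before translating.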

Dual-layer mode (PdfTextMode::dual_layer, default)

Similar approach to pdf.js:

  • Visual layer (<div class="vis" aria-hidden="true">): paint-order glyph rendering using fonts re-encoded to the Private Use Area. Invisible text (Tr 3/7) omitted.
  • Selection/search layer (<div class="sel">): transparent real-Unicode text in reading order. Runs grouped into per-baseline line blocks; gap detection inserts display:inline-block spacer spans. Each run span uses CSS text-align:justify; text-align-last:justify; text-justify:inter-character to spread characters to match the PDF advance — no JavaScript.

Single-layer mode (PdfTextMode::single_layer)

Similar approach to pdf2htmlEX:

  • Pre-pass frequency analysis: counts (uchar, glyph) co-occurrences per font across all pages, then picks the most-frequent glyph for each Unicode character as the cmap winner (the common case wins, rather than the first pair seen).
  • Clean runs (all uchar→glyph pairs match the winner): real Unicode rendered directly in the embedded font — natively selectable and findable.
  • Unclean runs: glyphs painted via ::before{content:attr(data-g)} CSS generated content with a zero-width display:inline-block; overflow:hidden overlay <span> carrying the real Unicode for selection.
  • PUA-only characters (no Unicode mapping): remain visible but unselectable.
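
The frequency pre-pass and the clean/unclean split can be sketched for a single font as below. All names here (`Glyph`, `Pair`, `pick_winners`, `is_clean`) are hypothetical, and breaking ties in favour of the lowest glyph id is a choice made for this sketch rather than something the PR states.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Glyph = std::uint16_t; // hypothetical glyph-id type

struct Pair {
  char32_t uchar; // Unicode character the PDF maps the code to
  Glyph glyph;    // glyph actually painted
};

// Count every (uchar, glyph) co-occurrence, then keep the most frequent glyph
// per uchar. Ties go to the lowest glyph id because the map iterates in key
// order and the comparison is strict (sketch-only choice).
std::map<char32_t, Glyph> pick_winners(const std::vector<Pair>& seen) {
  std::map<std::pair<char32_t, Glyph>, std::size_t> counts;
  for (const Pair& p : seen) {
    ++counts[{p.uchar, p.glyph}];
  }
  std::map<char32_t, Glyph> winner;
  std::map<char32_t, std::size_t> best;
  for (const auto& [key, n] : counts) {
    if (n > best[key.first]) {
      best[key.first] = n;
      winner[key.first] = key.second;
    }
  }
  return winner;
}

// A run is "clean" when every pair in it matches the winning glyph.
bool is_clean(const std::vector<Pair>& run,
              const std::map<char32_t, Glyph>& winner) {
  for (const Pair& p : run) {
    auto it = winner.find(p.uchar);
    if (it == winner.end() || it->second != p.glyph) {
      return false;
    }
  }
  return true;
}
```

With 'A' seen twice as glyph 5 and once as glyph 9, glyph 5 wins; a run that paints 'A' with glyph 9 is then unclean and would take the generated-content path.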

Test plan

  • Build passes, all 658 tests pass
  • Dual-layer output (style-various-1.pdf): class="vis" aria-hidden + class="sel" divs present; visual spans contain PUA bytes; selection spans contain readable Unicode
  • Single-layer output (--single flag on CLI): gl + ov classes present; data-g attributes contain PUA bytes; inline text contains readable Unicode
  • Both modes render correctly in the browser
  • Text selection and find-in-page work in both modes

andiwand and others added 2 commits July 1, 2026 19:06
Introduces a `PdfTextMode` enum with two values:
- `dual_layer`: visual (PUA glyphs, paint order) + transparent Unicode
  selection/search layer. Default.
- `single_layer`: single combined layer with frequency-based Unicode
  mapping, similar to pdf2htmlEX.

The active mode is controlled by `HtmlConfig::pdf_text_mode`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq
Replaces the single-glyph-per-absolute-span approach with two modes,
both using line blocks (position:absolute on the line div, margin-left
on inline run spans) instead of per-glyph absolute positioning.

Dual-layer mode (default, PdfTextMode::dual_layer):
- Visual layer (<div class="vis" aria-hidden>): paint-order glyph
  rendering. Fonts re-encoded to PUA. Invisible text omitted.
- Selection layer (<div class="sel">): transparent real-Unicode text.
  Runs grouped into line blocks by baseline; space detection inserts
  gap spans. Each run span is display:inline-block with CSS justify
  (text-align:justify; text-align-last:justify; text-justify:inter-
  character) so characters fill the PDF advance without JavaScript.
- Similar approach to pdf.js.

Single-layer mode (PdfTextMode::single_layer):
- One combined layer per page in paint order.
- Pre-pass frequency analysis: counts (uchar, glyph) co-occurrences
  per font, then picks the most-frequent glyph as the cmap entry,
  so the common case wins rather than the first pair seen.
- Clean runs (all uchar→glyph pairs match the winner) render the real
  Unicode directly in the embedded font — natively selectable.
- Unclean runs paint glyphs via ::before{content:attr(data-g)} with
  a zero-width display:inline-block overlay span for selectability.
- PUA-only chars (no Unicode mapping) remain visible but unselectable.
- Similar approach to pdf2htmlEX.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq

chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 74f51ee76f


Comment thread src/odr/internal/html/pdf_file.cpp Outdated
Comment thread src/odr/internal/html/pdf_file.cpp Outdated
andiwand and others added 3 commits July 1, 2026 22:29
Shared static methods (`px_decl`, `ascent_em`, `glyph_run_str`,
`escape_markup`) and a template `handle_graphic_element` replace the
copy-pasted lambdas in both rendering modes (-60 lines, cleaner diffs).
The single-layer `add_class` captures `styles` from scope to match the
dual-layer signature; `AtomicStyles styles` is moved up before the pre-
pass so the capture is valid.

Two dual-layer correctness fixes (from code-review):
- Add letter-spacing/word-spacing to visual runs when Tc/Tw are non-zero,
  so embedded glyphs space correctly for PDFs with custom char/word
  spacing.
- Move vis_prev_* state updates inside the `if (!invisible)` block so
  invisible/clip-mode runs do not shift the next visible run's position.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq
Adds a standalone test that translates style-various-1.pdf through both
dual_layer and single_layer modes and asserts the output document.html
contains the expected marker classes (vis+sel for dual, line-block t
for single). Prevents silent regressions if a mode is broken.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq
andiwand and others added 11 commits July 2, 2026 14:19
pt_to_px/pt_to_in, the SFNT/CFF usability probe, the fvN/fnN class
helper, the run's left/top-or-matrix placement classes, and the
post-pass font-face/style writer were each copy-pasted between the
dual-layer and single-layer paths. Hoist them into shared statics
(add_position_classes, font_is_usable, font_class, write_font_face)
used by both. Verified byte-identical document.html output for both
PdfTextMode values across several PDF fixtures before/after.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
Tight-continuation runs were merged into the previous .sr span's text
without recomputing its declared width, leaving it at the first
sub-run's width while the visible text grew arbitrarily longer (e.g.
"Particle Acceleration and Detection" declared 10px wide). Track each
open run's starting x-offset and re-derive the width on merge.

Also propagate font-size to the selection layer (runs, gap spacers,
and the trailing space that closes a line), which previously inherited
the browser default and could overflow/clip against the PDF-derived
width, desyncing the invisible hit-test text from the true glyph run.
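
The width fix for merged runs can be illustrated with a small sketch. `OpenRun` and `merge_tight_run` are hypothetical names, and the sketch assumes the merged span should extend from the first sub-run's start to the end of the sub-run just merged.

```cpp
#include <string>
#include <string_view>

// Hypothetical stand-in for an open selection-layer run (.sr span).
struct OpenRun {
  std::string text;
  double start_x; // x-offset where the first sub-run began
  double width;   // declared CSS width of the span
};

// Merge a tight-continuation sub-run into the open run. The width is
// re-derived from the run's original start instead of being left at the
// first sub-run's width.
void merge_tight_run(OpenRun& run, std::string_view more, double sub_start_x,
                     double sub_width) {
  run.text += more;
  run.width = (sub_start_x + sub_width) - run.start_x;
}
```

Without the re-derivation, `width` would stay at the first sub-run's value while `text` kept growing, which is the mismatch the commit describes.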
…rder

.sg (gap spacer) lacked the overflow:hidden that .sr (text run) has;
per CSS an inline-block's baseline is its content's text baseline when
overflow is visible but the bottom margin edge otherwise, so the two
box types baseline-aligned differently within the same line, visibly
shifting spaces in y. Give .sg the same overflow:hidden.

Also, content-stream order doesn't always run top-to-bottom (margins,
columns), which made drag-selection highlight rows inconsistently.
Stable-sort each page's selection lines by baseline y after the page
is fully processed, keeping content-stream (x) order intact for lines
on the same row.
…cmap

Large glyph counts exceeded the 6400-slot BMP PUA and threw. Spill the
overflow into Supplementary PUA-A and emit a format-12 cmap subtable to
cover the beyond-BMP code points, clamping OS/2 usFirst/usLastCharIndex
to 0xFFFF. Also add configurable dual-layer selection fallback fonts and
a size-adjust so the invisible selection text widths track the PDF boxes.
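
A minimal sketch of the spill-over allocation, assuming glyphs are assigned consecutive PUA slots starting at zero. The function and constant names are hypothetical; the ranges are the standard ones (BMP PUA U+E000 to U+F8FF, 6400 code points; Supplementary PUA-A U+F0000 to U+FFFFD).

```cpp
#include <cstdint>
#include <optional>

constexpr std::uint32_t kBmpPuaFirst = 0xE000;
constexpr std::uint32_t kBmpPuaCount = 0x1900; // 6400 slots: U+E000..U+F8FF
constexpr std::uint32_t kSpuaAFirst = 0xF0000;
constexpr std::uint32_t kSpuaACount = 0xFFFE;  // U+F0000..U+FFFFD

// Map the n-th re-encoded glyph to a private-use code point: fill the BMP PUA
// first, then spill into Supplementary PUA-A. Returns nullopt when both
// ranges are exhausted.
constexpr std::optional<char32_t> pua_code_point(std::uint32_t slot) {
  if (slot < kBmpPuaCount) {
    return static_cast<char32_t>(kBmpPuaFirst + slot);
  }
  slot -= kBmpPuaCount;
  if (slot < kSpuaACount) {
    return static_cast<char32_t>(kSpuaAFirst + slot);
  }
  return std::nullopt;
}

// OS/2 usFirstCharIndex / usLastCharIndex are 16-bit fields, so code points
// beyond the BMP are clamped to 0xFFFF.
constexpr std::uint16_t os2_char_index(char32_t cp) {
  return cp > 0xFFFF ? std::uint16_t{0xFFFF} : static_cast<std::uint16_t>(cp);
}
```

Code points returned from the second range lie outside the BMP, which is why a format-12 cmap subtable is needed to address them.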

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
Correctness:
- Treat pure 180° rotation (a=d=-1) as a matrix transform by also
  requiring m.a > 0 for the axis-aligned fast path; previously it fed a
  negative m.a into font-size and the left/top math. Both modes.
- Guard dual-layer visual word-spacing: it is inert on PUA glyph runs
  (which never emit a literal space) and must skip composite fonts (PDF
  Tw applies only to single-byte code 32), matching single-layer.
- Measure the selection-layer line-break against the previous run's
  font size, not the current run's, so it can't drift from the visual
  and single-layer heuristics. Extracted a shared starts_new_line().
- Quantize the selection-line sort key to 0.1px so float-noise baselines
  on the same row don't reorder same-row lines.
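
Two of the fixes above are small enough to sketch directly. The `Matrix` layout and both function names are hypothetical, and the sketch assumes the fast path previously only checked that the off-diagonal terms were zero; the real predicate may test more than this.

```cpp
#include <cmath>

// Hypothetical 2D affine matrix in PDF/CSS order: [a b c d e f].
struct Matrix {
  double a, b, c, d, e, f;
};

// Axis-aligned fast path: no rotation or shear terms, and a positive
// horizontal scale. A pure 180-degree rotation (a = d = -1, b = c = 0)
// has zero off-diagonal terms, so only the m.a > 0 test rejects it.
constexpr bool is_axis_aligned(const Matrix& m) {
  return m.b == 0.0 && m.c == 0.0 && m.a > 0.0;
}

// Sort key quantized to 0.1px so float noise in baselines on the same row
// maps to the same key.
inline long long row_key(double baseline_y_px) {
  return std::llround(baseline_y_px * 10.0);
}
```

Quantizing by rounding still splits two values that straddle a rounding boundary, so it reduces reordering from float noise rather than eliminating it.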

Cleanups:
- SingleRunOut::color stores the class name without a leading space.
- Collapse-check loop breaks early and drops a redundant text.font check.
- Unify class prefixes: ws = word-spacing everywhere (w = width).
- Comment escape_markup (why not html::escape_text) and the pre-pass
  double parse.

Test: emit the single-layer PdfTextMode alongside the dual-layer output
for one representative PDF under a `-single` suffix, so both text modes
are covered by reference-output diffing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
Extract the byte-identical machinery the two PdfTextMode orchestrators
duplicated into static helpers, so each mode reduces to its actual
essence (grouping policy + span emission) and the shared logic has a
single source of truth (structurally preventing drift like the earlier
line-break / 180°-rotation divergences):

- RunGeometry + run_geometry(): the per-run geometry prelude (transform,
  is_matrix, ascent, origin, extent, font sizes), consumed via a
  structured binding so the call sites keep their local names.
- color_class(): the non-black paint-colour class suffix.
- PageBox + begin_page(): page-box dimensions, the page to_box transform
  and the `.p x# y#` class string.
- intern_font(): the font accept/reject bookkeeping shared by both
  font_family lambdas (each supplies its own per-font array growth).
- write_page_items(): the `<defs>` + paint-order SVG open/close dance
  over a variant<Line, Path> item list.
- write_header_common(): the document/head prologue with a callback for
  the mode-specific CSS rules.

Output-neutral: every reference-output document.html (dual and single
layer, all engine=odr PDFs) is byte-identical before and after.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
Trim redundant inline comments that restated adjacent docstrings,
tighten the two long CSS-rationale blocks (fallback font size-adjust
and .sr/.sg justify) without losing the reasoning, and hoist the
duplicated to255 channel-clamp lambda into a shared helper.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
Comment thread src/odr/internal/html/pdf_file.cpp Outdated
Comment thread src/odr/internal/html/pdf_file.cpp Outdated
andiwand and others added 9 commits July 3, 2026 22:36
Co-authored-by: Andreas Stefl <stefl.andreas@gmail.com>
Font sizes, positions, widths, margins and spacing are a more natural
fit for pt, and the SVG viewBox already lives in PDF user-space points,
so authoring the text layer in pt drops the pt_to_px conversion
entirely. pt and px are both fixed absolute CSS units (4:3 ratio), so
rendering is unchanged.

The matrix() translation is intrinsically px, so rotated/skewed runs
carry their pt translation via a leading translate(...pt).
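
The unit relationship and the transform shape can be sketched as follows. `pt_to_px` is named in the PR; its body here, along with the `run_transform` helper and its exact output format, are assumptions for illustration. CSS defines 1in = 72pt = 96px, hence the 4:3 ratio.

```cpp
#include <sstream>
#include <string>

// 72pt and 96px both equal one CSS inch.
constexpr double pt_to_px(double pt) { return pt * 96.0 / 72.0; }

// Hypothetical helper: matrix() takes unitless translation components that
// CSS interprets as px, so the pt translation goes into a leading translate()
// and the matrix keeps only its linear part.
std::string run_transform(double a, double b, double c, double d,
                          double tx_pt, double ty_pt) {
  std::ostringstream out;
  out << "translate(" << tx_pt << "pt," << ty_pt << "pt) matrix(" << a << ','
      << b << ',' << c << ',' << d << ",0,0)";
  return out.str();
}
```

Because `translate(e,f) matrix(a,b,c,d,0,0)` composes to the same transform as `matrix(a,b,c,d,e,f)`, splitting the translation out does not change where the run lands.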

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
Drop the baseline-y stable_sort of selection lines. Sorting by y fixed
out-of-order single-column pages but interleaved multi-column layouts,
which the content stream keeps contiguous. Reading order can't be
recovered by a scalar sort key, so revert to plain stream order for now
and remove the now-dead SelLineOut.y field.
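
The interleaving problem is easy to reproduce in isolation. In this sketch (hypothetical `Line` type and function name), a two-column page arrives in stream order as left column then right column, and a stable sort on baseline y alternates between the columns.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical selection line: its text plus the baseline y used as sort key.
struct Line {
  std::string text;
  double baseline_y;
};

// Stable-sort by baseline y, as the dropped approach did, and return the
// resulting reading order.
std::vector<std::string> order_by_baseline(std::vector<Line> lines) {
  std::stable_sort(lines.begin(), lines.end(),
                   [](const Line& l, const Line& r) {
                     return l.baseline_y < r.baseline_y;
                   });
  std::vector<std::string> order;
  for (const Line& line : lines) {
    order.push_back(line.text);
  }
  return order;
}
```

Given stream order L1, L2, R1, R2 with the two columns sharing baselines, the sorted order becomes L1, R1, L2, R2, which reads across the columns instead of down each one.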

Record the proper fix (recursive XY-cut page segmentation, with a
lighter single-pass column-detection first step) in
internal/pdf/READING_ORDER.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
The visual, selection and single text layers each carried their own copy
of the same line-detection bookkeeping — an open-line index plus the
previous run's end/baseline/font/matrix — and re-implemented the same
new-line decision and previous-run update inline. That is exactly the
kind of parallel state that drifts (the reason `starts_new_line` was
already extracted).

Introduce a `LineFlow` struct holding the open-line index and previous-
run geometry, with `decide()` (new-line + margin) and `advance()`
(record predecessor). Each layer keeps its own instance — the state
footprint and downstream emission genuinely differ — but the shared gate
and update now live in one place and cannot diverge:

  - visual/single reduce to decide()/advance() almost mechanically;
  - single ORs its flow-key change onto the decision;
  - selection reuses decide()/advance() and keeps its gap test and its
    extra state (ends-space, run-start-ox, prev font-size) as locals; it
    is never close()d, preserving run contiguity across drawing ops.

No output change intended — this is a representation-preserving refactor.

Also fix a latent build break carried in the prior cleanup: the single-
layer main pass dereferenced `page` (now a reference) as `page->`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
…ering in READING_ORDER

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
Undo the shared LineFlow struct: the three text layers diverge enough
(state footprint, close semantics, downstream emission) that inlining
the previous-run bookkeeping and new-line gate reads more directly than
routing through decide()/advance(). Behavior is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
andiwand and others added 3 commits July 4, 2026 01:04
…reak space

The single-layer HTML collapse test requires a 1:1 alignment between a run's
character codes and its Unicode text (`utf8_length(text) == advances.size()`).
Space inference prepends an inferred `U+0020` to `text` without a backing code
or advance, so every run that recovers a leading word-break space failed that
test outright and was painted via PUA glyphs (generated content + embedded
font) instead of collapsing to real, natively selectable Unicode. Because
almost every word in running text follows a space break, this affected nearly
all body text: on 978-3-030-65771-0 it was 63302 of 63304 unclean runs.

Mark the inferred space explicitly (`TextElement::leading_space_inferred`, set
where `show()` injects it) and make the collapse test — and the frequency
pre-pass, which previously excluded these runs from voting on the winning glyph
— align the codes against the run text *after* that space. A collapsing run
that carries one emits the space as a zero-width selectable overlay (like the
dual layer's spacer) rather than visible text, so `white-space:pre` cannot
shift the glyphs off their placement origin while copy/search still read the
recovered space. PUA is left to its real purpose: genuine glyph/Unicode
conflicts and `no_unicode` runs.
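
The adjusted alignment check can be sketched as below. `TextElement`, `leading_space_inferred`, `utf8_length`, `text` and `advances` are named in the commit message; the struct layout, the `utf8_length` body and the `codes_align_with_text` function are assumptions for illustration.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Count code points by skipping UTF-8 continuation bytes (10xxxxxx).
std::size_t utf8_length(std::string_view s) {
  std::size_t n = 0;
  for (unsigned char byte : s) {
    if ((byte & 0xC0) != 0x80) {
      ++n;
    }
  }
  return n;
}

// Reduced, hypothetical view of a text run.
struct TextElement {
  std::string text;             // UTF-8, possibly with an inferred leading space
  std::vector<double> advances; // one advance per character code
  bool leading_space_inferred;  // set where the space was injected
};

// 1:1 alignment test: compare the codes against the text after the inferred
// space, since that space has no backing code or advance.
bool codes_align_with_text(const TextElement& e) {
  std::string_view text = e.text;
  if (e.leading_space_inferred) {
    if (text.empty() || text.front() != ' ') {
      return false;
    }
    text.remove_prefix(1);
  }
  return utf8_length(text) == e.advances.size();
}
```

A run `" word"` with four advances fails the test when the flag is ignored (five characters against four codes) and passes once the inferred space is skipped.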

Effect on that document: unclean (PUA) runs 63304 -> 2 (the two remaining are
`no_unicode`); document.html 13.3M -> 11.5M. Single-layer reference outputs
need regenerating.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG