font: return std::string from serializers, drop istream inputs #580
Open
andiwand wants to merge 17 commits into
Introduces a `PdfTextMode` enum with two values:
- `dual_layer`: visual (PUA glyphs, paint order) + transparent Unicode selection/search layer. Default.
- `single_layer`: single combined layer with frequency-based Unicode mapping, similar to pdf2htmlEX.
The active mode is controlled by `HtmlConfig::pdf_text_mode`.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq
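The shape of that switch can be sketched as follows. This is a minimal illustration, assuming a scoped enum and a config struct with a single field; the real `HtmlConfig` has other members, and `to_string` is a hypothetical helper that is not part of the commit.

```cpp
#include <string_view>

// Assumed shape of the enum described above; the actual declaration in the
// project headers may differ (for example in underlying type or ordering).
enum class PdfTextMode {
  dual_layer,   // visual PUA layer plus transparent Unicode selection layer
  single_layer, // one combined layer with frequency-based Unicode mapping
};

// Hypothetical, trimmed-down config holding only the field named above.
struct HtmlConfig {
  PdfTextMode pdf_text_mode = PdfTextMode::dual_layer; // dual_layer is the default
};

// Illustrative helper (not from the PR): a printable name for logs or tests.
constexpr std::string_view to_string(PdfTextMode mode) {
  switch (mode) {
  case PdfTextMode::dual_layer:
    return "dual_layer";
  case PdfTextMode::single_layer:
    return "single_layer";
  }
  return "unknown";
}
```

A caller would opt into the pdf2htmlEX-like path by setting `config.pdf_text_mode = PdfTextMode::single_layer` before translating.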
Replaces the single-glyph-per-absolute-span approach with two modes,
both using line blocks (position:absolute on the line div, margin-left
on inline run spans) instead of per-glyph absolute positioning.
Dual-layer mode (default, PdfTextMode::dual_layer):
- Visual layer (<div class="vis" aria-hidden>): paint-order glyph
rendering. Fonts re-encoded to PUA. Invisible text omitted.
- Selection layer (<div class="sel">): transparent real-Unicode text.
Runs grouped into line blocks by baseline; space detection inserts
gap spans. Each run span is display:inline-block with CSS justify
(text-align:justify; text-align-last:justify; text-justify:inter-
character) so characters fill the PDF advance without JavaScript.
- Similar approach to pdf.js.
Single-layer mode (PdfTextMode::single_layer):
- One combined layer per page in paint order.
- Pre-pass frequency analysis: counts (uchar, glyph) co-occurrences
per font, then picks the most-frequent glyph as the cmap entry —
so the common case wins, not first-come-first-serve.
- Clean runs (all uchar→glyph pairs match the winner) render the real
Unicode directly in the embedded font — natively selectable.
- Unclean runs paint glyphs via ::before{content:attr(data-g)} with
a zero-width display:inline-block overlay span for selectability.
- PUA-only chars (no Unicode mapping) remain visible but unselectable.
- Similar approach to pdf2htmlEX.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq
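The single-layer pre-pass can be sketched roughly like this. It is a simplified illustration under stated assumptions: the names `GlyphPair`, `pick_winners`, and `is_clean_run` are hypothetical, glyph ids are taken to be 16-bit, and ties are broken toward the lower glyph id, which the commit message does not specify.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// One observed (Unicode char, glyph id) co-occurrence within a single font.
using GlyphPair = std::pair<char32_t, std::uint16_t>;

// Count every observed pair, then keep the most frequent glyph per character.
std::map<char32_t, std::uint16_t> pick_winners(const std::vector<GlyphPair>& seen) {
  std::map<GlyphPair, std::size_t> counts;
  for (const GlyphPair& p : seen) {
    ++counts[p];
  }
  std::map<char32_t, std::uint16_t> winners;
  std::map<char32_t, std::size_t> best_count;
  for (const auto& [key, n] : counts) {
    const auto& [uchar, glyph] = key;
    // Strict '>' keeps the first (lowest) glyph id on ties; an assumption.
    if (n > best_count[uchar]) {
      best_count[uchar] = n;
      winners[uchar] = glyph;
    }
  }
  return winners;
}

// A run is "clean" when every pair in it matches the winning cmap entry.
bool is_clean_run(const std::vector<GlyphPair>& run,
                  const std::map<char32_t, std::uint16_t>& winners) {
  for (const auto& [uchar, glyph] : run) {
    const auto it = winners.find(uchar);
    if (it == winners.end() || it->second != glyph) {
      return false;
    }
  }
  return true;
}
```

With this split, clean runs can be emitted as plain Unicode text in the embedded font, and only the runs that fail `is_clean_run` need the `data-g` overlay described above.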
Shared static methods (`px_decl`, `ascent_em`, `glyph_run_str`, `escape_markup`) and a template `handle_graphic_element` replace the copy-pasted lambdas in both rendering modes (-60 lines, cleaner diffs).
The single-layer `add_class` captures `styles` from scope to match the dual-layer signature; `AtomicStyles styles` is moved up before the pre-pass so the capture is valid.
Two dual-layer correctness fixes (from code-review):
- Add letter-spacing/word-spacing to visual runs when Tc/Tw are non-zero, so embedded glyphs space correctly for PDFs with custom char/word spacing.
- Move vis_prev_* state updates inside the `if (!invisible)` block so invisible/clip-mode runs do not shift the next visible run's position.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq
Adds a standalone test that translates style-various-1.pdf through both dual_layer and single_layer modes and asserts the output document.html contains the expected marker classes (vis+sel for dual, line-block t for single). Prevents silent regressions if a mode is broken.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq
pt_to_px/pt_to_in, the SFNT/CFF usability probe, the fvN/fnN class helper, the run's left/top-or-matrix placement classes, and the post-pass font-face/style writer were each copy-pasted between the dual-layer and single-layer paths. Hoist them into shared statics (add_position_classes, font_is_usable, font_class, write_font_face) used by both.
Verified byte-identical document.html output for both PdfTextMode values across several PDF fixtures before/after.
Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
Tight-continuation runs were merged into the previous .sr span's text without recomputing its declared width, leaving it at the first sub-run's width while the visible text grew arbitrarily longer (e.g. "Particle Acceleration and Detection" declared 10px wide). Track each open run's starting x-offset and re-derive the width on merge. Also propagate font-size to the selection layer (runs, gap spacers, and the trailing space that closes a line), which previously inherited the browser default and could overflow/clip against the PDF-derived width, desyncing the invisible hit-test text from the true glyph run.
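The width fix can be illustrated with a small sketch. The struct and function names here are hypothetical, and the real code tracks more state per run; the point is only that the declared width is recomputed from the run's starting x-offset each time a continuation is merged.

```cpp
#include <string>

// Hypothetical record for a selection-layer run that is still open for merging.
struct OpenRun {
  double start_x; // x-offset where the first sub-run began (px)
  double width;   // declared CSS width of the span (px)
  std::string text;
};

// Merge a tight-continuation sub-run and re-derive the declared width so it
// spans from the run's start to the end of the newly appended sub-run.
void merge_continuation(OpenRun& run, const std::string& sub_text,
                        double sub_x, double sub_width) {
  run.text += sub_text;
  run.width = (sub_x + sub_width) - run.start_x;
}
```

Without the second assignment the span would keep the first sub-run's width, which is the failure the commit describes.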
…rder
.sg (gap spacer) lacked the overflow:hidden that .sr (text run) has; per CSS an inline-block's baseline is its content's text baseline when overflow is visible but the bottom margin edge otherwise, so the two box types baseline-aligned differently within the same line, visibly shifting spaces in y. Give .sg the same overflow:hidden.
Also content-stream order doesn't always run top-to-bottom (margins, columns), which made drag-selection highlight rows inconsistently. Stable-sort each page's selection lines by baseline y after the page is fully processed, keeping content-stream (x) order intact for lines on the same row.
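The reordering step might look like the following sketch. `SelLine` is a hypothetical stand-in for the selection-line record, and the sketch assumes that a smaller `baseline_y` means higher on the page.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical selection-line record; the real one carries runs, not a string.
struct SelLine {
  double baseline_y; // baseline position in page coordinates (px)
  std::string text;
};

// Order lines top-to-bottom. std::stable_sort preserves the relative order of
// lines that compare equal, so same-row lines stay in content-stream order.
void sort_lines_by_baseline(std::vector<SelLine>& lines) {
  std::stable_sort(lines.begin(), lines.end(),
                   [](const SelLine& lhs, const SelLine& rhs) {
                     return lhs.baseline_y < rhs.baseline_y;
                   });
}
```

A plain `std::sort` would not guarantee that two lines on the same row keep their original order, which is why stability matters here.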
…cmap
Large glyph counts exceeded the 6400-slot BMP PUA and threw. Spill the overflow into Supplementary PUA-A and emit a format-12 cmap subtable to cover the beyond-BMP code points, clamping OS/2 usFirst/usLastCharIndex to 0xFFFF.
Also add configurable dual-layer selection fallback fonts and a size-adjust so the invisible selection text widths track the PDF boxes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
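The spill logic can be sketched as below, assuming glyph slots are assigned sequentially (the commit does not say how slots are ordered). The ranges are the standard Unicode ones: the BMP Private Use Area is U+E000 to U+F8FF (6400 code points) and Supplementary Private Use Area-A is U+F0000 to U+FFFFD. All function names are hypothetical.

```cpp
#include <cstdint>
#include <stdexcept>

constexpr std::uint32_t kBmpPuaFirst = 0xE000;
constexpr std::uint32_t kBmpPuaCount = 0xF8FF - 0xE000 + 1; // 6400 slots
constexpr std::uint32_t kPuaAFirst = 0xF0000;
constexpr std::uint32_t kPuaACount = 0xFFFFD - 0xF0000 + 1; // 65534 slots

// Map a zero-based glyph slot to a private-use code point, filling the BMP
// PUA first and spilling the overflow into Supplementary PUA-A.
std::uint32_t pua_code_point(std::uint32_t slot) {
  if (slot < kBmpPuaCount) {
    return kBmpPuaFirst + slot;
  }
  slot -= kBmpPuaCount;
  if (slot < kPuaACount) {
    return kPuaAFirst + slot;
  }
  throw std::out_of_range("glyph slot exceeds BMP PUA plus Supplementary PUA-A");
}

// Code points above U+FFFF cannot be expressed in a format-4 cmap subtable,
// so a format-12 subtable is needed once the BMP PUA is exhausted.
bool needs_format12(std::uint32_t slot_count) {
  return slot_count > kBmpPuaCount;
}

// OS/2 usFirstCharIndex / usLastCharIndex are 16-bit fields; values beyond
// the BMP are clamped to 0xFFFF.
std::uint16_t clamp_os2_char_index(std::uint32_t code_point) {
  return static_cast<std::uint16_t>(code_point > 0xFFFF ? 0xFFFF : code_point);
}
```

Under these assumptions slot 6399 is the last BMP code point (U+F8FF) and slot 6400 is the first supplementary one (U+F0000).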
Correctness:
- Treat pure 180° rotation (a=d=-1) as a matrix transform by also requiring m.a > 0 for the axis-aligned fast path; previously it fed a negative m.a into font-size and the left/top math. Both modes.
- Guard dual-layer visual word-spacing: it is inert on PUA glyph runs (which never emit a literal space) and must skip composite fonts (PDF Tw applies only to single-byte code 32), matching single-layer.
- Measure the selection-layer line-break against the previous run's font size, not the current run's, so it can't drift from the visual and single-layer heuristics. Extracted a shared starts_new_line().
- Quantize the selection-line sort key to 0.1px so float-noise baselines on the same row don't reorder same-row lines.
Cleanups:
- SingleRunOut::color stores the class name without a leading space.
- Collapse-check loop breaks early and drops a redundant text.font check.
- Unify class prefixes: ws = word-spacing everywhere (w = width).
- Comment escape_markup (why not html::escape_text) and the pre-pass double parse.
Test: emit the single-layer PdfTextMode alongside the dual-layer output for one representative PDF under a `-single` suffix, so both text modes are covered by reference-output diffing.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
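Two of the correctness fixes lend themselves to a short sketch. Both functions are hypothetical: the real fast-path test may check more than shown here (this version assumes it only requires no rotation or shear plus the new `a > 0` guard), and the quantization is assumed to be plain rounding to tenths of a pixel.

```cpp
#include <cmath>

// PDF-style affine matrix [a b c d e f]; a hypothetical stand-in type.
struct Matrix {
  double a, b, c, d, e, f;
};

// Axis-aligned fast path: no rotation or shear, and a positive x scale.
// A pure 180° rotation (a = d = -1, b = c = 0) fails the a > 0 guard and
// therefore falls through to the general matrix-transform path.
bool is_axis_aligned_fast_path(const Matrix& m) {
  return m.b == 0.0 && m.c == 0.0 && m.a > 0.0;
}

// Quantize a baseline to 0.1px so that float noise between runs on the same
// row collapses to one integer sort key.
long long baseline_sort_key(double baseline_y_px) {
  return std::llround(baseline_y_px * 10.0);
}
```

Used as the comparison key of a stable sort, `baseline_sort_key` leaves lines whose baselines differ only by float noise in content-stream order, while lines a visible distance apart still sort top-to-bottom.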
Extract the byte-identical machinery the two PdfTextMode orchestrators duplicated into static helpers, so each mode reduces to its actual essence (grouping policy + span emission) and the shared logic has a single source of truth (structurally preventing drift like the earlier line-break / 180°-rotation divergences):
- RunGeometry + run_geometry(): the per-run geometry prelude (transform, is_matrix, ascent, origin, extent, font sizes), consumed via a structured binding so the call sites keep their local names.
- color_class(): the non-black paint-colour class suffix.
- PageBox + begin_page(): page-box dimensions, the page to_box transform and the `.p x# y#` class string.
- intern_font(): the font accept/reject bookkeeping shared by both font_family lambdas (each supplies its own per-font array growth).
- write_page_items(): the `<defs>` + paint-order SVG open/close dance over a variant<Line, Path> item list.
- write_header_common(): the document/head prologue with a callback for the mode-specific CSS rules.
Output-neutral: every reference-output document.html (dual and single layer, all engine=odr PDFs) is byte-identical before and after.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
Trim redundant inline comments that restated adjacent docstrings, tighten the two long CSS-rationale blocks (fallback font size-adjust and .sr/.sg justify) without losing the reasoning, and hoist the duplicated to255 channel-clamp lambda into a shared helper.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
Every caller of SfntFont::write / build_sfnt / cff::wrap_to_otf immediately funneled the ostream through an ostringstream to recover the bytes, and build_sfnt already assembled the whole file in memory before the single write, so the ostream bought nothing. Return std::string directly and drop the boilerplate (and one buffer copy) at every call site.
Symmetrically, now that construction takes std::string, remove the unique_ptr<istream> constructors: CffFont's was dead, and SfntFont's two callers can read the file stream into a string themselves via util::stream::read. Prune the now-unused stream includes throughout.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
Summary
Orthogonal cleanup split off from the PDF text-selection work. Now that all font inputs are `std::string`, the stream-based outputs no longer earn their keep.
Output side: `SfntFont::write()`, `build_sfnt(...)`, and `cff::wrap_to_otf` now return `std::string`:
- Every caller funneled the `std::ostream` through an `ostringstream` just to recover the bytes.
- `build_sfnt` assembled the entire file in memory (header + directory + all table bodies) before the single write, so the ostream provided no streaming/memory benefit; it was a `std::string` builder wearing an ostream costume.
- The change makes `SfntFont::write` symmetric with its `std::string` constructor and consistent with `wrap_to_otf`.
Input side: removed the `unique_ptr<istream>` constructors:
- `CffFont`'s was dead code (all CFFs are built from `std::string`).
- `SfntFont`'s two callers now read the file stream into a string themselves via `util::stream::read`, leaving the `std::string` ctor as the single entry point.
Pruned the now-unused `<ostream>`/`<istream>`/`<sstream>`/`<iosfwd>`/`<memory>`/`stream_util.hpp` includes throughout, and updated the four tests that constructed fonts via `istringstream`.
Net −50 lines.
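The before/after shape of the API can be illustrated with a toy serializer. Nothing below is the project's code: `build_blob`, `write_blob_legacy`, `old_call_site`, and `read_all` are hypothetical names, and `read_all` only stands in for what `util::stream::read` is described as doing.

```cpp
#include <cstdint>
#include <istream>
#include <iterator>
#include <ostream>
#include <sstream>
#include <string>

// Append a 32-bit value in big-endian byte order, as SFNT tables do.
void append_be32(std::string& out, std::uint32_t value) {
  for (int shift = 24; shift >= 0; shift -= 8) {
    out.push_back(static_cast<char>((value >> shift) & 0xFF));
  }
}

// New shape: assemble the bytes in memory and return them directly.
std::string build_blob(const std::string& tag, const std::string& body) {
  std::string out;
  out.reserve(tag.size() + 4 + body.size());
  out += tag;
  append_be32(out, static_cast<std::uint32_t>(body.size()));
  out += body;
  return out;
}

// Old shape: the same in-memory assembly followed by a single stream write,
// so the ostream parameter adds no streaming benefit.
void write_blob_legacy(std::ostream& out, const std::string& tag,
                       const std::string& body) {
  const std::string bytes = build_blob(tag, body);
  out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
}

// The boilerplate every old call site needed just to get the bytes back.
std::string old_call_site(const std::string& tag, const std::string& body) {
  std::ostringstream stream;
  write_blob_legacy(stream, tag, body);
  return stream.str();
}

// Input side: callers that still hold a stream read it into a string first.
std::string read_all(std::istream& in) {
  return std::string(std::istreambuf_iterator<char>(in),
                     std::istreambuf_iterator<char>());
}
```

Both paths produce identical bytes; the string-returning version simply skips the `ostringstream` round trip and the extra buffer copy that comes with it.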
Testing
Full font/pdf test suite green (59 tests).
🤖 Generated with Claude Code