
font: return std::string from serializers, drop istream inputs#580

Open
andiwand wants to merge 17 commits into main from font-return-string

@andiwand andiwand commented Jul 3, 2026

Summary

Orthogonal cleanup split off from the PDF text-selection work. Now that all font inputs are std::string, the stream-based outputs no longer earn their keep.

Output side: SfntFont::write(), build_sfnt(...), and cff::wrap_to_otf now return std::string.

  • Every caller already funneled the std::ostream through an ostringstream just to recover the bytes.
  • build_sfnt assembled the entire file in memory (header + directory + all table bodies) before the single write, so the ostream provided no streaming/memory benefit — it was a std::string builder wearing an ostream costume.
  • Dropping it removes the boilerplate and one buffer copy at all 8 call sites, and makes SfntFont::write symmetric with its std::string constructor and consistent with wrap_to_otf.
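The shape of the output-side change can be sketched as follows. This is a minimal illustration, not the project's code: `Table`, `build_sfnt_to_stream`, `build_sfnt_to_string`, and `old_call_site` are hypothetical names, and the "file layout" is reduced to tag plus body concatenation.

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a font table; not the project's real type.
struct Table {
  std::string tag;   // table tag
  std::string body;  // raw table bytes
};

// Old shape (assumed): the whole file is assembled in memory, then written
// to the stream in a single call, so nothing is actually streamed.
void build_sfnt_to_stream(const std::vector<Table> &tables,
                          std::ostream &out) {
  std::string bytes;
  for (const Table &t : tables) {
    bytes += t.tag;
    bytes += t.body;
  }
  out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
}

// New shape (assumed): return the assembled buffer directly.
std::string build_sfnt_to_string(const std::vector<Table> &tables) {
  std::string bytes;
  for (const Table &t : tables) {
    bytes += t.tag;
    bytes += t.body;
  }
  return bytes;
}

// What a caller of the old shape had to do just to recover the bytes.
std::string old_call_site(const std::vector<Table> &tables) {
  std::ostringstream out;
  build_sfnt_to_stream(tables, out);
  return out.str();  // extra copy out of the stream buffer
}
```

With the string-returning shape, a call site shrinks to `std::string bytes = build_sfnt_to_string(tables);`, which is the boilerplate and copy the bullets describe.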

Input side — removed the unique_ptr<istream> constructors:

  • CffFont's was dead code (all CFFs are built from std::string).
  • SfntFont's two callers now read the file stream into a string themselves via util::stream::read, leaving the std::string ctor as the single entry point.

Pruned the now-unused <ostream>/<istream>/<sstream>/<iosfwd>/<memory>/stream_util.hpp includes throughout, and updated the four tests that constructed fonts via istringstream.

Net −50 lines.

Testing

Full font/pdf test suite green (59 tests).

🤖 Generated with Claude Code

andiwand and others added 17 commits July 1, 2026 19:06
Introduces a `PdfTextMode` enum with two values:
- `dual_layer`: visual (PUA glyphs, paint order) + transparent Unicode
  selection/search layer. Default.
- `single_layer`: single combined layer with frequency-based Unicode
  mapping, similar to pdf2htmlEX.

The active mode is controlled by `HtmlConfig::pdf_text_mode`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq
Replaces the single-glyph-per-absolute-span approach with two modes,
both using line blocks (position:absolute on the line div, margin-left
on inline run spans) instead of per-glyph absolute positioning.

Dual-layer mode (default, PdfTextMode::dual_layer):
- Visual layer (<div class="vis" aria-hidden>): paint-order glyph
  rendering. Fonts re-encoded to PUA. Invisible text omitted.
- Selection layer (<div class="sel">): transparent real-Unicode text.
  Runs grouped into line blocks by baseline; space detection inserts
  gap spans. Each run span is display:inline-block with CSS justify
  (text-align:justify; text-align-last:justify; text-justify:inter-
  character) so characters fill the PDF advance without JavaScript.
- Similar approach to pdf.js.

Single-layer mode (PdfTextMode::single_layer):
- One combined layer per page in paint order.
- Pre-pass frequency analysis: counts (uchar, glyph) co-occurrences
  per font, then picks the most-frequent glyph as the cmap entry —
  so the common case wins, not first-come-first-serve.
- Clean runs (all uchar→glyph pairs match the winner) render the real
  Unicode directly in the embedded font — natively selectable.
- Unclean runs paint glyphs via ::before{content:attr(data-g)} with
  a zero-width display:inline-block overlay span for selectability.
- PUA-only chars (no Unicode mapping) remain visible but unselectable.
- Similar approach to pdf2htmlEX.
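The single-layer pre-pass and the clean-run test could be sketched roughly like this. All names (`Pair`, `pick_winners`, `is_clean`) are hypothetical, and the tie-break (lowest glyph id wins on equal counts) is an assumption of this sketch, not something the commit specifies.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Pair = std::pair<char32_t, std::uint16_t>;  // (uchar, glyph id)

// Pre-pass: count (uchar, glyph) co-occurrences for one font, then keep
// the most frequent glyph per uchar as its cmap entry.
std::map<char32_t, std::uint16_t>
pick_winners(const std::vector<Pair> &seen) {
  std::map<Pair, std::size_t> counts;
  for (const Pair &p : seen) {
    ++counts[p];
  }

  std::map<char32_t, std::uint16_t> winners;
  std::map<char32_t, std::size_t> best;
  for (const auto &[pair, n] : counts) {
    // Strict '>' keeps the first (lowest glyph id) candidate on ties.
    if (n > best[pair.first]) {
      best[pair.first] = n;
      winners[pair.first] = pair.second;
    }
  }
  return winners;
}

// A run is clean when every uchar -> glyph pair matches the winner.
bool is_clean(const std::vector<Pair> &run,
              const std::map<char32_t, std::uint16_t> &winners) {
  for (const Pair &p : run) {
    auto it = winners.find(p.first);
    if (it == winners.end() || it->second != p.second) {
      return false;
    }
  }
  return true;
}
```

Counting first and choosing afterwards is what makes the common mapping win instead of whichever pair appears first in the content stream.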

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq
Shared static methods (`px_decl`, `ascent_em`, `glyph_run_str`,
`escape_markup`) and a template `handle_graphic_element` replace the
copy-pasted lambdas in both rendering modes (-60 lines, cleaner diffs).
The single-layer `add_class` captures `styles` from scope to match the
dual-layer signature; `AtomicStyles styles` is moved up before the pre-
pass so the capture is valid.

Two dual-layer correctness fixes (from code-review):
- Add letter-spacing/word-spacing to visual runs when Tc/Tw are non-zero,
  so embedded glyphs space correctly for PDFs with custom char/word
  spacing.
- Move vis_prev_* state updates inside the `if (!invisible)` block so
  invisible/clip-mode runs do not shift the next visible run's position.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq
Adds a standalone test that translates style-various-1.pdf through both
dual_layer and single_layer modes and asserts the output document.html
contains the expected marker classes (vis+sel for dual, line-block t
for single). Prevents silent regressions if a mode is broken.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq
pt_to_px/pt_to_in, the SFNT/CFF usability probe, the fvN/fnN class
helper, the run's left/top-or-matrix placement classes, and the
post-pass font-face/style writer were each copy-pasted between the
dual-layer and single-layer paths. Hoist them into shared statics
(add_position_classes, font_is_usable, font_class, write_font_face)
used by both. Verified byte-identical document.html output for both
PdfTextMode values across several PDF fixtures before/after.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
Tight-continuation runs were merged into the previous .sr span's text
without recomputing its declared width, leaving it at the first
sub-run's width while the visible text grew arbitrarily longer (e.g.
"Particle Acceleration and Detection" declared 10px wide). Track each
open run's starting x-offset and re-derive the width on merge.
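The width fix might look roughly like the sketch below. `OpenRun` and `merge_continuation` are hypothetical names, and the sketch assumes the merged sub-run's own x-offset and width are known in the same pixel space as the run's start.

```cpp
#include <string>

// Hypothetical model of an open selection-layer run (.sr span).
struct OpenRun {
  std::string text;
  double start_x;  // x-offset where the first sub-run began, in px
  double width;    // declared CSS width, in px
};

// Merge a tight-continuation sub-run and re-derive the declared width
// from the run's starting x-offset, instead of keeping the first
// sub-run's width.
void merge_continuation(OpenRun &run, const std::string &text,
                        double sub_x, double sub_width) {
  run.text += text;
  run.width = (sub_x + sub_width) - run.start_x;
}
```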

Also propagate font-size to the selection layer (runs, gap spacers,
and the trailing space that closes a line), which previously inherited
the browser default and could overflow/clip against the PDF-derived
width, desyncing the invisible hit-test text from the true glyph run.
…rder

.sg (gap spacer) lacked the overflow:hidden that .sr (text run) has;
per CSS an inline-block's baseline is its content's text baseline when
overflow is visible but the bottom margin edge otherwise, so the two
box types baseline-aligned differently within the same line, visibly
shifting spaces in y. Give .sg the same overflow:hidden.

Also content-stream order doesn't always run top-to-bottom (margins,
columns), which made drag-selection highlight rows inconsistently.
Stable-sort each page's selection lines by baseline y after the page
is fully processed, keeping content-stream (x) order intact for lines
on the same row.
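A minimal sketch of the ordering step, assuming a hypothetical `SelLine` type and a y axis that grows downward (CSS pixels). `std::stable_sort` preserves the relative order of elements that compare equal, which is what keeps same-row lines in content-stream order.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical selection-layer line; the real type carries more state.
struct SelLine {
  double baseline_y;  // px from the top of the page (assumed)
  std::string text;
};

// Sort a page's lines top-to-bottom by baseline. Lines with equal
// baselines keep their original (content-stream) order.
void sort_selection_lines(std::vector<SelLine> &lines) {
  std::stable_sort(lines.begin(), lines.end(),
                   [](const SelLine &a, const SelLine &b) {
                     return a.baseline_y < b.baseline_y;
                   });
}
```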
…cmap

Large glyph counts exceeded the 6400-slot BMP PUA and threw. Spill the
overflow into Supplementary PUA-A and emit a format-12 cmap subtable to
cover the beyond-BMP code points, clamping OS/2 usFirst/usLastCharIndex
to 0xFFFF. Also add configurable dual-layer selection fallback fonts and
a size-adjust so the invisible selection text widths track the PDF boxes.
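The spill logic can be sketched as follows, assuming glyphs are assigned PUA code points sequentially by index (the commit does not spell out the assignment order). `pua_code_point` and `clamp_os2_char_index` are hypothetical names; the ranges are the standard Unicode ones: BMP PUA U+E000..U+F8FF (6400 code points) and Supplementary PUA-A U+F0000..U+FFFFD.

```cpp
#include <cstdint>
#include <stdexcept>

constexpr char32_t kBmpPuaFirst = 0xE000;
constexpr char32_t kBmpPuaLast = 0xF8FF;  // 6400 slots in total
constexpr char32_t kSuppPuaAFirst = 0xF0000;
constexpr char32_t kSuppPuaALast = 0xFFFFD;

// Map the n-th re-encoded glyph to a private-use code point, spilling
// into Supplementary PUA-A once the BMP PUA is exhausted.
char32_t pua_code_point(std::uint32_t index) {
  constexpr std::uint32_t bmp_slots = kBmpPuaLast - kBmpPuaFirst + 1;
  if (index < bmp_slots) {
    return kBmpPuaFirst + index;
  }
  const std::uint32_t spill = index - bmp_slots;
  if (spill > kSuppPuaALast - kSuppPuaAFirst) {
    throw std::out_of_range("glyph index exceeds available PUA slots");
  }
  return kSuppPuaAFirst + spill;
}

// The OS/2 first/last character index fields are 16-bit, so code points
// beyond the BMP are clamped to 0xFFFF.
std::uint16_t clamp_os2_char_index(char32_t code_point) {
  return code_point > 0xFFFF ? static_cast<std::uint16_t>(0xFFFF)
                             : static_cast<std::uint16_t>(code_point);
}
```

Code points above U+FFFF cannot be expressed in a format-4 cmap subtable, which is why the commit pairs the spill with a format-12 subtable.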

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
Correctness:
- Treat pure 180° rotation (a=d=-1) as a matrix transform by also
  requiring m.a > 0 for the axis-aligned fast path; previously it fed a
  negative m.a into font-size and the left/top math. Both modes.
- Guard dual-layer visual word-spacing: it is inert on PUA glyph runs
  (which never emit a literal space) and must skip composite fonts (PDF
  Tw applies only to single-byte code 32), matching single-layer.
- Measure the selection-layer line-break against the previous run's
  font size, not the current run's, so it can't drift from the visual
  and single-layer heuristics. Extracted a shared starts_new_line().
- Quantize the selection-line sort key to 0.1px so float-noise baselines
  on the same row don't reorder same-row lines.
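Two of the fixes above, the rotation guard and the quantized sort key, can be illustrated with a short sketch. `Matrix`, `is_axis_aligned`, and `selection_sort_key` are hypothetical names. The exact fast-path predicate in the code base is not shown in this commit, so the condition below (no shear terms, positive `a`) is an assumption.

```cpp
#include <cmath>

// PDF-style affine matrix [a b c d e f]; hypothetical stand-in type.
struct Matrix {
  double a, b, c, d, e, f;
};

// Assumed fast-path test: no rotation/shear terms and a positive x scale.
// A pure 180 degree rotation (a = d = -1, b = c = 0) fails the a > 0
// check and is therefore handled as a general matrix transform.
bool is_axis_aligned(const Matrix &m) {
  return m.b == 0.0 && m.c == 0.0 && m.a > 0.0;
}

// Quantize a baseline to 0.1px so float noise on the same row produces
// the same sort key.
long long selection_sort_key(double baseline_y_px) {
  return std::llround(baseline_y_px * 10.0);
}
```

Used as the comparison key of a stable sort, equal quantized keys leave same-row lines in their original order.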

Cleanups:
- SingleRunOut::color stores the class name without a leading space.
- Collapse-check loop breaks early and drops a redundant text.font check.
- Unify class prefixes: ws = word-spacing everywhere (w = width).
- Comment escape_markup (why not html::escape_text) and the pre-pass
  double parse.

Test: emit the single-layer PdfTextMode alongside the dual-layer output
for one representative PDF under a `-single` suffix, so both text modes
are covered by reference-output diffing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
Extract the byte-identical machinery the two PdfTextMode orchestrators
duplicated into static helpers, so each mode reduces to its actual
essence (grouping policy + span emission) and the shared logic has a
single source of truth (structurally preventing drift like the earlier
line-break / 180°-rotation divergences):

- RunGeometry + run_geometry(): the per-run geometry prelude (transform,
  is_matrix, ascent, origin, extent, font sizes), consumed via a
  structured binding so the call sites keep their local names.
- color_class(): the non-black paint-colour class suffix.
- PageBox + begin_page(): page-box dimensions, the page to_box transform
  and the `.p x# y#` class string.
- intern_font(): the font accept/reject bookkeeping shared by both
  font_family lambdas (each supplies its own per-font array growth).
- write_page_items(): the `<defs>` + paint-order SVG open/close dance
  over a variant<Line, Path> item list.
- write_header_common(): the document/head prologue with a callback for
  the mode-specific CSS rules.

Output-neutral: every reference-output document.html (dual and single
layer, all engine=odr PDFs) is byte-identical before and after.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
Trim redundant inline comments that restated adjacent docstrings,
tighten the two long CSS-rationale blocks (fallback font size-adjust
and .sr/.sg justify) without losing the reasoning, and hoist the
duplicated to255 channel-clamp lambda into a shared helper.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
Every caller of SfntFont::write / build_sfnt / cff::wrap_to_otf
immediately funneled the ostream through an ostringstream to recover
the bytes, and build_sfnt already assembled the whole file in memory
before the single write — so the ostream bought nothing. Return
std::string directly and drop the boilerplate (and one buffer copy)
at every call site.

Symmetrically, now that construction takes std::string, remove the
unique_ptr<istream> constructors: CffFont's was dead, and SfntFont's
two callers can read the file stream into a string themselves via
util::stream::read. Prune the now-unused stream includes throughout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG