font: return std::string from serializers, drop istream inputs by andiwand · Pull Request #580 · opendocument-app/OpenDocument.core

andiwand · 2026-07-03T20:39:09Z

Summary

Orthogonal cleanup split off from the PDF text-selection work. Now that all font inputs are std::string, the stream-based outputs no longer earn their keep.

Output side — SfntFont::write(), build_sfnt(...), and cff::wrap_to_otf now return std::string:

- Every caller already funneled the std::ostream through an ostringstream just to recover the bytes.
- build_sfnt assembled the entire file in memory (header + directory + all table bodies) before the single write, so the ostream provided no streaming or memory benefit; it was a std::string builder wearing an ostream costume.
- Dropping it removes the boilerplate and one buffer copy at all 8 call sites, and makes SfntFont::write symmetric with its std::string constructor and consistent with wrap_to_otf.
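As a rough illustration of the output-side change, here is a minimal sketch of a serializer that assembles everything in memory and returns the bytes directly. The `Table` struct, `put_u32`, and `build_container` are hypothetical names, and the layout is a toy container (count + directory + bodies), not the real SFNT format or the actual `build_sfnt` signature.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified table record (not the real SfntFont layout).
struct Table {
  std::string tag;  // table tag such as "cmap", padded/truncated to 4 bytes
  std::string body; // raw table bytes
};

// Append a big-endian 32-bit value to the output buffer.
inline void put_u32(std::string &out, std::uint32_t v) {
  out.push_back(static_cast<char>((v >> 24) & 0xFF));
  out.push_back(static_cast<char>((v >> 16) & 0xFF));
  out.push_back(static_cast<char>((v >> 8) & 0xFF));
  out.push_back(static_cast<char>(v & 0xFF));
}

// Toy container: [count][tag, offset, length]...[bodies].
// The whole file is built in memory, so returning std::string costs
// nothing compared to writing into a std::ostream at the very end.
std::string build_container(const std::vector<Table> &tables) {
  std::string out;
  put_u32(out, static_cast<std::uint32_t>(tables.size()));
  auto offset = static_cast<std::uint32_t>(4 + 12 * tables.size());
  for (const Table &t : tables) {
    std::string tag = t.tag;
    tag.resize(4, ' '); // force exactly 4 tag bytes
    out += tag;
    put_u32(out, offset);
    put_u32(out, static_cast<std::uint32_t>(t.body.size()));
    offset += static_cast<std::uint32_t>(t.body.size());
  }
  for (const Table &t : tables) {
    out += t.body;
  }
  return out; // moved or elided; no ostringstream round trip
}
```

A caller then writes `std::string bytes = build_container(tables);` instead of creating an `std::ostringstream`, passing it in, and calling `.str()` afterwards.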

Input side — removed the unique_ptr<istream> constructors:

- CffFont's was dead code (all CFFs are built from std::string).
- SfntFont's two callers now read the file stream into a string themselves via util::stream::read, leaving the std::string ctor as the single entry point.
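The input-side pattern can be sketched as follows. `read_all` is a hypothetical stand-in for `util::stream::read` (whose exact signature is not shown in this PR), and `Font` is a placeholder for a class with a single `std::string` constructor.

```cpp
#include <cstddef>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical stand-in for util::stream::read: drain a stream into a string.
std::string read_all(std::istream &in) {
  return std::string(std::istreambuf_iterator<char>(in),
                     std::istreambuf_iterator<char>());
}

// Placeholder font type whose only entry point takes the raw bytes.
class Font {
public:
  explicit Font(std::string data) : m_data(std::move(data)) {}
  std::size_t size() const { return m_data.size(); }

private:
  std::string m_data;
};

// The caller owns the stream handling; the font class never sees a stream.
Font load_font(std::istream &in) { return Font(read_all(in)); }
```

Keeping stream handling at the call site means the font classes need only one constructor and no `<istream>` or `<memory>` includes.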

Pruned the now-unused <ostream>/<istream>/<sstream>/<iosfwd>/<memory>/stream_util.hpp includes throughout, and updated the four tests that constructed fonts via istringstream.

Net −50 lines.

Testing

Full font/pdf test suite green (59 tests).

🤖 Generated with Claude Code

Introduces a `PdfTextMode` enum with two values:

- `dual_layer`: visual (PUA glyphs, paint order) + transparent Unicode selection/search layer. Default.
- `single_layer`: single combined layer with frequency-based Unicode mapping, similar to pdf2htmlEX.

The active mode is controlled by `HtmlConfig::pdf_text_mode`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq
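A minimal sketch of what such a configuration switch might look like. Whether the real enum is scoped (`enum class`) and how `HtmlConfig` declares its default are assumptions; only the names `PdfTextMode`, `dual_layer`, `single_layer`, and `pdf_text_mode` come from the commit message, and `describe` is a hypothetical helper.

```cpp
#include <string>

// Assumed to be a scoped enum; the commit only names the two values.
enum class PdfTextMode { dual_layer, single_layer };

// Reduced stand-in for HtmlConfig with the documented default.
struct HtmlConfig {
  PdfTextMode pdf_text_mode = PdfTextMode::dual_layer;
};

// Hypothetical helper that summarizes what each mode emits.
std::string describe(PdfTextMode mode) {
  switch (mode) {
  case PdfTextMode::dual_layer:
    return "visual PUA layer + transparent Unicode selection layer";
  case PdfTextMode::single_layer:
    return "single combined layer with frequency-based Unicode mapping";
  }
  return "unknown";
}
```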

Replaces the single-glyph-per-absolute-span approach with two modes, both using line blocks (position:absolute on the line div, margin-left on inline run spans) instead of per-glyph absolute positioning.

Dual-layer mode (default, PdfTextMode::dual_layer):

- Visual layer (<div class="vis" aria-hidden>): paint-order glyph rendering. Fonts re-encoded to PUA. Invisible text omitted.
- Selection layer (<div class="sel">): transparent real-Unicode text. Runs grouped into line blocks by baseline; space detection inserts gap spans. Each run span is display:inline-block with CSS justify (text-align:justify; text-align-last:justify; text-justify:inter-character) so characters fill the PDF advance without JavaScript.
- Similar approach to pdf.js.

Single-layer mode (PdfTextMode::single_layer):

- One combined layer per page in paint order.
- Pre-pass frequency analysis: counts (uchar, glyph) co-occurrences per font, then picks the most-frequent glyph as the cmap entry, so the common case wins, not first-come-first-serve.
- Clean runs (all uchar→glyph pairs match the winner) render the real Unicode directly in the embedded font, so they are natively selectable.
- Unclean runs paint glyphs via ::before{content:attr(data-g)} with a zero-width display:inline-block overlay span for selectability.
- PUA-only chars (no Unicode mapping) remain visible but unselectable.
- Similar approach to pdf2htmlEX.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq
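The single-layer pre-pass can be illustrated with a small sketch: count (uchar, glyph) co-occurrences for one font, pick the most frequent glyph per Unicode character, and classify a run as clean when every pair matches its winner. The names `pick_winners` and `is_clean`, the 16-bit glyph id type, and the tie-break (lowest glyph id wins) are assumptions made for illustration, not the PR's actual code.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Pair = std::pair<char32_t, std::uint16_t>; // (Unicode char, glyph id)
using Winners = std::map<char32_t, std::uint16_t>;

// Count co-occurrences, then keep the most frequent glyph per character.
// Ties resolve to the lowest glyph id because std::map iterates in order
// and the comparison below is strict.
Winners pick_winners(const std::vector<Pair> &occurrences) {
  std::map<Pair, std::size_t> counts;
  for (const Pair &p : occurrences) {
    ++counts[p];
  }
  Winners winner;
  std::map<char32_t, std::size_t> best;
  for (const auto &[pair, n] : counts) {
    if (n > best[pair.first]) {
      best[pair.first] = n;
      winner[pair.first] = pair.second;
    }
  }
  return winner;
}

// A run is "clean" if every (uchar, glyph) pair agrees with the winner,
// so the real Unicode text can be emitted directly in the embedded font.
bool is_clean(const std::vector<Pair> &run, const Winners &winner) {
  for (const auto &[uchar, glyph] : run) {
    auto it = winner.find(uchar);
    if (it == winner.end() || it->second != glyph) {
      return false;
    }
  }
  return true;
}
```

With occurrences `(A,5), (A,7), (A,7)`, glyph 7 becomes the cmap entry for `A`; a run that uses glyph 5 for `A` is unclean and would need the overlay treatment described above.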

Shared static methods (`px_decl`, `ascent_em`, `glyph_run_str`, `escape_markup`) and a template `handle_graphic_element` replace the copy-pasted lambdas in both rendering modes (-60 lines, cleaner diffs). The single-layer `add_class` captures `styles` from scope to match the dual-layer signature; `AtomicStyles styles` is moved up before the pre-pass so the capture is valid.

Two dual-layer correctness fixes (from code-review):

- Add letter-spacing/word-spacing to visual runs when Tc/Tw are non-zero, so embedded glyphs space correctly for PDFs with custom char/word spacing.
- Move vis_prev_* state updates inside the `if (!invisible)` block so invisible/clip-mode runs do not shift the next visible run's position.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq

Adds a standalone test that translates style-various-1.pdf through both dual_layer and single_layer modes and asserts the output document.html contains the expected marker classes (vis+sel for dual, line-block t for single). Prevents silent regressions if a mode is broken.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq

pt_to_px/pt_to_in, the SFNT/CFF usability probe, the fvN/fnN class helper, the run's left/top-or-matrix placement classes, and the post-pass font-face/style writer were each copy-pasted between the dual-layer and single-layer paths. Hoist them into shared statics (add_position_classes, font_is_usable, font_class, write_font_face) used by both.

Verified byte-identical document.html output for both PdfTextMode values across several PDF fixtures before/after.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>

Tight-continuation runs were merged into the previous .sr span's text without recomputing its declared width, leaving it at the first sub-run's width while the visible text grew arbitrarily longer (e.g. "Particle Acceleration and Detection" declared 10px wide). Track each open run's starting x-offset and re-derive the width on merge. Also propagate font-size to the selection layer (runs, gap spacers, and the trailing space that closes a line), which previously inherited the browser default and could overflow/clip against the PDF-derived width, desyncing the invisible hit-test text from the true glyph run.
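The width fix amounts to remembering where the open run started and recomputing its extent each time a continuation is merged. A minimal sketch, assuming a hypothetical `OpenRun` record and `merge_continuation` helper with all coordinates in px:

```cpp
#include <string>

// Hypothetical record for a selection-layer run that is still open.
struct OpenRun {
  double start_x;   // x-offset where the first sub-run began
  double width;     // declared CSS width of the span
  std::string text; // accumulated text of all merged sub-runs
};

// Merge a tight-continuation sub-run into the open run and re-derive the
// declared width from the run's start to the end of the new sub-run,
// instead of leaving it at the first sub-run's width.
void merge_continuation(OpenRun &run, double sub_x, double sub_width,
                        const std::string &sub_text) {
  run.text += sub_text;
  run.width = (sub_x + sub_width) - run.start_x;
}
```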

…rder

.sg (gap spacer) lacked the overflow:hidden that .sr (text run) has; per CSS an inline-block's baseline is its content's text baseline when overflow is visible but the bottom margin edge otherwise, so the two box types baseline-aligned differently within the same line, visibly shifting spaces in y. Give .sg the same overflow:hidden.

Also, content-stream order doesn't always run top-to-bottom (margins, columns), which made drag-selection highlight rows inconsistently. Stable-sort each page's selection lines by baseline y after the page is fully processed, keeping content-stream (x) order intact for lines on the same row.
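The reordering step relies on `std::stable_sort` preserving the relative order of elements that compare equal. A small sketch, where `SelLine` and `sort_selection_lines` are hypothetical names and smaller `baseline_y` is assumed to mean higher on the page:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical selection-layer line: its baseline position and markup.
struct SelLine {
  double baseline_y; // assumed: px from the top of the page
  std::string html;
};

// Sort lines top-to-bottom. Lines with the same baseline keep their
// original (content-stream) order because the sort is stable.
void sort_selection_lines(std::vector<SelLine> &lines) {
  std::stable_sort(lines.begin(), lines.end(),
                   [](const SelLine &lhs, const SelLine &rhs) {
                     return lhs.baseline_y < rhs.baseline_y;
                   });
}
```

A plain `std::sort` would not guarantee that two runs on the same row stay in content-stream order, which is why the stable variant matters here.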

…cmap

Large glyph counts exceeded the 6400-slot BMP PUA and threw. Spill the overflow into Supplementary PUA-A and emit a format-12 cmap subtable to cover the beyond-BMP code points, clamping OS/2 usFirst/usLastCharIndex to 0xFFFF.

Also add configurable dual-layer selection fallback fonts and a size-adjust so the invisible selection text widths track the PDF boxes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
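The spill logic can be sketched as a mapping from a running glyph slot to a code point. The ranges are the standard Unicode ones (BMP PUA U+E000..U+F8FF, 6400 code points; Supplementary PUA-A U+F0000..U+FFFFD); the function names and the choice to return `std::nullopt` when both ranges are exhausted are assumptions for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>

constexpr char32_t bmp_pua_first = 0xE000;
constexpr std::size_t bmp_pua_size = 0xF8FF - 0xE000 + 1; // 6400 slots
constexpr char32_t spua_a_first = 0xF0000;
constexpr std::size_t spua_a_size = 0xFFFFD - 0xF0000 + 1; // 65534 slots

// Map the n-th re-encoded glyph to a private-use code point, filling the
// BMP PUA first and spilling into Supplementary PUA-A afterwards.
std::optional<char32_t> pua_code_point(std::size_t slot) {
  if (slot < bmp_pua_size) {
    return static_cast<char32_t>(bmp_pua_first + slot);
  }
  slot -= bmp_pua_size;
  if (slot < spua_a_size) {
    return static_cast<char32_t>(spua_a_first + slot);
  }
  return std::nullopt; // both ranges exhausted
}

// OS/2 usFirstCharIndex/usLastCharIndex are 16-bit fields, so code points
// beyond the BMP are clamped to 0xFFFF.
std::uint16_t clamp_os2_char_index(char32_t code_point) {
  return code_point > 0xFFFF ? std::uint16_t{0xFFFF}
                             : static_cast<std::uint16_t>(code_point);
}
```

Code points above U+FFFF cannot be expressed in a format-4 cmap subtable, which is why the commit pairs the spill with a format-12 subtable.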

Correctness:

- Treat pure 180° rotation (a=d=-1) as a matrix transform by also requiring m.a > 0 for the axis-aligned fast path; previously it fed a negative m.a into font-size and the left/top math. Both modes.
- Guard dual-layer visual word-spacing: it is inert on PUA glyph runs (which never emit a literal space) and must skip composite fonts (PDF Tw applies only to single-byte code 32), matching single-layer.
- Measure the selection-layer line-break against the previous run's font size, not the current run's, so it can't drift from the visual and single-layer heuristics. Extracted a shared starts_new_line().
- Quantize the selection-line sort key to 0.1px so float-noise baselines on the same row don't reorder same-row lines.

Cleanups:

- SingleRunOut::color stores the class name without a leading space.
- Collapse-check loop breaks early and drops a redundant text.font check.
- Unify class prefixes: ws = word-spacing everywhere (w = width).
- Comment escape_markup (why not html::escape_text) and the pre-pass double parse.

Test: emit the single-layer PdfTextMode alongside the dual-layer output for one representative PDF under a `-single` suffix, so both text modes are covered by reference-output diffing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
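To make the rotation fix concrete, here is a hedged sketch of an axis-aligned fast-path check. The `Matrix` struct, the function name, and the epsilon tolerance are hypothetical; the only detail taken from the commit is that the fast path must additionally require `m.a > 0`, so a pure 180° rotation (a = d = -1, b = c = 0) falls through to the general matrix path.

```cpp
#include <cmath>

// Hypothetical 2D affine matrix in PDF order [a b c d e f].
struct Matrix {
  double a, b, c, d, e, f;
};

// Fast path: no shear or rotation terms AND a positive x scale.
// Without the m.a > 0 check, a 180-degree rotation would pass (b and c
// are zero) and a negative m.a would leak into font-size and left/top.
bool is_axis_aligned_fast_path(const Matrix &m) {
  constexpr double eps = 1e-6; // assumed tolerance
  return std::abs(m.b) < eps && std::abs(m.c) < eps && m.a > 0;
}
```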

Extract the byte-identical machinery the two PdfTextMode orchestrators duplicated into static helpers, so each mode reduces to its actual essence (grouping policy + span emission) and the shared logic has a single source of truth (structurally preventing drift like the earlier line-break / 180°-rotation divergences):

- RunGeometry + run_geometry(): the per-run geometry prelude (transform, is_matrix, ascent, origin, extent, font sizes), consumed via a structured binding so the call sites keep their local names.
- color_class(): the non-black paint-colour class suffix.
- PageBox + begin_page(): page-box dimensions, the page to_box transform and the `.p x# y#` class string.
- intern_font(): the font accept/reject bookkeeping shared by both font_family lambdas (each supplies its own per-font array growth).
- write_page_items(): the `<defs>` + paint-order SVG open/close dance over a variant<Line, Path> item list.
- write_header_common(): the document/head prologue with a callback for the mode-specific CSS rules.

Output-neutral: every reference-output document.html (dual and single layer, all engine=odr PDFs) is byte-identical before and after.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG

Trim redundant inline comments that restated adjacent docstrings, tighten the two long CSS-rationale blocks (fallback font size-adjust and .sr/.sg justify) without losing the reasoning, and hoist the duplicated to255 channel-clamp lambda into a shared helper.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG

Every caller of SfntFont::write / build_sfnt / cff::wrap_to_otf immediately funneled the ostream through an ostringstream to recover the bytes, and build_sfnt already assembled the whole file in memory before the single write, so the ostream bought nothing. Return std::string directly and drop the boilerplate (and one buffer copy) at every call site.

Symmetrically, now that construction takes std::string, remove the unique_ptr<istream> constructors: CffFont's was dead, and SfntFont's two callers can read the file stream into a string themselves via util::stream::read. Prune the now-unused stream includes throughout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG

andiwand and others added 17 commits July 1, 2026 19:06

revert test

d1b4527

update refs

532f94a

checkout lfs; cleanup

263cbb4

generalize tests

39c3045

cleanup

e5960c4

andiwand wants to merge 17 commits into main from font-return-string
