fix(output): wire CSV/HTML/MD serializers, reject XLSX with helpful error #9
Open
anil-bd wants to merge 1 commit into
Before this change, --output paths with .csv, .html, or .md extensions
silently wrote pretty-printed JSON to disk. format_from_ext() mapped the
extension to the correct Output_format, but serialize() only handled
'pretty', 'json', and string values — every other format fell through to
JSON.stringify(data, null, 2).
This breaks the documented contract ("Output file format from extension")
and corrupts downstream consumers: opening a .csv in Excel or a DataFrame
loader fails or yields a single column of JSON. The video creator brief for
Scraper Studio explicitly promises "JSON, CSV, or XLSX" output — today
only JSON works.
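The fall-through described above can be sketched roughly as follows. This is a hypothetical reconstruction, not the code from src/utils/output.ts: the function name serialize_before_fix, the exact members of Output_format, and the assumption that 'json' means compact output are all illustrative.

```typescript
// Hypothetical reconstruction of the pre-fix dispatch. Only the behavior
// (unhandled formats end in pretty-printed JSON) comes from the commit message.
type Output_format = "pretty" | "json" | "csv" | "html" | "markdown";

function serialize_before_fix(data: unknown, format: Output_format): string {
  // String values are passed through untouched.
  if (typeof data === "string") return data;
  // Assumption: 'json' means compact JSON.
  if (format === "json") return JSON.stringify(data);
  // 'pretty' lands here on purpose; 'csv', 'html', and 'markdown' land here
  // by accident, which is the silent-JSON bug.
  return JSON.stringify(data, null, 2);
}

const sample = [{ id: 1, name: "a" }];
const csv_attempt = serialize_before_fix(sample, "csv");
console.log(csv_attempt.startsWith("[")); // true: a JSON array, not a CSV header row
```

A dispatch that switches on the format and has no shared default branch would make this kind of omission harder to miss.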
Changes
- serialize_csv: array-of-objects → RFC 4180 CSV with header row, embedded
comma/quote/newline escaping, union of keys across heterogeneous rows
- serialize_markdown: array-of-objects → pipe-table; non-tabular data
falls back to a fenced JSON block
- serialize_html: array-of-objects → minimal <table>; non-tabular data
falls back to <pre>JSON</pre>; HTML-special chars escaped
- format_from_ext: rejects .xlsx / .xls up front with a clear message
pointing to (a) --pretty -o file.json + xlsx-cli, or (b) the web UI's
download button. Hard fail beats silent corruption.
- print(): no behavior change; serialize() now does the right thing for
csv/html/markdown.
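As a minimal sketch of the tabular serializers listed above, assuming the input is already known to be an array of plain objects: the helper names collect_keys, csv_escape, serialize_csv, and serialize_markdown mirror this change, but the bodies, the cell_text helper, and the CRLF row separator are assumptions made for illustration, not the merged implementation.

```typescript
type Row = Record<string, unknown>;

// Union of keys in first-seen order, so heterogeneous rows share one header.
function collect_keys(rows: Row[]): string[] {
  const seen = new Set<string>();
  for (const row of rows) {
    for (const key of Object.keys(row)) seen.add(key);
  }
  return Array.from(seen);
}

// Hypothetical helper: missing values become empty cells, nested values
// are JSON-encoded.
function cell_text(value: unknown): string {
  if (value === null || value === undefined) return "";
  return typeof value === "object" ? JSON.stringify(value) : String(value);
}

// RFC 4180 quoting: wrap fields containing comma, quote, CR, or LF in
// double quotes and double any embedded quote.
function csv_escape(value: unknown): string {
  const text = cell_text(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

function serialize_csv(rows: Row[]): string {
  const keys = collect_keys(rows);
  const lines = [keys.map(csv_escape).join(",")];
  for (const row of rows) {
    lines.push(keys.map((key) => csv_escape(row[key])).join(","));
  }
  return lines.join("\r\n");
}

// Pipe-table cell: backslash-escape pipes, collapse newlines to spaces.
function md_cell(value: unknown): string {
  return cell_text(value).replace(/\|/g, "\\|").replace(/\r?\n/g, " ");
}

function serialize_markdown(rows: Row[]): string {
  const keys = collect_keys(rows);
  const line = (cells: string[]): string => `| ${cells.join(" | ")} |`;
  return [
    line(keys.map(md_cell)),
    line(keys.map(() => "---")),
    ...rows.map((row) => line(keys.map((key) => md_cell(row[key])))),
  ].join("\n");
}

const demo_rows: Row[] = [
  { id: 1, title: 'He said "hi", twice' },
  { id: 2, tags: ["a", "b"] },
];
console.log(serialize_csv(demo_rows));
// id,title,tags
// 1,"He said ""hi"", twice",
// 2,,"[""a"",""b""]"
console.log(serialize_markdown(demo_rows));
```

Because both serializers call collect_keys, they emit columns in the same order, which is the property the shared tabularization path is meant to guarantee.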
Tests
- 17 new tests in src/__tests__/utils/output.test.ts covering CSV escaping,
key union across heterogeneous rows, markdown pipe-escaping, HTML
entity escaping, the xlsx rejection, and end-to-end print() writes for
each extension (regression coverage for the silent-JSON bug).
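For illustration, the extension-mapping and xlsx-rejection cases could be exercised roughly like this. Everything here is a stand-in: the real format_from_ext calls fail() and exits with code 1, whereas this sketch throws so it can run without the CLI, and the extension table, error text, and expect_equal helper are assumptions rather than the contents of output.test.ts.

```typescript
// Hypothetical stand-in for format_from_ext, used only to show the shape of
// the checks; it throws instead of calling fail() and exiting.
type Output_format = "json" | "csv" | "html" | "markdown";

const EXT_TO_FORMAT = new Map<string, Output_format>([
  [".json", "json"],
  [".csv", "csv"],
  [".html", "html"],
  [".md", "markdown"],
]);

function format_from_ext(path: string): Output_format | null {
  const dot = path.lastIndexOf(".");
  const ext = dot === -1 ? "" : path.slice(dot).toLowerCase();
  if (ext === ".xlsx" || ext === ".xls") {
    // Assumed wording; the real message points to xlsx-cli or the web UI.
    throw new Error(`XLSX output is not supported: ${path}`);
  }
  return EXT_TO_FORMAT.get(ext) ?? null;
}

// Tiny check helper so the sketch does not depend on a test runner.
function expect_equal<T>(actual: T, expected: T, label: string): void {
  if (actual !== expected) throw new Error(`${label}: got ${String(actual)}`);
}

expect_equal(format_from_ext("out.csv"), "csv", "known extension");
expect_equal(format_from_ext("out.bin"), null, "unknown extension");

let rejected = false;
try {
  format_from_ext("report.xlsx");
} catch {
  rejected = true;
}
expect_equal(rejected, true, "xlsx rejection");
```

In a vitest suite the same three checks would typically be separate cases, with the process exit mocked rather than replaced by a thrown error.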
No public-API change; all existing exports preserved.
Refs: docs/audit DX issues N1, N2 (see scraper-studio-cli-demo/ISSUES.md)
fix(output): wire CSV/HTML/MD serializers, reject XLSX with helpful error
The bug (silent data corruption)
-o file.csv, -o file.html, and -o file.md silently write pretty-printed JSON to disk. The extension is honored by format_from_ext() but not by serialize(), which handles only 'pretty', 'json', and string values and sends every other format to JSON.stringify(data, null, 2).
Repro on this branch's main baseline:
    bdata scraper run c_xxx https://example.com/p/1 -o out.csv
    file out.csv   # → "JSON data" (not CSV)
.xlsx is even worse: format_from_ext() doesn't include it, so it falls through to 'raw' and writes JSON with an .xlsx extension. Excel refuses to open the file with a confusing "file format is not valid" error.
Why this matters now
The YouTube creator brief for Scraper Studio (the macro tech YouTuber engagement) lists "real output downloaded (JSON, CSV, or XLSX)" as a non-negotiable sponsor must-hit. Today the influencer can demo JSON, but any attempt at CSV/XLSX silently produces a broken file on camera. The take dies live.
Fix
src/utils/output.ts:
- serialize_csv: array-of-objects → RFC 4180 CSV. Union of keys across heterogeneous rows. Embedded comma / quote / newline escaping via csv_escape. Nested values are JSON-encoded inside the cell.
- serialize_markdown: array-of-objects → pipe-table. Cell pipes are backslash-escaped, newlines collapsed to spaces. Non-tabular data falls back to a fenced ```json block (still valid Markdown).
- serialize_html: array-of-objects → minimal <table>. HTML entities (& < > ") escaped. Non-tabular data falls back to <pre>-wrapped escaped JSON.
- format_from_ext: rejects .xlsx and .xls up front with fail() + a message that points to either (a) --pretty -o file.json + xlsx-cli, or (b) the web-UI download button. Hard fail beats silent corruption; no extra dependency added.
- print(): no behavior change; serialize() now does the right thing for each format. Existing JSON/pretty/raw paths preserved exactly.
to_rows() and collect_keys() are factored helpers so the three new serializers share one tabularization path and present the same column order.
Why no XLSX support
Adding XLSX would require the xlsx npm package (~120 KB minified + zlib). For a CLI primarily used in pipelines that pipe JSON forward, that's a heavy dependency for a format better handled by the web UI's purpose-built exporter or by a downstream tool. The error message tells users both options. Happy to add it behind a peer-dep if the team wants the symmetry.
Tests
17 new vitest cases in src/__tests__/utils/output.test.ts:
- format_from_ext: known extension mapping, unknown returns null, .xlsx rejection with message + exit 1 (3)
- print() end-to-end: writes correct content for .csv, .html, .md, .json; direct regression coverage for the silent-JSON bug (4)
Full suite: 190 passed / 9 failed. The 9 failures (in scrape, browser, discover, add-mcp tests) all exist on main before this change, confirmed by running the suite on the stashed baseline. Happy to file them as a separate issue.
What this does NOT do
- Change existing output paths (-o foo.json, --json, --pretty, stdout).
- Touch scraper, scrape, browser, pipelines, discover, search, or any command file; print()'s contract is the boundary.
Refs
- audit-log-scraper-studio/_audit-log.md (prior round, May 18 2026): broken format was observed but not isolated
- scraper-studio-cli-demo/ISSUES.md