fix(output): wire CSV/HTML/MD serializers, reject XLSX with helpful error #9

Open
anil-bd wants to merge 1 commit into brightdata:main from anil-bd:fix/output-csv-html-md-serializers-and-xlsx-rejection
anil-bd commented May 25, 2026

fix(output): wire CSV/HTML/MD serializers, reject XLSX with helpful error

The bug (silent data corruption)

-o file.csv, -o file.html, and -o file.md silently write pretty-printed JSON to disk. The extension is honored by format_from_ext() but not by serialize():

// src/utils/output.ts (pre-fix)
type Output_format = 'markdown'|'json'|'pretty'|'html'|'csv'|'raw';

const serialize = (data: unknown, fmt: Output_format): string=>{
    if (fmt == 'pretty') return JSON.stringify(data, null, 2);
    if (fmt == 'json')   return JSON.stringify(data);
    if (typeof data == 'string') return data;
    return JSON.stringify(data, null, 2);   // ← csv/html/markdown land here
};

Repro on main (the baseline for this branch):

bdata scraper run c_xxx https://example.com/p/1 -o out.csv
file out.csv   # → "JSON data" (not CSV)

.xlsx is even worse: format_from_ext() doesn't include it, so it falls through to 'raw' and writes JSON with an .xlsx extension. Excel refuses to open the file with a confusing "file format is not valid" error.

Why this matters now

The YouTube creator brief for Scraper Studio (the macro tech YouTuber engagement) lists "real output downloaded (JSON, CSV, or XLSX)" as a non-negotiable sponsor must-hit. Today the influencer can demo JSON, but any attempt at CSV/XLSX silently produces a broken file on camera. The take dies live.

Fix

src/utils/output.ts:

  • serialize_csv: array-of-objects → RFC 4180 CSV. The header is the union of keys across heterogeneous rows. Embedded commas, quotes, and newlines are escaped via csv_escape. Nested values are JSON-encoded inside the cell.
  • serialize_markdown: array-of-objects → pipe-table. Pipes inside cells are backslash-escaped and newlines are collapsed to spaces. Non-tabular data falls back to a fenced ```json block (still valid Markdown).
  • serialize_html: array-of-objects → minimal <table>. HTML special characters (&, <, >, ") are escaped. Non-tabular data falls back to <pre>-wrapped escaped JSON.
  • format_from_ext: rejects .xlsx and .xls up front with fail() and a message that points to either (a) --pretty -o file.json + xlsx-cli, or (b) the web-UI download button. Hard fail beats silent corruption; no extra dependency added.
  • print(): no behavior change; serialize() now does the right thing for each format. Existing JSON/pretty/raw paths preserved exactly.
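
For reviewers who want the escaping rules in one place, here is a minimal sketch of the three cell-level escapes described in the list above. Only csv_escape is named in this PR; md_escape and html_escape are hypothetical names, and all three bodies are assumptions about the approach rather than the code in the diff.

```typescript
// Sketch only: bodies are assumptions, not the PR's actual implementation.

// CSV (RFC 4180): quote a field that contains a comma, quote, CR, or LF,
// and double any embedded quotes. Nested values are JSON-encoded first.
const csv_escape = (value: unknown): string => {
    if (value === null || value === undefined) return '';
    const text = typeof value == 'object' ? JSON.stringify(value) : String(value);
    return /[",\r\n]/.test(text) ? '"' + text.replace(/"/g, '""') + '"' : text;
};

// Markdown table cell (hypothetical name): backslash-escape pipes and
// collapse each newline to a single space so the row stays on one line.
const md_escape = (text: string): string =>
    text.replace(/\|/g, '\\|').replace(/\r?\n/g, ' ');

// HTML text (hypothetical name): escape & first so the entities produced
// by the later replacements are not escaped a second time.
const html_escape = (text: string): string =>
    text.replace(/&/g, '&amp;').replace(/</g, '&lt;')
        .replace(/>/g, '&gt;').replace(/"/g, '&quot;');
```

Mapping null and undefined to an empty CSV cell is a choice made for this sketch; the PR text does not say how missing values are rendered.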

to_rows() and collect_keys() are factored helpers so the three new serializers share one tabularization path and present the same column order.
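
A rough sketch of what that shared tabularization path could look like. The names to_rows and collect_keys come from this PR; their bodies here are assumptions, and to_grid is a hypothetical helper added only to show that every serializer would see the same column order.

```typescript
// Sketch only: signatures and bodies are assumed, not copied from the diff.
type Row = Record<string, unknown>;

// Normalize input to an array of plain-object rows. A single object is
// wrapped into a one-row array; anything else is treated as non-tabular.
const to_rows = (data: unknown): Row[] | null => {
    const items: unknown[] = Array.isArray(data) ? data : [data];
    const is_row = (v: unknown): v is Row =>
        typeof v == 'object' && v !== null && !Array.isArray(v);
    return items.length > 0 && items.every(is_row) ? (items as Row[]) : null;
};

// Union of keys in first-seen order, so heterogeneous rows share one header.
const collect_keys = (rows: Row[]): string[] => {
    const keys = new Set<string>();
    for (const row of rows)
        for (const key of Object.keys(row)) keys.add(key);
    return Array.from(keys);
};

// Hypothetical helper: header plus a rectangular grid of raw cell values.
// Keys missing from a row come back as undefined.
const to_grid = (rows: Row[]): { keys: string[]; cells: unknown[][] } => {
    const keys = collect_keys(rows);
    return { keys, cells: rows.map(row => keys.map(key => row[key])) };
};
```

Treating an empty array as non-tabular is an assumption of this sketch. With this shape, each serializer would only differ in how it escapes and joins the cells.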

Why no XLSX support

Adding XLSX would require the xlsx npm package (~120 KB minified + zlib). For a CLI primarily used in pipelines that pipe JSON forward, that's a heavy dependency for a format better handled by the web UI's purpose-built exporter or by a downstream tool. The error message tells users both options. Happy to add it behind a peer-dep if the team wants the symmetry.
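
To make the up-front rejection concrete, here is a hedged sketch of the extension check. The extension-to-format table (in particular .json mapping to 'json') and the message wording are assumptions, and the sketch throws an Error where the real code calls fail() and exits with status 1, so that it can run standalone.

```typescript
// Sketch only: mapping, message text, and error mechanism are assumptions.
type Output_format = 'markdown' | 'json' | 'pretty' | 'html' | 'csv' | 'raw';

const EXT_FORMATS = new Map<string, Output_format>([
    ['.json', 'json'],
    ['.csv', 'csv'],
    ['.html', 'html'],
    ['.md', 'markdown'],
]);

const format_from_ext = (file_path: string): Output_format | null => {
    // Look at the file name only, so dots in directory names are ignored.
    const base = file_path.split(/[\\/]/).pop() ?? '';
    const dot = base.lastIndexOf('.');
    const ext = dot <= 0 ? '' : base.slice(dot).toLowerCase();
    if (ext == '.xlsx' || ext == '.xls') {
        // The real implementation calls fail(); throwing keeps this testable.
        throw new Error(
            'XLSX output is not supported. Write JSON with --pretty -o file.json '
            + 'and convert it downstream, or use the download button in the web UI.');
    }
    // Unknown extensions return null; the caller picks its own default.
    return EXT_FORMATS.get(ext) ?? null;
};
```

Rejecting before any data is serialized means no file is written under a misleading extension.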

Tests

17 new vitest cases in src/__tests__/utils/output.test.ts:

  • CSV: array-of-objects header + rows, comma/quote/newline escaping, key union across heterogeneous rows, single-object wrap, nested-value JSON encoding (5 tests)
  • Markdown: pipe-table rendering, pipe/newline escaping, fenced-JSON fallback (3)
  • HTML: table rendering with HTML entity escaping, string fallback (2)
  • format_from_ext: known extension mapping, unknown returns null, .xlsx rejection with message + exit 1 (3)
  • print() end-to-end: writes correct content for .csv, .html, .md, and .json; direct regression coverage for the silent-JSON bug (4)

npx vitest run src/__tests__/utils/output.test.ts
# Test Files  1 passed (1)
#      Tests  17 passed (17)

Full suite: 190 passed / 9 failed. The 9 failures (in scrape, browser, discover, add-mcp tests) all exist on main before this change, confirmed by running the suite on the stashed baseline. Happy to file them as a separate issue.

What this does NOT do

  • Does not change any public CLI flag, behavior, or output where the format was already correct (-o foo.json, --json, --pretty, stdout).
  • Does not add a new dependency.
  • Does not touch the scraper, scrape, browser, pipelines, discover, search, or any other command file; print()'s contract is the boundary.

Refs

  • Audit log: audit-log-scraper-studio/_audit-log.md (prior round, May 18 2026); the broken format was observed but not isolated
  • Audit issues: N1 (CSV silent JSON), N2 (XLSX silent JSON) in scraper-studio-cli-demo/ISSUES.md
  • YouTube creator brief: "Show real output downloaded (JSON, CSV, or XLSX), not a screenshot." (sponsor must-hits, p.5)

…rror

Before this change, --output paths with .csv, .html, or .md extensions
silently wrote pretty-printed JSON to disk. format_from_ext() mapped the
extension to the correct Output_format, but serialize() only handled
'pretty', 'json', and string values — every other format fell through to
JSON.stringify(data, null, 2).

This breaks the documented contract ("Output file format from extension")
and corrupts downstream consumers: opening a .csv in Excel or a DataFrame
loader fails or yields a single column of JSON. The video creator brief for
Scraper Studio explicitly promises "JSON, CSV, or XLSX" output — today
only JSON works.

Changes
- serialize_csv: array-of-objects → RFC 4180 CSV with header row, embedded
  comma/quote/newline escaping, union of keys across heterogeneous rows
- serialize_markdown: array-of-objects → pipe-table; non-tabular data
  falls back to a fenced JSON block
- serialize_html: array-of-objects → minimal <table>; non-tabular data
  falls back to <pre>JSON</pre>; HTML-special chars escaped
- format_from_ext: rejects .xlsx / .xls up front with a clear message
  pointing to (a) --pretty -o file.json + xlsx-cli, or (b) the web UI's
  download button. Hard fail beats silent corruption.
- print(): no behavior change; serialize() now does the right thing for
  csv/html/markdown.

Tests
- 17 new tests in src/__tests__/utils/output.test.ts covering CSV escaping,
  key union across heterogeneous rows, markdown pipe-escaping, HTML
  entity escaping, the xlsx rejection, and end-to-end print() writes for
  each extension (regression coverage for the silent-JSON bug).

No public-API change; all existing exports preserved.

Refs: docs/audit DX issues N1, N2 (see scraper-studio-cli-demo/ISSUES.md)