fix(output): wire CSV/HTML/MD serializers, reject XLSX with helpful error by anil-bd · Pull Request #9 · brightdata/cli

anil-bd · 2026-05-25T09:38:02Z

fix(output): wire CSV/HTML/MD serializers, reject XLSX with helpful error

The bug (silent data corruption)

-o file.csv, -o file.html, and -o file.md silently write pretty-printed JSON to disk. The extension is honored by format_from_ext() but not by serialize():

// src/utils/output.ts (pre-fix)
type Output_format = 'markdown'|'json'|'pretty'|'html'|'csv'|'raw';

const serialize = (data: unknown, fmt: Output_format): string=>{
    if (fmt == 'pretty') return JSON.stringify(data, null, 2);
    if (fmt == 'json')   return JSON.stringify(data);
    if (typeof data == 'string') return data;
    return JSON.stringify(data, null, 2);   // ← csv/html/markdown land here
};

Repro on the main baseline (before this branch's change):

bdata scraper run c_xxx https://example.com/p/1 -o out.csv
file out.csv   # → "JSON data" (not CSV)

.xlsx is even worse: format_from_ext() doesn't include it, so it falls through to 'raw' and writes JSON with an .xlsx extension. Excel refuses to open the file with a confusing "file format is not valid" error.

Why this matters now

The YouTube creator brief for Scraper Studio (the macro tech YouTuber engagement) lists "real output downloaded (JSON, CSV, or XLSX)" as a non-negotiable sponsor must-hit. Today the influencer can demo JSON, but any attempt at CSV/XLSX silently produces a broken file on camera. The take dies live.

Fix

src/utils/output.ts:

serialize_csv: array-of-objects → RFC 4180 CSV. Union of keys across heterogeneous rows. Embedded comma / quote / newline escaping via csv_escape. Nested values are JSON-encoded inside the cell.
serialize_markdown: array-of-objects → pipe-table. Cell pipes are backslash-escaped, newlines collapsed to spaces. Non-tabular data falls back to a fenced ```json block (still valid Markdown).
serialize_html: array-of-objects → minimal <table>. HTML entities (& < > ") escaped. Non-tabular data falls back to <pre>-wrapped escaped JSON.
format_from_ext: rejects .xlsx and .xls up front with fail() and a message that points to either (a) --pretty -o file.json + xlsx-cli, or (b) the web-UI download button. Hard fail beats silent corruption; no extra dependency added.
print(): no behavior change; serialize() now does the right thing for each format. Existing JSON/pretty/raw paths preserved exactly.

to_rows() and collect_keys() are factored helpers so the three new serializers share one tabularization path and present the same column order.
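To make the shared tabularization path concrete, here is a minimal sketch of the CSV side. The function names come from the description above, but the bodies are assumptions rather than the code in this PR. In particular, the CRLF line ending, the empty cell for missing keys, and the JSON fallback for non-tabular input are illustrative choices.

```typescript
// Hypothetical sketch: names follow the PR text, bodies are assumptions.
type Row = Record<string, unknown>;

const is_plain_object = (v: unknown): v is Row =>
    typeof v == 'object' && v !== null && !Array.isArray(v);

// Normalize input to a list of object rows, or null when it is not tabular.
// Assumption: an empty array is treated as non-tabular.
const to_rows = (data: unknown): Row[] | null => {
    if (Array.isArray(data))
        return data.length > 0 && data.every(is_plain_object) ? data as Row[] : null;
    return is_plain_object(data) ? [data] : null;   // single-object wrap
};

// Union of keys across all rows, in first-seen order.
const collect_keys = (rows: Row[]): string[] => {
    const seen = new Set<string>();
    for (const row of rows)
        for (const key of Object.keys(row))
            seen.add(key);
    return [...seen];
};

// Quote a cell only when it contains a comma, quote, CR, or LF;
// embedded quotes are doubled. Nested values are JSON-encoded first.
const csv_escape = (value: unknown): string => {
    if (value === null || value === undefined) return '';
    const text = typeof value == 'object' ? JSON.stringify(value) : String(value);
    return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
};

const serialize_csv = (data: unknown): string => {
    const rows = to_rows(data);
    // Assumed fallback for non-tabular input: strings as-is, otherwise JSON.
    if (!rows) return typeof data == 'string' ? data : JSON.stringify(data, null, 2);
    const keys = collect_keys(rows);
    const table: unknown[][] = [keys, ...rows.map(row => keys.map(k => row[k]))];
    return table.map(cells => cells.map(csv_escape).join(',')).join('\r\n');
};
```

Because every serializer would take its columns from collect_keys(), a key that appears only in a later row still gets a column, and earlier rows simply leave that cell empty.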

Why no XLSX support

Adding XLSX would require the xlsx npm package (~120 KB minified + zlib). For a CLI primarily used in pipelines that pipe JSON forward, that's a heavy dependency for a format better handled by the web UI's purpose-built exporter or by a downstream tool. The error message tells users both options. Happy to add it behind a peer-dep if the team wants the symmetry.
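The up-front rejection could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's code: the real implementation calls the project's fail() helper, which this sketch replaces with a thrown Error so it stays self-contained, and the extension-to-format table (including the .json entry) is a guess at the mapping.

```typescript
type Output_format = 'markdown'|'json'|'pretty'|'html'|'csv'|'raw';

// Assumed mapping; the actual table in src/utils/output.ts may differ.
const EXT_FORMATS: Record<string, Output_format> = {
    '.json': 'json',
    '.csv': 'csv',
    '.html': 'html',
    '.md': 'markdown',
};

// Returns the format for a known extension, null for an unknown one,
// and refuses spreadsheet extensions instead of writing mislabeled JSON.
const format_from_ext = (file_path: string): Output_format | null => {
    const dot = file_path.lastIndexOf('.');
    const ext = dot < 0 ? '' : file_path.slice(dot).toLowerCase();
    if (ext == '.xlsx' || ext == '.xls') {
        // Stand-in for fail(): the PR exits with code 1 and a message.
        throw new Error(
            `Cannot write ${ext} output. Use --pretty -o file.json and convert ` +
            'with xlsx-cli, or use the download button in the web UI.');
    }
    return EXT_FORMATS[ext] ?? null;
};
```

Checking the extension before any data is serialized means the command stops before a file is created, so no mislabeled .xlsx is left on disk.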

Tests

17 new vitest cases in src/__tests__/utils/output.test.ts:

CSV: array-of-objects header + rows, comma/quote/newline escaping, key union across heterogeneous rows, single-object wrap, nested-value JSON encoding (5 tests)
Markdown: pipe-table rendering, pipe/newline escaping, fenced-JSON fallback (3)
HTML: table rendering with HTML entity escaping, string fallback (2)
format_from_ext: known extension mapping, unknown returns null, .xlsx rejection with message + exit 1 (3)
print() end-to-end: writes correct content for .csv, .html, .md, and .json; direct regression coverage for the silent-JSON bug (4)

npx vitest run src/__tests__/utils/output.test.ts
# Test Files  1 passed (1)
#      Tests  17 passed (17)

Full suite: 190 passed / 9 failed. The 9 failures (in scrape, browser, discover, add-mcp tests) all exist on main before this change, confirmed by running the suite on the stashed baseline. Happy to file them as a separate issue.
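For readers who want to see what the Markdown and HTML escaping cases exercise, the sketch below shows one plausible shape for the two escapers and the table builders. All names here (cell_text, md_escape, html_escape, markdown_table, html_table) are hypothetical, and the exact markup the PR emits may differ.

```typescript
// Hypothetical helpers; not the names or bodies used in the PR.
type Row = Record<string, unknown>;

// Render any cell value as text; nested values are JSON-encoded.
const cell_text = (v: unknown): string =>
    v === null || v === undefined ? ''
    : typeof v == 'object' ? JSON.stringify(v) : String(v);

// Markdown: backslash-escape pipes, collapse newlines to spaces.
const md_escape = (v: unknown): string =>
    cell_text(v).replace(/\|/g, '\\|').replace(/\r?\n/g, ' ');

// HTML: escape & first so later entities are not double-escaped.
const html_escape = (v: unknown): string =>
    cell_text(v).replace(/&/g, '&amp;').replace(/</g, '&lt;')
        .replace(/>/g, '&gt;').replace(/"/g, '&quot;');

const markdown_table = (rows: Row[], keys: string[]): string => [
    `| ${keys.map(k => md_escape(k)).join(' | ')} |`,
    `| ${keys.map(() => '---').join(' | ')} |`,
    ...rows.map(row => `| ${keys.map(k => md_escape(row[k])).join(' | ')} |`),
].join('\n');

const html_table = (rows: Row[], keys: string[]): string => {
    const head = keys.map(k => `<th>${html_escape(k)}</th>`).join('');
    const body = rows.map(row =>
        `<tr>${keys.map(k => `<td>${html_escape(row[k])}</td>`).join('')}</tr>`).join('');
    return `<table><thead><tr>${head}</tr></thead><tbody>${body}</tbody></table>`;
};
```

Escaping the ampersand before the other characters matters: doing it last would turn an already-produced `&lt;` into `&amp;lt;`.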

What this does NOT do

Does not change any public CLI flag, behavior, or output where the format was already correct (-o foo.json, --json, --pretty, stdout).
Does not add a new dependency.
Does not touch the scraper, scrape, browser, pipelines, discover, search, or any other command file; print()'s contract is the boundary.

Refs

Audit log: audit-log-scraper-studio/_audit-log.md (prior round, May 18 2026); the broken format was observed there but not isolated
Audit issues: N1 (CSV silent JSON), N2 (XLSX silent JSON) in scraper-studio-cli-demo/ISSUES.md
YouTube creator brief: "Show real output downloaded (JSON, CSV, or XLSX), not a screenshot." (sponsor must-hits, p.5)

…rror

Before this change, --output paths with .csv, .html, or .md extensions silently wrote pretty-printed JSON to disk. format_from_ext() mapped the extension to the correct Output_format, but serialize() only handled 'pretty', 'json', and string values; every other format fell through to JSON.stringify(data, null, 2).

This breaks the documented contract ("Output file format from extension") and corrupts downstream consumers: opening a .csv in Excel or a DataFrame loader fails or yields a single column of JSON. The video creator brief for Scraper Studio explicitly promises "JSON, CSV, or XLSX" output, but today only JSON works.

Changes

- serialize_csv: array-of-objects → RFC 4180 CSV with header row, embedded comma/quote/newline escaping, union of keys across heterogeneous rows
- serialize_markdown: array-of-objects → pipe-table; non-tabular data falls back to a fenced JSON block
- serialize_html: array-of-objects → minimal <table>; non-tabular data falls back to <pre>JSON</pre>; HTML-special chars escaped
- format_from_ext: rejects .xlsx / .xls up front with a clear message pointing to (a) --pretty -o file.json + xlsx-cli, or (b) the web UI's download button. Hard fail beats silent corruption.
- print(): no behavior change; serialize() now does the right thing for csv/html/markdown.

Tests

- 17 new tests in src/__tests__/utils/output.test.ts covering CSV escaping, key union across heterogeneous rows, markdown pipe-escaping, HTML entity escaping, the xlsx rejection, and end-to-end print() writes for each extension (regression coverage for the silent-JSON bug).

No public-API change; all existing exports preserved.

Refs: docs/audit DX issues N1, N2 (see scraper-studio-cli-demo/ISSUES.md)

anil-bd wants to merge 1 commit into brightdata:main from anil-bd:fix/output-csv-html-md-serializers-and-xlsx-rejection
