feat(scraper-run): multi-URL input via --urls and --input-file #8
Open
anil-bd wants to merge 1 commit into
`bdata scraper run` accepted only one URL per invocation; for N URLs users had to spawn N processes (or N HTTP calls), each producing its own snapshot ID and its own poll loop.

The underlying `POST /dca/trigger` endpoint natively accepts an array body, and the official Scraper Studio reference SDKs (Node + Python) ship this as their canonical helper:

- https://github.com/brightdata/bright-data-scraper-studio-nodejs-project → triggerWithUrls(urls)
- https://github.com/brightdata/bright-data-scraper-studio-python-project → trigger_with_urls(urls)

This change exposes that path in the CLI so a list of URLs becomes one API call, one snapshot, one merged result array.

New flags on `scraper run`:

    --urls "u1,u2,..."     Comma-separated list of URLs.
    --input-file <path>    File with URLs: one per line (# comments and
                           blank lines skipped), OR a JSON array of URL
                           strings, OR a JSON array of {url} objects
                           (auto-detected by first char).

The positional `<url>` argument is now optional but otherwise unchanged. Exactly one input source must be provided; passing more than one errors with "only one input source".

Routing:

- 0 URLs → error, "requires one of: <url>, --urls, --input-file"
- 1 URL (any source) → existing single-URL path (trigger_immediate → poll get_result, or --sync /dca/crawl). Auto-fallback to /dca/trigger on realtime page-limit error stays in place.
- 2+ URLs (--urls / --input-file) → single POST /dca/trigger with array body, poll /dca/dataset for the merged array. `--sync` is rejected here with a clear error since /dca/crawl is single-URL only server-side.

Tests: 27 new cases covering parse_urls_arg, read_input_file (txt/JSON shapes + comments + malformed JSON + non-array JSON), resolve_run_inputs (mutual exclusion + empty + invalid URL), and the multi-URL handle_run_scraper flow (correct endpoint, correct array body, --sync rejection, single-URL via --urls still uses the legacy path).

Backward compatible: all 45 existing scraper tests pass unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
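
The `--input-file` auto-detection described in the commit message can be illustrated with a small sketch. This is not the PR's `read_input_file`; the function name `parseInputFileText`, its error messages, and the choice to treat a leading `[` or `{` as JSON are assumptions made for illustration only.

```typescript
// Hypothetical sketch of first-character format detection for an input file.
// Assumption: a leading "[" or "{" means JSON; anything else is line-based text.
function parseInputFileText(text: string): string[] {
  const trimmed = text.trim();

  if (trimmed.startsWith("[") || trimmed.startsWith("{")) {
    const parsed: unknown = JSON.parse(trimmed); // throws on malformed JSON
    if (!Array.isArray(parsed)) {
      throw new Error("input file JSON must be an array");
    }
    // Accept either URL strings or objects carrying a string `url` field.
    return parsed.map((item: unknown, index: number): string => {
      if (typeof item === "string") return item;
      if (typeof item === "object" && item !== null) {
        const url = (item as { url?: unknown }).url;
        if (typeof url === "string") return url;
      }
      throw new Error(`entry ${index} is neither a URL string nor a {url} object`);
    });
  }

  // Plain text: one URL per line, skipping blank lines and # comments.
  return trimmed
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && !line.startsWith("#"));
}
```

With this sketch, a text file and a mixed JSON array of strings and `{url}` objects both reduce to a flat `string[]`, which is the shape the later routing step would need regardless of input source.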
## Summary

`bdata scraper run` accepted only one URL per call. The reference Scraper Studio SDKs (Node + Python) treat batch input as the default pattern: they default to a 3-URL `SAMPLE_URLS` array and ship `triggerWithUrls(urls)` / `trigger_with_urls(urls)` helpers that POST the whole array to `/dca/trigger` in one request. The CLI was the outlier.

This PR exposes that path. A list of URLs becomes one API call, one snapshot, one merged result array.
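
- Node: `triggerWithUrls(urls)`
- Python: `trigger_with_urls(urls)`

## New flags on `scraper run`

| Flag | Description |
|---|---|
| `--urls "u1,u2,..."` | Comma-separated list of URLs |
| `--input-file <path>` | File with URLs: one per line (`#` comments and blank lines skipped), a JSON array of URL strings, or a JSON array of `{"url": "..."}` objects (auto-detected by first char) |

Positional `<url>` is now optional but otherwise unchanged. Exactly one input source must be provided; combining sources errors with `only one input source`.

## Routing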

| Input | API path |
|---|---|
| 1 URL (positional, or `--urls` / `--input-file` with one entry) | `/dca/trigger_immediate` → `/dca/get_result` (or `/dca/crawl` with `--sync`) |
| 2+ URLs (`--urls` / `--input-file`) | Single POST to `/dca/trigger` with array body → poll `/dca/dataset` |

`--sync` is rejected when combined with multi-URL input because `/dca/crawl` accepts only one URL server-side. Clear error message: `--sync cannot be combined with --urls / --input-file.`

The auto-fallback to `/dca/trigger` on realtime page-limit errors is unchanged.

## Backward compatibility

- `bdata scraper run <id> <url>` calls behave identically.
- The `run_batch` helper was generalized from `url: string` to `urls: string[]`; its only two callers (sync and async fallback paths) wrap the single URL in an array, which gives the same wire shape as before (`[{"url": "..."}]`).

## Tests

27 new cases (72 total in `scraper.test.ts`):

- `is_valid_url`, `parse_urls_arg`: input parsing primitives
- `read_input_file`: newline txt, JSON array of strings, JSON array of `{url}` objects, `#` comments, malformed JSON, non-array JSON, missing file, empty file
- `resolve_run_inputs`: positional / `--urls` / `--input-file` happy paths, mutual exclusion, empty after parsing, invalid URL surfaced by name
- `handle_run_scraper` multi-URL: correct endpoint (`/dca/trigger`), correct array body (`[{url}, {url}, {url}]`), `--sync` rejection with clear message, missing-input rejection, single URL via `--urls` still uses the legacy single path

## Docs

`README.md`: `scraper run` section rewritten with the new flags, the routing table, and three new examples (`--urls`, `--input-file` txt, `--input-file` JSON).

## Out of scope (suggested follow-up)

`bdata pipelines <type>` has the same gap: same underlying `/dca/trigger` endpoint, same single-URL CLI surface. Worth a parallel PR if there's appetite.

## Test plan

- `tsc --noEmit` clean
- `vitest run src/__tests__/commands/scraper.test.ts`: 72/72 pass
- `bdata scraper run <id> --urls "u1,u2,u3" --pretty` returns 3 records in a single array
- `--input-file urls.txt` (txt) and `urls.json` (JSON array)
- `--sync --urls "..."` returns the rejection error without making any API call

🤖 Generated with Claude Code
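
For reviewers, the routing rules described in this PR can be summarized as a small pure function. This is only a sketch under stated assumptions: `planRun`, `RunInputs`, and `Route` are hypothetical names rather than the PR's `resolve_run_inputs` / `handle_run_scraper`, the URL check is a deliberately simple regex standing in for `is_valid_url`, and the `invalid URL: ...` message is invented. The other error strings are the ones quoted in the description.

```typescript
// Hypothetical input shape: at most one of these should be set.
type RunInputs = { positional?: string; urls?: string; inputFileUrls?: string[] };

// Hypothetical routing result: the legacy single-URL path or one batch trigger.
type Route =
  | { kind: "single"; url: string; sync: boolean }
  | { kind: "batch"; endpoint: "/dca/trigger"; body: { url: string }[] };

function planRun(inputs: RunInputs, sync: boolean): Route {
  // Collect every input source that was actually provided.
  const sources: string[][] = [];
  if (inputs.positional !== undefined) sources.push([inputs.positional]);
  if (inputs.urls !== undefined) {
    const parts = inputs.urls.split(",").map((s) => s.trim());
    sources.push(parts.filter((s) => s.length > 0));
  }
  if (inputs.inputFileUrls !== undefined) sources.push(inputs.inputFileUrls);

  if (sources.length > 1) throw new Error("only one input source");

  const urls = ([] as string[]).concat(...sources);
  for (const url of urls) {
    // Assumption: a simple scheme check stands in for the real validator.
    if (!/^https?:\/\/\S+$/.test(url)) throw new Error(`invalid URL: ${url}`);
  }

  const [first, ...rest] = urls;
  if (first === undefined) {
    throw new Error("requires one of: <url>, --urls, --input-file");
  }
  // One URL from any source keeps the existing single-URL path.
  if (rest.length === 0) return { kind: "single", url: first, sync };

  // Two or more URLs: /dca/crawl is single-URL only, so --sync is rejected.
  if (sync) throw new Error("--sync cannot be combined with --urls / --input-file");
  return { kind: "batch", endpoint: "/dca/trigger", body: urls.map((url) => ({ url })) };
}
```

Keeping the decision separate from the HTTP calls means the `--sync` rejection can be asserted without any network access, which matches the test-plan item about returning the error before any API call is made.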