
feat(scraper-run): multi-URL input via --urls and --input-file #8

Open
anil-bd wants to merge 1 commit into brightdata:main from anil-bd:feat/scraper-run-multi-url


anil-bd commented May 25, 2026

Summary

bdata scraper run accepted only one URL per call. The reference Scraper Studio SDKs (Node + Python) treat batch input as the default pattern — they default to a 3-URL SAMPLE_URLS array and ship triggerWithUrls(urls) / trigger_with_urls(urls) helpers that POST the whole array to /dca/trigger in one request. The CLI was the outlier.

This PR exposes that path. A list of URLs becomes one API call, one snapshot, one merged result array.

New flags on scraper run

| Flag | Description |
| --- | --- |
| `--urls "u1,u2,..."` | Comma-separated list of URLs |
| `--input-file <path>` | File with URLs: one per line (# comments and blank lines skipped), or a JSON array of strings, or a JSON array of `{"url": "..."}` objects (format auto-detected by first character) |
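The accepted input shapes can be illustrated with a minimal sketch. The names `is_valid_url` and `parse_urls_arg` are borrowed from the test list in this PR, but the bodies below are assumptions rather than the code in the diff. `parse_input_text` is a hypothetical helper that works on file contents (not a path) so the example stays free of filesystem calls. Treating a leading `{` as JSON and restricting URLs to http(s) are also assumptions.

```typescript
// Hypothetical sketch of the input-parsing primitives; not the PR's actual code.

/** Assumption: only absolute http(s) URLs count as valid. */
function is_valid_url(value: string): boolean {
  try {
    const parsed = new URL(value);
    return parsed.protocol === "http:" || parsed.protocol === "https:";
  } catch {
    return false; // new URL() throws on anything it cannot parse
  }
}

/** Split a --urls value on commas, trimming whitespace and dropping empty parts. */
function parse_urls_arg(raw: string): string[] {
  return raw
    .split(",")
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
}

/** Parse the text of an --input-file; the format is chosen from the first character. */
function parse_input_text(text: string): string[] {
  const trimmed = text.trim();
  if (trimmed.startsWith("[") || trimmed.startsWith("{")) {
    const data: unknown = JSON.parse(trimmed); // throws on malformed JSON
    if (!Array.isArray(data)) {
      throw new Error("input file JSON must be an array");
    }
    return data.map((entry: unknown) => {
      if (typeof entry === "string") return entry;
      if (typeof entry === "object" && entry !== null) {
        const url = (entry as { url?: unknown }).url;
        if (typeof url === "string") return url;
      }
      throw new Error("each JSON entry must be a string or an object with a url field");
    });
  }
  // Plain text: one URL per line, skipping blanks and # comments.
  return trimmed
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && !line.startsWith("#"));
}
```

Under these assumptions, a text file, a JSON array of strings, and a JSON array of `{"url": "..."}` objects all reduce to the same `string[]`, which can then be validated entry by entry with `is_valid_url`.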

Positional <url> is now optional but otherwise unchanged. Exactly one input source must be provided; combining sources fails with the error "only one input source".
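The exactly-one-source rule could be enforced along these lines. This is a sketch only: `pick_input_source` and the `RunInputFlags` shape are hypothetical names, and the real `resolve_run_inputs` in the diff may be structured differently. The error strings follow the wording quoted in this PR, with the list of conflicting sources appended as an assumption.

```typescript
// Hypothetical sketch of the mutual-exclusion check; not the PR's actual code.

type InputSource = "positional" | "urls" | "input-file";

interface RunInputFlags {
  url?: string;        // positional <url>
  urls?: string;       // raw --urls value
  input_file?: string; // --input-file path
}

/** Return the single input source in use, or throw if zero or several were given. */
function pick_input_source(flags: RunInputFlags): InputSource {
  const provided: InputSource[] = [];
  if (flags.url !== undefined) provided.push("positional");
  if (flags.urls !== undefined) provided.push("urls");
  if (flags.input_file !== undefined) provided.push("input-file");

  const first = provided[0];
  if (first === undefined) {
    throw new Error("requires one of: <url>, --urls, --input-file");
  }
  if (provided.length > 1) {
    throw new Error(`only one input source (got: ${provided.join(", ")})`);
  }
  return first;
}
```

Checking the source before any parsing keeps the error independent of file contents, so a bad combination of flags fails before the file is read.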

Routing

| Input | Path | Endpoint |
| --- | --- | --- |
| 0 URLs | error | |
| 1 URL (positional, or `--urls` / `--input-file` with one entry) | existing single-URL flow | `/dca/trigger_immediate` → `/dca/get_result` (or `/dca/crawl` with `--sync`) |
| 2+ URLs (`--urls` / `--input-file`) | new multi-URL batch | single POST to `/dca/trigger` with array body → poll `/dca/dataset` |

--sync is rejected when combined with multi-URL input, because /dca/crawl accepts only one URL server-side. The error message reads: --sync cannot be combined with --urls / --input-file.

The auto-fallback to /dca/trigger on realtime page-limit errors is unchanged.
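The routing rules can be condensed into one decision function. The sketch below uses a hypothetical `choose_route` helper and `Route` type that are not part of the diff; it models only the initial endpoint choice and leaves out polling and the page-limit fallback.

```typescript
// Hypothetical sketch of the routing decision; not the PR's actual code.

type Route =
  | { kind: "single"; endpoint: "/dca/trigger_immediate" | "/dca/crawl" }
  | { kind: "batch"; endpoint: "/dca/trigger" };

/** Pick the first endpoint to call from the resolved URL count and the --sync flag. */
function choose_route(urls: string[], sync: boolean): Route {
  if (urls.length === 0) {
    throw new Error("requires one of: <url>, --urls, --input-file");
  }
  if (urls.length === 1) {
    // Legacy single-URL flow, regardless of which flag supplied the URL.
    return { kind: "single", endpoint: sync ? "/dca/crawl" : "/dca/trigger_immediate" };
  }
  if (sync) {
    // /dca/crawl is single-URL only, so fail before making any API call.
    throw new Error("--sync cannot be combined with --urls / --input-file");
  }
  return { kind: "batch", endpoint: "/dca/trigger" };
}
```

Keying the decision on the resolved URL count rather than on the flag used is what lets a one-entry `--urls` value stay on the legacy path.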

Backward compatibility

  • Existing bdata scraper run <id> <url> calls behave identically.
  • All 45 pre-existing scraper tests pass unchanged.
  • The run_batch helper was generalized from url: string to urls: string[]; its only two callers (sync and async fallback paths) wrap the single URL in an array — same wire shape as before ([{"url": "..."}]).
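The wire-shape claim in the last bullet is easy to check with a small sketch. `build_trigger_body` is a hypothetical name used for illustration; the diff may build the body inline inside `run_batch`.

```typescript
// Hypothetical sketch of the /dca/trigger request body; not the PR's actual code.

/** Map a list of URLs to the array-of-objects body posted to /dca/trigger. */
function build_trigger_body(urls: string[]): Array<{ url: string }> {
  return urls.map((url) => ({ url }));
}

// A single URL wrapped in an array serializes to the same body as before.
const single_body = JSON.stringify(build_trigger_body(["https://a.example"]));
// single_body === '[{"url":"https://a.example"}]'

// Several URLs extend the same array, one object per URL.
const batch_body = JSON.stringify(
  build_trigger_body(["https://a.example", "https://b.example"]),
);
// batch_body === '[{"url":"https://a.example"},{"url":"https://b.example"}]'
```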

Tests

27 new cases (72 total in scraper.test.ts):

  • is_valid_url, parse_urls_arg — input parsing primitives
  • read_input_file — newline txt, JSON array of strings, JSON array of {url} objects, # comments, malformed JSON, non-array JSON, missing file, empty file
  • resolve_run_inputs — positional / --urls / --input-file happy paths, mutual exclusion, empty after parsing, invalid URL surfaced by name
  • handle_run_scraper multi-URL — correct endpoint (/dca/trigger), correct array body ([{url}, {url}, {url}]), --sync rejection with clear message, missing-input rejection, single URL via --urls still uses the legacy single path

Docs

  • README.md: scraper run section rewritten with the new flags, the routing table, and three new examples (--urls, --input-file txt, --input-file JSON).

Out of scope (suggested follow-up)

bdata pipelines <type> has the same gap — same underlying /dca/trigger endpoint, same single-URL CLI surface. Worth a parallel PR if there's appetite.

Test plan

  • tsc --noEmit clean
  • vitest run src/__tests__/commands/scraper.test.ts — 72/72 pass
  • Smoke test against a real collector: bdata scraper run <id> --urls "u1,u2,u3" --pretty returns 3 records in a single array
  • Smoke test --input-file urls.txt (txt) and urls.json (JSON array)
  • Smoke test --sync --urls "..." returns the rejection error without making any API call

🤖 Generated with Claude Code

`bdata scraper run` accepted only one URL per invocation; for N URLs
users had to spawn N processes (or N HTTP calls), each producing its
own snapshot ID and its own poll loop. The underlying
`POST /dca/trigger` endpoint natively accepts an array body, and the
official Scraper Studio reference SDKs (Node + Python) ship this as
their canonical helper:

  - https://github.com/brightdata/bright-data-scraper-studio-nodejs-project
    → triggerWithUrls(urls)
  - https://github.com/brightdata/bright-data-scraper-studio-python-project
    → trigger_with_urls(urls)

This change exposes that path in the CLI so a list of URLs becomes
one API call, one snapshot, one merged result array.

New flags on `scraper run`:

  --urls "u1,u2,..."    Comma-separated list of URLs.
  --input-file <path>   File with URLs — one per line (# comments
                        and blank lines skipped), OR a JSON array
                        of URL strings, OR a JSON array of {url}
                        objects (auto-detected by first char).

The positional `<url>` argument is now optional but otherwise
unchanged. Exactly one input source must be provided; passing more
than one errors with "only one input source".

Routing:
  - 0 URLs                → error, "requires one of: <url>, --urls, --input-file"
  - 1 URL  (any source)   → existing single-URL path (trigger_immediate
                            → poll get_result, or --sync /dca/crawl).
                            Auto-fallback to /dca/trigger on realtime
                            page-limit error stays in place.
  - 2+ URLs (--urls / --input-file)
                          → single POST /dca/trigger with array body,
                            poll /dca/dataset for the merged array.
                            `--sync` is rejected here with a clear
                            error since /dca/crawl is single-URL only
                            server-side.

Tests: 27 new cases covering parse_urls_arg, read_input_file
(txt/JSON shapes + comments + malformed JSON + non-array JSON),
resolve_run_inputs (mutual exclusion + empty + invalid URL), and
the multi-URL handle_run_scraper flow (correct endpoint, correct
array body, --sync rejection, single-URL via --urls still uses the
legacy path).

Backward compatible: all 45 existing scraper tests pass unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>