feat(cluster): hostname-aware SSRF guard for cluster base_url#510
Conversation
Phase 1: flag-gated async resolve-and-classify SSRF policy at cluster registration + test-connection (close the hostname bypass), literal-IP range-set expansion, metadata-hostname denylist, validator de-dup, new CLUSTER_URL_BLOCKED 400 code. Phase 2 (connect-time IP pinning for DNS rebinding) deferred to phase2_idea.md. cross-model review: Opus self-review (GPT-5.5 unreachable in sandbox). Signed-off-by: Claude <noreply@anthropic.com>
3-story epic (pure domain classifier / async service orchestrator + wiring + validator de-dup + router + tests / docs). Backend-only, no migration, no UI. Plan consistency review green; all file:line claims verified against the tree. cross-model review: Opus self-review (GPT-5.5 unreachable in sandbox). Signed-off-by: Claude <noreply@anthropic.com>
Add backend/app/domain/cluster/url_policy.py: is_blocked_address (private/ loopback/link-local/reserved/multicast/unspecified/CGNAT, IPv4-mapped IPv6 unwrap) + is_metadata_hostname + METADATA_HOSTNAMES denylist. Pure domain logic, no I/O. 28 unit tests (AC-1, AC-6). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Claude <noreply@anthropic.com>
…onnection (Story 1.2) - services/cluster_url_policy.py: async assert_base_url_allowed — flag-gated (RELYLOOP_ALLOW_PRIVATE_CLUSTERS), metadata-host short-circuit, literal-IP classify, else getaddrinfo + classify every resolved addr; gaierror falls through. Defines ClusterUrlBlocked (re-exported from cluster.py to avoid a cycle). - cluster.py: call the guard before adapter build in register_cluster + test_cluster_connection. - clusters.py: map ClusterUrlBlocked -> 400 CLUSTER_URL_BLOCKED on both endpoints. - schemas.py: collapse the two duplicated base_url validators into one shared structural helper (scheme + host only); SSRF policy moved to the service. - Tests: 10 service unit (flag no-op, metadata, literal, resolved-private/public, mixed, gaierror, enforcement-before-probe), contract envelope + integration real-DNS (CI-gated), CLUSTER_URL_BLOCKED registered in test_error_codes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Claude <noreply@anthropic.com>
New docs/04_security/cluster-url-ssrf.md (control, flag gate, blocked ranges, metadata denylist, DNS-rebinding residual → Phase 2) + README contents row + cluster-lifecycle.md pre-probe gate note. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Claude <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request implements a hostname-aware SSRF guard for cluster base_urls, active when private clusters are disallowed. It introduces a pure domain classifier for blocked IP ranges and metadata hostnames, and an async service orchestrator that performs non-blocking DNS resolution. It also consolidates duplicate Pydantic validators into a single structural check. Feedback suggests adding a timeout to the DNS resolution to prevent API hangs and validating the port structure early in the schema validation to catch malformed ports.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| try: | ||
| infos = await loop.getaddrinfo(host, None, type=socket.SOCK_STREAM) | ||
| except socket.gaierror: | ||
| return # unresolvable -> not an SSRF target; probe will report unreachable |
There was a problem hiding this comment.
To prevent potential denial of service or API hangs when resolving slow or unresponsive DNS hostnames, it is highly recommended to enforce a timeout on the DNS resolution. Wrapping loop.getaddrinfo in asyncio.wait_for with a reasonable timeout (e.g., 5.0 seconds) ensures the API remains responsive.
| try: | |
| infos = await loop.getaddrinfo(host, None, type=socket.SOCK_STREAM) | |
| except socket.gaierror: | |
| return # unresolvable -> not an SSRF target; probe will report unreachable | |
| try: | |
| infos = await asyncio.wait_for( | |
| loop.getaddrinfo(host, None, type=socket.SOCK_STREAM), | |
| timeout=5.0, | |
| ) | |
| except (socket.gaierror, asyncio.TimeoutError): | |
| return # unresolvable or timeout -> not an SSRF target; probe will report unreachable |
| parsed = urlparse(v) | ||
| if parsed.scheme not in ("http", "https"): | ||
| raise ValueError("base_url must use http or https scheme") | ||
| if not parsed.hostname: | ||
| raise ValueError("base_url must include a host") | ||
| return v |
There was a problem hiding this comment.
Accessing parsed.port during structural validation ensures that any malformed port (e.g., http://elasticsearch:9200abc) is caught early and rejected with a 422 VALIDATION_ERROR rather than propagating to the service layer and causing unexpected failures later.
| parsed = urlparse(v) | |
| if parsed.scheme not in ("http", "https"): | |
| raise ValueError("base_url must use http or https scheme") | |
| if not parsed.hostname: | |
| raise ValueError("base_url must include a host") | |
| return v | |
| parsed = urlparse(v) | |
| if parsed.scheme not in ("http", "https"): | |
| raise ValueError("base_url must use http or https scheme") | |
| if not parsed.hostname: | |
| raise ValueError("base_url must include a host") | |
| try: | |
| _ = parsed.port | |
| except ValueError as exc: | |
| raise ValueError("base_url has an invalid port") from exc | |
| return v |
…view) - cluster_url_policy.py: wrap getaddrinfo in asyncio.wait_for(5s); a slow/ blackholed resolver now times out and falls through (not an SSRF hit) instead of hanging the request (Gemini #1). - schemas.py: structural validator accesses parsed.port so a malformed port (e.g. ':9200abc') surfaces as 422 VALIDATION_ERROR early (Gemini #2). - Tests: DNS-timeout fall-through (service unit) + malformed-port 422 (contract). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Claude <noreply@anthropic.com>
Review adjudication (Gemini Code Assist)Commit landing fixes: Gemini Code Assist (2 findings)
Outcomes
Local gate after fixes: 52 targeted tests pass · Cross-model review: Opus self-review (GPT-5.5 unreachable in the remote sandbox); Gemini is the live cross-family gate. Generated by Claude Code |
Finalization bookkeeping for the cluster base_url SSRF guard (Phase 1, #510): pipeline_status + plan status → Complete, state.md last-5-merges + context refresh, MVP2 dashboard + public roadmap regen. Folder retained in planned_features/ — Phase 2 (connect-time IP pinning) still tracked in phase2_idea.md.
Summary
Closes the SSRF hostname-bypass in the cluster
base_urlvalidator (the most critical finding from the codebase security review, #504). The prior guard only inspected literal IPs, so in the hardened posture (RELYLOOP_ALLOW_PRIVATE_CLUSTERS=False) any DNS hostname —metadata.google.internal, an internal name, or anything resolving to a private/loopback/link-local IP — sailed through unchecked and got probed.backend/app/domain/cluster/url_policy.py):is_blocked_address(private/loopback/link-local/reserved/multicast/unspecified + CGNAT100.64.0.0/10, IPv4-mapped-IPv6 unwrap) +is_metadata_hostname+ denylist. No I/O.backend/app/services/cluster_url_policy.py):assert_base_url_allowed— flag-gated; metadata short-circuit → literal-IP classify → elsegetaddrinfo+ classify every resolved address.gaierrorfalls through (an unresolvable host is not an SSRF target). Runs before any adapter build / probe inregister_cluster+test_cluster_connection.400 CLUSTER_URL_BLOCKED(non-retryable) onPOST /clusters+POST /clusters/test-connection.base_urlvalidators into one shared structural helper (scheme + host only); the SSRF policy moved to the service layer (DNS is I/O — must not block the sync Pydantic validator, per Gemini's earlier note on the idea).RELYLOOP_ALLOW_PRIVATE_CLUSTERSdefaultsTrue, so the policy is a strict no-op for local Docker hostnames — ships dormant, activates only in the hardened posture.Phase 2 (connect-time IP pinning for DNS rebinding) deferred →
phase2_idea.md.Test coverage
test_cluster_url_policy.py(domain — 28: every blocked range incl. IPv4-mapped, metadata denylist) + (services — 10: flag no-op, metadata, literal, resolved-private/public, mixed-resolution fail-safe, gaierror fall-through, enforcement-fires-before-DB-and-probe on both entry points).test_error_codes.py—CLUSTER_URL_BLOCKEDenvelope on both endpoints;test_clusters_api_contract.pyscheme test unchanged (still green).test_cluster_url_ssrf.py— real-DNS proof (flag-False +http://elasticsearch:9200resolves to a private Docker IP → blocked) + flag-True no-op end-to-end.Test plan
uv run pytest backend/tests/unit/— 2541 passed, 1 skippedruff check+ruff format --check backend/— clean (609 files)mypy backend/app— clean (177 files)No migration, no UI. Cross-model review: Opus self-review (GPT-5.5 unreachable in the remote sandbox); Gemini Code Assist is the live cross-family gate on this PR.
Closes #504
https://claude.ai/code/session_01WmcRoU8EEzy7dZKB1Jkq34
Generated by Claude Code