Skip to content

feat(cluster): hostname-aware SSRF guard for cluster base_url#510

Merged
SoundMindsAI merged 6 commits into
mainfrom
claude/codebase-security-review-6njwio
Jun 9, 2026
Merged

feat(cluster): hostname-aware SSRF guard for cluster base_url#510
SoundMindsAI merged 6 commits into
mainfrom
claude/codebase-security-review-6njwio

Conversation

@SoundMindsAI

Copy link
Copy Markdown
Owner

Summary

Closes the SSRF hostname-bypass in the cluster base_url validator (the most critical finding from the codebase security review, #504). The prior guard only inspected literal IPs, so in the hardened posture (RELYLOOP_ALLOW_PRIVATE_CLUSTERS=False) any DNS hostnamemetadata.google.internal, an internal name, or anything resolving to a private/loopback/link-local IP — sailed through unchecked and got probed.

  • Pure classifier (backend/app/domain/cluster/url_policy.py): is_blocked_address (private/loopback/link-local/reserved/multicast/unspecified + CGNAT 100.64.0.0/10, IPv4-mapped-IPv6 unwrap) + is_metadata_hostname + denylist. No I/O.
  • Async orchestrator (backend/app/services/cluster_url_policy.py): assert_base_url_allowed — flag-gated; metadata short-circuit → literal-IP classify → else getaddrinfo + classify every resolved address. gaierror falls through (an unresolvable host is not an SSRF target). Runs before any adapter build / probe in register_cluster + test_cluster_connection.
  • Error contract: new 400 CLUSTER_URL_BLOCKED (non-retryable) on POST /clusters + POST /clusters/test-connection.
  • Refactor (FR-4): collapsed the two byte-for-byte-duplicated base_url validators into one shared structural helper (scheme + host only); the SSRF policy moved to the service layer (DNS is I/O — must not block the sync Pydantic validator, per Gemini's earlier note on the idea).
  • Default unchanged: RELYLOOP_ALLOW_PRIVATE_CLUSTERS defaults True, so the policy is a strict no-op for local Docker hostnames — ships dormant, activates only in the hardened posture.

Phase 2 (connect-time IP pinning for DNS rebinding) deferred → phase2_idea.md.

Test coverage

  • Unit (38, runnable without a stack): test_cluster_url_policy.py (domain — 28: every blocked range incl. IPv4-mapped, metadata denylist) + (services — 10: flag no-op, metadata, literal, resolved-private/public, mixed-resolution fail-safe, gaierror fall-through, enforcement-fires-before-DB-and-probe on both entry points).
  • Contract (CI-gated): test_error_codes.pyCLUSTER_URL_BLOCKED envelope on both endpoints; test_clusters_api_contract.py scheme test unchanged (still green).
  • Integration (CI-gated): test_cluster_url_ssrf.pyreal-DNS proof (flag-False + http://elasticsearch:9200 resolves to a private Docker IP → blocked) + flag-True no-op end-to-end.

Test plan

  • uv run pytest backend/tests/unit/ — 2541 passed, 1 skipped
  • ruff check + ruff format --check backend/ — clean (609 files)
  • mypy backend/app — clean (177 files)
  • Integration + contract layers (need Postgres + ES — run in CI)

No migration, no UI. Cross-model review: Opus self-review (GPT-5.5 unreachable in the remote sandbox); Gemini Code Assist is the live cross-family gate on this PR.

Closes #504

https://claude.ai/code/session_01WmcRoU8EEzy7dZKB1Jkq34


Generated by Claude Code

claude added 5 commits June 9, 2026 19:16
Phase 1: flag-gated async resolve-and-classify SSRF policy at cluster
registration + test-connection (close the hostname bypass), literal-IP
range-set expansion, metadata-hostname denylist, validator de-dup, new
CLUSTER_URL_BLOCKED 400 code. Phase 2 (connect-time IP pinning for DNS
rebinding) deferred to phase2_idea.md.

cross-model review: Opus self-review (GPT-5.5 unreachable in sandbox).

Signed-off-by: Claude <noreply@anthropic.com>
3-story epic (pure domain classifier / async service orchestrator +
wiring + validator de-dup + router + tests / docs). Backend-only, no
migration, no UI. Plan consistency review green; all file:line claims
verified against the tree.

cross-model review: Opus self-review (GPT-5.5 unreachable in sandbox).

Signed-off-by: Claude <noreply@anthropic.com>
Add backend/app/domain/cluster/url_policy.py: is_blocked_address (private/
loopback/link-local/reserved/multicast/unspecified/CGNAT, IPv4-mapped IPv6
unwrap) + is_metadata_hostname + METADATA_HOSTNAMES denylist. Pure domain
logic, no I/O. 28 unit tests (AC-1, AC-6).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Claude <noreply@anthropic.com>
…onnection (Story 1.2)

- services/cluster_url_policy.py: async assert_base_url_allowed — flag-gated
  (RELYLOOP_ALLOW_PRIVATE_CLUSTERS), metadata-host short-circuit, literal-IP
  classify, else getaddrinfo + classify every resolved addr; gaierror falls
  through. Defines ClusterUrlBlocked (re-exported from cluster.py to avoid a cycle).
- cluster.py: call the guard before adapter build in register_cluster +
  test_cluster_connection.
- clusters.py: map ClusterUrlBlocked -> 400 CLUSTER_URL_BLOCKED on both endpoints.
- schemas.py: collapse the two duplicated base_url validators into one shared
  structural helper (scheme + host only); SSRF policy moved to the service.
- Tests: 10 service unit (flag no-op, metadata, literal, resolved-private/public,
  mixed, gaierror, enforcement-before-probe), contract envelope + integration
  real-DNS (CI-gated), CLUSTER_URL_BLOCKED registered in test_error_codes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Claude <noreply@anthropic.com>
New docs/04_security/cluster-url-ssrf.md (control, flag gate, blocked
ranges, metadata denylist, DNS-rebinding residual → Phase 2) + README
contents row + cluster-lifecycle.md pre-probe gate note.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Claude <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements a hostname-aware SSRF guard for cluster base_urls, active when private clusters are disallowed. It introduces a pure domain classifier for blocked IP ranges and metadata hostnames, and an async service orchestrator that performs non-blocking DNS resolution. It also consolidates duplicate Pydantic validators into a single structural check. Feedback suggests adding a timeout to the DNS resolution to prevent API hangs and validating the port structure early in the schema validation to catch malformed ports.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +89 to +92
try:
infos = await loop.getaddrinfo(host, None, type=socket.SOCK_STREAM)
except socket.gaierror:
return # unresolvable -> not an SSRF target; probe will report unreachable

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To prevent potential denial of service or API hangs when resolving slow or unresponsive DNS hostnames, it is highly recommended to enforce a timeout on the DNS resolution. Wrapping loop.getaddrinfo in asyncio.wait_for with a reasonable timeout (e.g., 5.0 seconds) ensures the API remains responsive.

Suggested change
try:
infos = await loop.getaddrinfo(host, None, type=socket.SOCK_STREAM)
except socket.gaierror:
return # unresolvable -> not an SSRF target; probe will report unreachable
try:
infos = await asyncio.wait_for(
loop.getaddrinfo(host, None, type=socket.SOCK_STREAM),
timeout=5.0,
)
except (socket.gaierror, asyncio.TimeoutError):
return # unresolvable or timeout -> not an SSRF target; probe will report unreachable

Comment on lines +74 to +79
parsed = urlparse(v)
if parsed.scheme not in ("http", "https"):
raise ValueError("base_url must use http or https scheme")
if not parsed.hostname:
raise ValueError("base_url must include a host")
return v

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Accessing parsed.port during structural validation ensures that any malformed port (e.g., http://elasticsearch:9200abc) is caught early and rejected with a 422 VALIDATION_ERROR rather than propagating to the service layer and causing unexpected failures later.

Suggested change
parsed = urlparse(v)
if parsed.scheme not in ("http", "https"):
raise ValueError("base_url must use http or https scheme")
if not parsed.hostname:
raise ValueError("base_url must include a host")
return v
parsed = urlparse(v)
if parsed.scheme not in ("http", "https"):
raise ValueError("base_url must use http or https scheme")
if not parsed.hostname:
raise ValueError("base_url must include a host")
try:
_ = parsed.port
except ValueError as exc:
raise ValueError("base_url has an invalid port") from exc
return v

…view)

- cluster_url_policy.py: wrap getaddrinfo in asyncio.wait_for(5s); a slow/
  blackholed resolver now times out and falls through (not an SSRF hit)
  instead of hanging the request (Gemini #1).
- schemas.py: structural validator accesses parsed.port so a malformed port
  (e.g. ':9200abc') surfaces as 422 VALIDATION_ERROR early (Gemini #2).
- Tests: DNS-timeout fall-through (service unit) + malformed-port 422 (contract).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Claude <noreply@anthropic.com>

Copy link
Copy Markdown
Owner Author

Review adjudication (Gemini Code Assist)

Commit landing fixes: 8c2ca37

Gemini Code Assist (2 findings)

# Sev Location Verdict Notes
1 Medium backend/app/services/cluster_url_policy.py:92 Accepted Wrapped getaddrinfo in asyncio.wait_for(timeout=5.0); a slow/blackholed resolver now trips the timeout and falls through (treated as "not an SSRF hit") instead of hanging the request. Directly hardens the "resolver hang" failure mode the spec had flagged as Phase-1-acceptable. New unit test test_dns_timeout_falls_through.
2 Medium backend/app/api/v1/schemas.py:79 Accepted Structural validator now accesses parsed.port, so a malformed port (http://host:9200abc) surfaces as 422 VALIDATION_ERROR early instead of propagating. New contract test test_create_cluster_request_rejects_malformed_port.

Outcomes

  • Applied fixes (2): DNS-resolution timeout; malformed-port 422.
  • Rejected (0):
  • Deferred (0):

Local gate after fixes: 52 targeted tests pass · ruff format --check + mypy backend/app (177 files) clean. CI re-running on 8c2ca37.

Cross-model review: Opus self-review (GPT-5.5 unreachable in the remote sandbox); Gemini is the live cross-family gate.


Generated by Claude Code

@SoundMindsAI SoundMindsAI merged commit 3cb28c7 into main Jun 9, 2026
19 checks passed
@SoundMindsAI SoundMindsAI deleted the claude/codebase-security-review-6njwio branch June 9, 2026 19:50
SoundMindsAI added a commit that referenced this pull request Jun 9, 2026
Finalization bookkeeping for the cluster base_url SSRF guard (Phase 1,
#510): pipeline_status + plan status → Complete, state.md last-5-merges +
context refresh, MVP2 dashboard + public roadmap regen. Folder retained in
planned_features/ — Phase 2 (connect-time IP pinning) still tracked in
phase2_idea.md.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug_cluster_url_ssrf_hostname_bypass: cluster base_url SSRF guard only inspects literal IPs

2 participants