
Harden sandbox reads during resume#172

Open
RitwijParmar wants to merge 1 commit into
blaxel-ai:mainfrom
RitwijParmar:codex/resume-safe-sandbox-reads

Conversation


@RitwijParmar RitwijParmar commented Jun 17, 2026


Fixes ENG-2972

Summary

Fixes #142.

This makes idempotent sandbox reads survive the short gateway window that can happen while a standby sandbox resumes. The existing retry helper already covered transport resets. This extends the same bounded policy to transient gateway responses.

Covered paths:

  • sandbox.fetch() for GET, HEAD, and OPTIONS
  • sync sandbox.fetch() for the same read methods
  • async and sync drive list()
  • async and sync system health()

Mutating calls such as POST fetches, drive mount, drive unmount, and system upgrade are left unchanged so the SDK does not replay side effects.
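The method gate described above could look roughly like the following. This is a minimal sketch, not the SDK's code: `RETRYABLE_METHODS`, `is_retryable_method`, and `dispatch` are hypothetical names introduced here for illustration.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

# Read methods named in this PR; every other method is treated as mutating.
RETRYABLE_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


def is_retryable_method(method: str) -> bool:
    """Return True only for the read methods this PR retries."""
    return method.strip().upper() in RETRYABLE_METHODS


def dispatch(
    method: str,
    send: Callable[[], T],
    retry: Callable[[Callable[[], T]], T],
) -> T:
    """Route reads through the retry wrapper; send mutating calls exactly once."""
    if is_retryable_method(method):
        return retry(send)
    return send()
```

With this shape a POST never reaches the retry wrapper, so a 503 on a mutating call is returned to the caller after a single attempt.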

Notes

The retry classifier is intentionally narrow. It only treats 425, 429, 502, 503, and 504 as retryable sandbox read responses. Other HTTP responses still return or fail as before.
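As an illustration of how narrow the classifier is, here is a small sketch that accepts either a plain integer or an `http.HTTPStatus` member. The name `classify_status` and the constant are assumptions made for this example; the SDK's own helper may be shaped differently.

```python
from http import HTTPStatus
from typing import Union

# The five statuses this PR treats as retryable for sandbox reads.
TRANSIENT_READ_STATUSES = frozenset({425, 429, 502, 503, 504})


def classify_status(status: Union[int, HTTPStatus]) -> bool:
    """Return True when the status is one of the transient gateway codes."""
    # HTTPStatus is an IntEnum, so int() normalizes both input kinds.
    return int(status) in TRANSIENT_READ_STATUSES
```

Statuses such as 500 or 404 fall outside the set, so they are returned or raised exactly as before.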

Tests

  • uv run --group test pytest tests/core/test_sandbox_transient_retry.py -q
  • uv run --group test pytest tests/core/test_sandbox_transient_retry.py tests/core/test_sandbox.py tests/core/test_sandbox_network.py -q
  • uv run --group dev ruff check src/blaxel/core/sandbox/transient_retry.py src/blaxel/core/sandbox/default/network.py src/blaxel/core/sandbox/sync/network.py src/blaxel/core/sandbox/default/drive.py src/blaxel/core/sandbox/sync/drive.py src/blaxel/core/sandbox/default/system.py src/blaxel/core/sandbox/sync/system.py tests/core/test_sandbox_transient_retry.py
  • uv run python -m compileall -q src/blaxel/core/sandbox tests/core/test_sandbox_transient_retry.py
  • git diff --check

Note

Extends the existing bounded retry policy to cover transient gateway HTTP responses (425, 429, 502, 503, 504) during sandbox resume. Adds retry_on_transient_sandbox_read (sync and async) that retries on both transport-level errors and response-level gateway statuses. Applied to idempotent reads: network fetch (GET/HEAD/OPTIONS), drive list, and system health. Mutating operations are intentionally excluded.
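A bounded loop that retries on both transport errors and transient response statuses might be sketched as follows. Everything here is an assumption for illustration: the name `retry_read`, the default budget of 3, the exception types caught, and the `status_code` attribute. The backoff bounds mirror the 0.2 s to 2.0 s range mentioned in the flow diagram in this thread.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

TRANSIENT_STATUSES = frozenset({425, 429, 502, 503, 504})


def retry_read(
    call: Callable[[], T],
    retries: int = 3,  # assumed default budget, not taken from the SDK
    base_delay: float = 0.2,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(); retry transport resets and transient statuses up to `retries` times."""
    delay = base_delay
    for attempt in range(retries + 1):
        last_attempt = attempt == retries
        try:
            response = call()
        except (ConnectionResetError, ConnectionAbortedError):
            # Transport-level failure: re-raise once the budget is spent.
            if last_attempt:
                raise
        else:
            status = getattr(response, "status_code", None)
            # Non-transient responses, or the final attempt, go back to the caller.
            if status not in TRANSIENT_STATUSES or last_attempt:
                return response
        sleep(delay)
        delay = min(delay * 2, max_delay)
    raise AssertionError("unreachable: the last attempt always returns or raises")
```

Passing `retries=0` makes the loop run the call exactly once, which is how a zero budget disables retries in this sketch.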

Written by Mendral for commit 814468b.

@mendral-app mendral-app Bot left a comment

Contributor


LGTM

Clean, well-scoped change. The retry logic correctly separates transport errors from response-level retries, the idempotent-only guard on network fetch is sound, and the test coverage is thorough. No correctness or security issues found.



mendral-app Bot commented Jun 17, 2026

Contributor

🧪 Testing Guide

What this PR addresses

When a standby sandbox resumes, there's a brief window where the gateway returns transient HTTP errors (425, 429, 502, 503, 504). Previously, the SDK's retry logic only handled transport-level resets (connection drops). This PR extends the retry mechanism to also handle these transient gateway HTTP status codes for idempotent read operations (GET, HEAD, OPTIONS), while intentionally leaving mutating calls (POST, etc.) untouched to avoid replaying side effects.

Affected paths: sandbox.fetch() (async + sync), drive list() (async + sync), and system health() (async + sync).

Steps to reproduce the original issue

  1. Have a sandbox in standby/paused state.
  2. Trigger a resume (e.g., by calling sandbox.fetch(port, "/health") or sandbox.system.health()).
  3. During the brief resume window, the gateway responds with HTTP 502/503/504.
  4. Before this PR: The SDK surfaces the error immediately without retrying, causing the caller to fail.

What to verify (expected behavior)

  1. Unit tests pass: Run pytest tests/core/test_sandbox_transient_retry.py — all new and existing tests should pass. Key new tests:

    • test_read_response_classifier_accepts_resume_gateway_statuses — verifies 502, 503 (including HTTPStatus enum) are classified as transient.
    • test_read_response_classifier_rejects_application_statuses — verifies 500, 404 are NOT retried.
    • test_async_sandbox_read_retry_recovers_from_resume_status — async retry on 503 then succeeds on 200.
    • test_sync_sandbox_read_retry_recovers_from_resume_status — sync retry on 502 then succeeds on 200.
    • test_async_network_fetch_retries_resume_gateway_status — integration-level async fetch retry.
    • test_async_network_fetch_does_not_retry_post_status — POST requests are not retried (503 returned as-is).
    • test_sync_network_fetch_retries_resume_gateway_status — integration-level sync fetch retry.
  2. Mutating calls are not retried: Verify that sandbox.fetch(port, "/path", method="POST") does NOT retry on 502/503 — it should return the error response immediately.

  3. Existing transport-reset retries still work: Existing tests like test_async_filesystem_read_retries_transport_reset and test_sync_filesystem_read_retries_transport_reset should continue to pass.

  4. No regression in drive/system APIs: drive.list() and system.health() should behave identically for successful calls (now using *_detailed API variants internally, but the public return types are unchanged).

  5. BL_SANDBOX_READ_RETRIES env var controls retry budget: Setting this to 0 should disable retries entirely.
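For item 5, one way to read the budget from the environment is sketched below. The variable name `BL_SANDBOX_READ_RETRIES` comes from this guide; the function name, the default of 3, and the handling of negative or non-numeric values are assumptions, so check the SDK settings module for the actual behavior.

```python
import os
from typing import Mapping, Optional

DEFAULT_READ_RETRIES = 3  # assumed default, not taken from the SDK


def read_retry_budget(env: Optional[Mapping[str, str]] = None) -> int:
    """Parse the retry budget; 0 means a single attempt with no retries."""
    source = os.environ if env is None else env
    raw = source.get("BL_SANDBOX_READ_RETRIES")
    if raw is None:
        return DEFAULT_READ_RETRIES
    try:
        # Clamp negatives to 0 so a bad value can only reduce retries.
        return max(0, int(raw))
    except ValueError:
        # Non-numeric input falls back to the assumed default.
        return DEFAULT_READ_RETRIES
```

A quick manual check is to export `BL_SANDBOX_READ_RETRIES=0`, point a read at a gateway that returns 503, and confirm only one request is made.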

Note

Posted by PR Testing Guide · Tag @mendral-app with feedback.


mendral-app Bot commented Jun 17, 2026

Contributor

🔄 Interaction Flow: Transient Retry for Sandbox Reads

sequenceDiagram
    participant App as Application Code
    participant Drive as Drive.list() / System.health()
    participant Net as Network.fetch()
    participant Retry as retry_on_transient_sandbox_read()
    participant Client as API Client (detailed)
    participant Check as is_transient_sandbox_read_response()
    participant GW as Sandbox Gateway

    Note over App,GW: Idempotent read path (GET/HEAD/OPTIONS)

    App->>Drive: list() / health()
    Drive->>Retry: wrap inner operation (budget from settings)
    loop Retry loop (exponential backoff 0.2s→2.0s)
        Retry->>Client: call *_detailed() endpoint
        Client->>GW: HTTP request
        alt Sandbox resuming
            GW-->>Client: 502/503/504/425/429
            Client-->>Retry: Response with transient status
            Retry->>Check: is_transient_sandbox_read_response(resp)
            Check-->>Retry: True (transient)
            Note over Retry: sleep(backoff), decrement budget
        else Connection reset during resume
            GW--xClient: ConnectionReset / IncompleteRead
            Client--xRetry: Exception raised
            Retry->>Retry: is_transient_reset_error() → retry
        else Sandbox ready
            GW-->>Client: 200 OK
            Client-->>Retry: Response with success status
            Retry->>Check: is_transient_sandbox_read_response(resp)
            Check-->>Retry: False (non-transient)
        end
    end
    Retry-->>Drive: Final Response
    Drive->>Drive: Extract .parsed, validate
    Drive-->>App: Result

    Note over App,GW: Mutating path (POST/PUT/DELETE) — no retry

    App->>Net: fetch(method=POST, ...)
    Net->>Client: client.request() directly
    Client->>GW: HTTP request
    GW-->>Client: Response (any status)
    Client-->>Net: Response
    Net-->>App: Result (no replay of side effects)

Summary

This PR adds a bounded retry layer for idempotent sandbox reads to handle the transient gateway window during sandbox resume:

| Component | Role |
| --- | --- |
| transient_retry.py | New retry utilities + transient status classification ({425, 429, 502, 503, 504}) |
| Drive / System (async & sync) | Wrap read calls with retry; switch to *_detailed API imports to expose status codes |
| Network.fetch() | Gate retry on HTTP method; only GET/HEAD/OPTIONS are retried |
| Settings | sandbox_read_retries budget controls max attempts |

Key design choice: Mutating calls (POST, PUT, DELETE) are intentionally excluded from retry to avoid replaying side effects.
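The Drive and System rows can be illustrated with an async sketch: call a detailed endpoint, retry while the status is transient, then extract the parsed body. `DetailedResponse` and `list_with_retry` are hypothetical stand-ins, not the SDK's types, and transport-error handling is left out here to keep the example short.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Optional

TRANSIENT_STATUSES = frozenset({425, 429, 502, 503, 504})


@dataclass
class DetailedResponse:
    """Hypothetical stand-in for a *_detailed() result: status plus parsed body."""

    status_code: int
    parsed: Optional[Any] = None


async def list_with_retry(
    fetch_detailed: Callable[[], Awaitable[DetailedResponse]],
    retries: int = 3,  # assumed budget
    base_delay: float = 0.2,
    max_delay: float = 2.0,
) -> Any:
    """Retry transient statuses, then return the parsed body or raise."""
    delay = base_delay
    for attempt in range(retries + 1):
        response = await fetch_detailed()
        # Stop on a non-transient status or once the budget is spent.
        if response.status_code not in TRANSIENT_STATUSES or attempt == retries:
            break
        await asyncio.sleep(delay)
        delay = min(delay * 2, max_delay)
    if response.parsed is None:
        raise RuntimeError(f"sandbox read failed with status {response.status_code}")
    return response.parsed
```

Working with the detailed response is what makes the status code visible to the retry loop; the caller still receives only the parsed result, so the public return type stays the same.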

Note

Posted by PR Sequence Diagram · Tag @mendral-app with feedback.


mendral-app Bot commented Jun 17, 2026

Contributor

✅ Linked to Linear issue ENG-2972 — status set to In Progress

This PR directly addresses the intermittent sandbox/Agent Drive API failures after standby/resume tracked in ENG-2972. The PR description has been updated with Fixes ENG-2972 so the issue will auto-close when this PR merges.

Note

Posted by Linear Issue Enforcer · Tag @mendral-app with feedback.


Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Intermittent sandbox/Agent Drive API failures after standby/resume

1 participant