Harden sandbox reads during resume#172
Conversation
There was a problem hiding this comment.
LGTM
Clean, well-scoped change. The retry logic correctly separates transport errors from response-level retries, the idempotent-only guard on network fetch is sound, and the test coverage is thorough. No correctness or security issues found.
Tag @mendral-app with feedback or questions. View session
🧪 Testing GuideWhat this PR addressesWhen a standby sandbox resumes, there's a brief window where the gateway returns transient HTTP errors (425, 429, 502, 503, 504). Previously, the SDK's retry logic only handled transport-level resets (connection drops). This PR extends the retry mechanism to also handle these transient gateway HTTP status codes for idempotent read operations (GET, HEAD, OPTIONS), while intentionally leaving mutating calls (POST, etc.) untouched to avoid replaying side effects. Affected paths: Steps to reproduce the original issue
What to verify (expected behavior)
Note Posted by PR Testing Guide · Tag @mendral-app with feedback. |
🔄 Interaction Flow: Transient Retry for Sandbox ReadssequenceDiagram
participant App as Application Code
participant Drive as Drive.list() / System.health()
participant Net as Network.fetch()
participant Retry as retry_on_transient_sandbox_read()
participant Client as API Client (detailed)
participant Check as is_transient_sandbox_read_response()
participant GW as Sandbox Gateway
Note over App,GW: Idempotent read path (GET/HEAD/OPTIONS)
App->>Drive: list() / health()
Drive->>Retry: wrap inner operation (budget from settings)
loop Retry loop (exponential backoff 0.2s→2.0s)
Retry->>Client: call *_detailed() endpoint
Client->>GW: HTTP request
alt Sandbox resuming
GW-->>Client: 502/503/504/425/429
Client-->>Retry: Response with transient status
Retry->>Check: is_transient_sandbox_read_response(resp)
Check-->>Retry: True (transient)
Note over Retry: sleep(backoff), decrement budget
else Connection reset during resume
GW--xClient: ConnectionReset / IncompleteRead
Client--xRetry: Exception raised
Retry->>Retry: is_transient_reset_error() → retry
else Sandbox ready
GW-->>Client: 200 OK
Client-->>Retry: Response with success status
Retry->>Check: is_transient_sandbox_read_response(resp)
Check-->>Retry: False (non-transient)
end
end
Retry-->>Drive: Final Response
Drive->>Drive: Extract .parsed, validate
Drive-->>App: Result
Note over App,GW: Mutating path (POST/PUT/DELETE) — no retry
App->>Net: fetch(method=POST, ...)
Net->>Client: client.request() directly
Client->>GW: HTTP request
GW-->>Client: Response (any status)
Client-->>Net: Response
Net-->>App: Result (no replay of side effects)
SummaryThis PR adds a bounded retry layer for idempotent sandbox reads to handle the transient gateway window during sandbox resume:
Key design choice: Mutating calls (POST, PUT, DELETE) are intentionally excluded from retry to avoid replaying side effects. Note Posted by PR Sequence Diagram · Tag @mendral-app with feedback. |
|
✅ Linked to Linear issue ENG-2972 — status set to In Progress This PR directly addresses the intermittent sandbox/Agent Drive API failures after standby/resume tracked in ENG-2972. The PR description has been updated with Note Posted by Linear Issue Enforcer · Tag @mendral-app with feedback. |
Fixes ENG-2972
Summary
Fixes #142.
This makes idempotent sandbox reads survive the short gateway window that can happen while a standby sandbox resumes. The existing retry helper already covered transport resets. This extends the same bounded policy to transient gateway responses.
Covered paths:
sandbox.fetch()for GET, HEAD, and OPTIONSsandbox.fetch()for the same read methodslist()health()Mutating calls such as POST fetches, drive mount, drive unmount, and system upgrade are left unchanged so the SDK does not replay side effects.
Notes
The retry classifier is intentionally narrow. It only treats 425, 429, 502, 503, and 504 as retryable sandbox read responses. Other HTTP responses still return or fail as before.
Tests
uv run --group test pytest tests/core/test_sandbox_transient_retry.py -quv run --group test pytest tests/core/test_sandbox_transient_retry.py tests/core/test_sandbox.py tests/core/test_sandbox_network.py -quv run --group dev ruff check src/blaxel/core/sandbox/transient_retry.py src/blaxel/core/sandbox/default/network.py src/blaxel/core/sandbox/sync/network.py src/blaxel/core/sandbox/default/drive.py src/blaxel/core/sandbox/sync/drive.py src/blaxel/core/sandbox/default/system.py src/blaxel/core/sandbox/sync/system.py tests/core/test_sandbox_transient_retry.pyuv run python -m compileall -q src/blaxel/core/sandbox tests/core/test_sandbox_transient_retry.pygit diff --checkNote
Extends the existing bounded retry policy to cover transient gateway HTTP responses (425, 429, 502, 503, 504) during sandbox resume. Adds
retry_on_transient_sandbox_read(sync and async) that retries on both transport-level errors and response-level gateway statuses. Applied to idempotent reads: network fetch (GET/HEAD/OPTIONS), drive list, and system health. Mutating operations are intentionally excluded.Written by Mendral for commit 814468b.