fix: tolerate transient gateway 5xx lane-wide; de-mask live CI #134

Merged
martinkersner merged 1 commit into main from fix/live-lane-transient-5xx-boundary on Jul 2, 2026

@martinkersner

Summary

The Live keyed tests lane failed again on a transient prod 504 — this run it hit test_cex_candle (ServerError: (504, 'upstream request timeout')). #133 added transient-5xx tolerance but wired it only into TestPremium, so the same prod-wide gateway flakiness resurfaced on the next unwrapped call. Worse, the run reported green: job-level continue-on-error masked the failed job, so it slipped by silently.

This moves tolerance to the HTTP boundary (can't be whack-a-mole'd again) and makes lane failures visible without gating the push.

Changes

  • tests/conftest.py — replace the per-call live_call() helper with an autouse fixture _tolerate_transient_gateway that monkeypatches API.send_request (the single method every endpoint object inherits) to retry transient 502/503/504 with linear backoff, then pytest.skip. Covers test_call.py, all of test_integration.py, and any future live test. Gated on API_KEY → no-op for keyless/mocked lanes.
  • tests/test_integration.py — revert the 19 live_call(lambda: …) premium wraps added in #133 ("test: tolerate transient gateway 504 on live premium lane") to direct calls; drop the now-dead import. No double-wrapping.
  • .github/workflows/live-tests.yml — move continue-on-error from the job to the pytest step so the job conclusion stays honest (setup/dependency failures still go red) while the push stays non-blocking; add a step that emits a ::warning annotation + run-summary line when the live suite fails, so a red live lane is visible instead of a green run hiding it.
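
For readers who want to see the retry-then-skip behavior in isolation, here is a minimal sketch of the wrapping logic, not the code merged in this PR. The `ServerError` class is a stand-in that assumes the status code sits in `args[0]` (consistent with the `(504, 'upstream request timeout')` error above). The `tolerate_transient` name, the retry count, and the delay values are all hypothetical.

```python
import time

# Gateway statuses treated as transient, per the PR description.
TRANSIENT_STATUSES = {502, 503, 504}


class ServerError(Exception):
    """Stand-in for the client's error type (assumption: args == (status, message))."""


def tolerate_transient(send, skip, retries=3, base_delay=1.0, sleep=time.sleep):
    """Wrap `send` so transient gateway errors are retried, then passed to `skip`.

    `skip` is whatever should happen once retries are exhausted; in a pytest
    fixture this would be `pytest.skip`, which raises and ends the test as skipped.
    Assumes retries >= 1.
    """

    def wrapper(*args, **kwargs):
        last = None
        for attempt in range(1, retries + 1):
            try:
                return send(*args, **kwargs)
            except ServerError as exc:
                status = exc.args[0] if exc.args else None
                if status not in TRANSIENT_STATUSES:
                    raise  # non-transient errors still fail the test
                last = exc
                if attempt < retries:
                    # Linear backoff: base_delay, 2 * base_delay, ...
                    sleep(base_delay * attempt)
        # Every attempt hit a transient gateway error.
        return skip(f"transient gateway error after {retries} attempts: {last}")

    return wrapper
```

In a conftest.py, an autouse fixture could apply this with something like `monkeypatch.setattr(API, "send_request", tolerate_transient(API.send_request, skip=pytest.skip))`, returning early when `API_KEY` is unset so keyless and mocked lanes are untouched. Patching the one inherited method means the tests need no per-call wrapper.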

Tests

  • Keyless offline suite unchanged: 134 passed, 11 skipped (pytest -m "not integration").
  • black --check + flake8 clean on touched files; workflow YAML parses.

Known limitations

Live keyed lane went red again on a transient prod 504 — this time
test_cex_candle (ServerError 504 "upstream request timeout"). #133's
per-endpoint live_call wrapping only covered TestPremium, so the same
infra flakiness resurfaced on the next unwrapped call. The run still
showed green: job-level continue-on-error masked the failed job.

- conftest: replace per-call live_call helper with autouse fixture
  _tolerate_transient_gateway — monkeypatches API.send_request (the one
  method every endpoint inherits) to retry transient 502/503/504 then
  pytest.skip. Covers test_call.py + test_integration.py + future live
  tests. No-op without a key (keyless/mocked lanes untouched).
- test_integration: revert the 19 live_call(lambda: ...) premium wraps
  to direct calls; drop now-dead import.
- live-tests.yml: move continue-on-error from job to pytest step so the
  job conclusion stays honest (setup failures still red) while the push
  stays non-blocking; add a step that annotates + writes a run-summary
  warning when the live suite fails, so failures are visible not silent.

keyless offline suite unchanged: 134 passed, 11 skipped.
@martinkersner martinkersner merged commit cd26b87 into main Jul 2, 2026
5 checks passed
@martinkersner martinkersner deleted the fix/live-lane-transient-5xx-boundary branch July 2, 2026 08:16