[Bug] k8s_helper.py ignores failure conditions and hangs for the full 180-second timeout #574

@tomergee

Description

client/k8s_helper.py watches the SandboxClaim resource until it resolves the underlying sandbox name, but it ignores any failure conditions reported by the controller. As a result, it hangs for the full 180-second timeout even when the claim has already failed (e.g., with a ReconcilerError due to missing warm pods when defaulting to cold pods).

When running a large load test, the SandboxClaim controller raised ReconcilerError on several claims. The client did not detect these failures, which caused multiple 180-second hangs.

The client should fail fast and retry:
resolve_sandbox_name should actively check the claim's Ready condition. If it is False (with reason ReconcilerError or Failed), the method should raise an exception immediately, so the application can fail fast and trigger the new retry loop instead of staying stuck in "Created" status.
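
A rough sketch of the check described above, not the actual k8s_helper.py code. It assumes the claim arrives as a plain dict with the usual Kubernetes `status.conditions` list; `check_claim_ready`, `SandboxClaimFailedError`, and `FAILURE_REASONS` are hypothetical names. It also assumes that a Ready=False condition with any other reason means the claim is still pending, which may not match the controller's behavior.

```python
from typing import Any, Mapping

# Reasons treated as terminal failures (taken from the issue text; the set
# name itself is hypothetical).
FAILURE_REASONS = {"ReconcilerError", "Failed"}


class SandboxClaimFailedError(RuntimeError):
    """Hypothetical exception raised when a claim reports a terminal failure."""


def check_claim_ready(claim: Mapping[str, Any]) -> bool:
    """Return True if the claim is Ready, False if it is still pending.

    Raises SandboxClaimFailedError when Ready is "False" with a failure
    reason, so the caller can stop watching instead of waiting for the
    timeout.
    """
    conditions = (claim.get("status") or {}).get("conditions") or []
    for cond in conditions:
        if cond.get("type") != "Ready":
            continue
        if cond.get("status") == "True":
            return True
        if cond.get("status") == "False" and cond.get("reason") in FAILURE_REASONS:
            raise SandboxClaimFailedError(
                f"SandboxClaim failed: {cond.get('reason')}: {cond.get('message', '')}"
            )
    # No Ready condition yet, or Ready=False for a non-terminal reason
    # (assumption: treat this as "keep waiting").
    return False
```

resolve_sandbox_name could call a helper like this on every watch event and let the exception propagate to the retry loop.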

/bug
