What happened
In v0.3.10, if the pod managed by a Sandbox is deleted for any reason (cluster hibernation/wake,
node failure, OOM eviction, manual deletion), the Sandbox controller enters a permanent error loop
and never recreates the pod:
```
{"level":"error","msg":"Pod not found","error":"Pod \"my-agent\" not found",
 "stacktrace":"...SandboxReconciler.reconcilePod\n\t.../sandbox_controller.go:480"}
{"level":"error","msg":"Reconciler error",
 "error":"pod in annotation get failed: Pod \"my-agent\" not found"}
```
The Sandbox stays in this broken state indefinitely. The only recovery is to delete the Sandbox
object so the controller recreates it without the stale annotation.
Root cause
v0.3.10 introduced annotation-based pod tracking: after creating a pod, the controller writes
agents.x-k8s.io/pod-name: <name> onto the Sandbox. On every reconcile loop, reconcilePod
(line 480) reads this annotation and does a hard GET on the pod by name. If the pod returns 404,
it returns an error immediately instead of falling through to pod creation:
```go
if podNameAnnotationExists {
	log.Error(err, "Pod not found")
	return nil, fmt.Errorf("pod in annotation get failed: %w", err)
}
// Only reaches pod creation if the annotation is absent
```
What worked in v0.2.1
v0.2.1 used label-selector based pod discovery (the selector field was exposed in Sandbox
status). If the pod was missing, the reconciler would simply create a new one. Pod deletion was
a recoverable condition.
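The v0.2.1 flow can be sketched as a stdlib-only model in which pods carry label maps and the reconciler lists by selector instead of GETting by name; a miss means create, never error. The function names and the label key are illustrative assumptions, and the real controller lists pods via the client-go API using the selector from Sandbox status.

```go
package main

import "fmt"

// matches reports whether a pod's labels satisfy every key/value pair in
// the selector, mirroring an equality-based Kubernetes label selector.
func matches(labels, selector map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// findOrCreate models the v0.2.1 flow: discover the pod by label selector,
// and if nothing matches, fall through to creation. A missing pod is a
// recoverable condition, not an error.
func findOrCreate(pods []map[string]string, selector map[string]string) string {
	for _, labels := range pods {
		if matches(labels, selector) {
			return "adopted existing pod"
		}
	}
	return "created new pod"
}

func main() {
	sel := map[string]string{"sandbox": "my-agent"} // illustrative label key
	fmt.Println(findOrCreate(nil, sel))                      // pod deleted: recreate
	fmt.Println(findOrCreate([]map[string]string{sel}, sel)) // pod present: reuse
}
```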
Steps to reproduce
- Create a Sandbox using v0.3.10
- Wait for the pod to reach Running
- Run `kubectl delete pod <sandbox-name> -n <namespace>`
- Observe: the controller logs `pod in annotation get failed` in a loop and no new pod is created

Confirmed working correctly with v0.2.1.
Expected behavior
When the pod referenced by agents.x-k8s.io/pod-name is not found, the controller should clear
the stale annotation and create a new pod — treating a missing pod as a recoverable condition,
not a terminal error.
Suggested fix
In reconcilePod, change the 404 branch to clear the annotation and fall through to pod creation:
```go
if podNameAnnotationExists && apierrors.IsNotFound(err) {
	// Pod was deleted externally — clear stale annotation and recreate
	clearPodAnnotation(sandbox)
	// fall through to creation
} else if err != nil {
	return nil, fmt.Errorf("pod in annotation get failed: %w", err)
}
```
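As a sanity check, here is a stdlib-only model of the fixed control flow. The names are hypothetical and `errNotFound` stands in for `apierrors.IsNotFound`: a 404 clears the stale annotation and falls through to creation, while any other GET error still aborts the reconcile.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the API server's 404 (apierrors.IsNotFound).
var errNotFound = errors.New("not found")

// getPod models the GET on the annotated pod name.
func getPod(name string, exists bool) error {
	if !exists {
		return fmt.Errorf("Pod %q %w", name, errNotFound)
	}
	return nil
}

// reconcileFixed models the suggested fix: a 404 falls through to pod
// creation; only other GET errors remain terminal.
func reconcileFixed(annotated, podExists bool) (string, error) {
	if annotated {
		err := getPod("my-agent", podExists)
		switch {
		case err == nil:
			return "found existing pod", nil
		case errors.Is(err, errNotFound):
			// Pod deleted externally: clear the stale annotation and fall
			// through to creation (clearPodAnnotation(sandbox) in the fix).
		default:
			return "", fmt.Errorf("pod in annotation get failed: %w", err)
		}
	}
	return "created pod", nil
}

func main() {
	out, err := reconcileFixed(true, false)
	fmt.Println(out, err) // the externally deleted pod is recreated
}
```

With this shape, pod deletion behaves as it did in v0.2.1: the next reconcile recreates the pod instead of looping on the error.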
Environment
| | |
| --- | --- |
| Controller version | v0.3.10 (regression vs v0.2.1) |
| Kubernetes | v1.32 (Gardener shoot cluster) |
| Trigger scenarios | cluster hibernation/wake, manual `kubectl delete pod`, node failure |