[Bug] v0.3.10 regression: Sandbox stuck when pod is deleted externally (annotation-based tracking breaks recovery) #611

@rafikt1992

What happened

In v0.3.10, if the pod managed by a Sandbox is deleted for any reason (cluster hibernation/wake,
node failure, OOM eviction, manual deletion), the Sandbox controller enters a permanent error loop
and never recreates the pod:

{"level":"error","msg":"Pod not found","error":"Pod \"my-agent\" not found",
 "stacktrace":"...SandboxReconciler.reconcilePod\n\t.../sandbox_controller.go:480"}

{"level":"error","msg":"Reconciler error",
 "error":"pod in annotation get failed: Pod \"my-agent\" not found"}

The Sandbox stays in this broken state indefinitely. The only recovery is to delete the Sandbox
object so the controller recreates it without the stale annotation.


Root cause

v0.3.10 introduced annotation-based pod tracking: after creating a pod, the controller writes
agents.x-k8s.io/pod-name: <name> onto the Sandbox. On every reconcile loop, reconcilePod
(line 480) reads this annotation and does a hard GET on the pod by name. If that GET returns 404,
reconcilePod returns an error immediately instead of falling through to pod creation:

if podNameAnnotationExists {
    log.Error(err, "Pod not found")
    return nil, fmt.Errorf("pod in annotation get failed: %w", err)
}
// Only reaches pod creation if annotation is absent

What worked in v0.2.1

v0.2.1 used label-selector based pod discovery (the selector field was exposed in Sandbox
status). If the pod was missing, the reconciler would simply create a new one. Pod deletion was
a recoverable condition.
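
To illustrate why selector-based discovery recovers on its own, here is a minimal sketch in plain Go with no Kubernetes client. It is not the v0.2.1 controller code: the Pod type, the reconcileBySelector function, and the "sandbox" label key are all hypothetical stand-ins. The point is only that "no matching pod" leads to creation rather than to an error.

```go
package main

import "fmt"

// Pod is a simplified, hypothetical stand-in for a Kubernetes pod.
type Pod struct {
	Name   string
	Labels map[string]string
}

// matches reports whether the pod carries every key/value pair in selector.
func matches(p Pod, selector map[string]string) bool {
	for k, want := range selector {
		if got, ok := p.Labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// reconcileBySelector returns the first pod matching the selector. If none
// matches, it creates one; a missing pod is never treated as an error.
// The returned bool reports whether a pod had to be created.
func reconcileBySelector(pods []Pod, name string, selector map[string]string) ([]Pod, Pod, bool) {
	for _, p := range pods {
		if matches(p, selector) {
			return pods, p, false
		}
	}
	labels := make(map[string]string, len(selector))
	for k, v := range selector {
		labels[k] = v
	}
	created := Pod{Name: name, Labels: labels}
	return append(pods, created), created, true
}

func main() {
	// The label key "sandbox" is an assumption made for this example.
	selector := map[string]string{"sandbox": "my-agent"}

	pods, pod, created := reconcileBySelector(nil, "my-agent", selector)
	fmt.Println("first reconcile:", pod.Name, "created:", created)

	pods = nil // simulate external deletion of the pod

	pods, pod, created = reconcileBySelector(pods, "my-agent", selector)
	fmt.Println("after deletion:", pod.Name, "created:", created, "pods:", len(pods))
}
```

Because the lookup is a list filtered by labels rather than a GET by a remembered name, there is no stored reference that can go stale.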


Steps to reproduce

  1. Create a Sandbox using v0.3.10
  2. Wait for the pod to reach Running
  3. kubectl delete pod <sandbox-name> -n <namespace>
  4. Observe: the controller logs "pod in annotation get failed" in a loop, and no new pod is created

Confirmed working correctly with v0.2.1.


Expected behavior

When the pod referenced by agents.x-k8s.io/pod-name is not found, the controller should clear
the stale annotation and create a new pod, treating a missing pod as a recoverable condition
rather than a terminal error.


Suggested fix

In reconcilePod, change the 404 branch to clear the annotation and fall through to pod creation:

if podNameAnnotationExists && apierrors.IsNotFound(err) {
    // Pod was deleted externally — clear stale annotation and recreate
    clearPodAnnotation(sandbox)
    // fall through to creation
} else if err != nil {
    return nil, fmt.Errorf("pod in annotation get failed: %w", err)
}
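
The intended control flow can be exercised end to end with a small stand-alone model. This is a sketch under stated assumptions, not the controller's code: Sandbox, cluster, and getPod are hypothetical in-memory stand-ins, and errNotFound takes the place of the apierrors.IsNotFound(err) check so the example needs only the standard library. Only the annotation key is taken from the issue.

```go
package main

import (
	"errors"
	"fmt"
)

// podNameAnnotation is the annotation key quoted in this issue.
const podNameAnnotation = "agents.x-k8s.io/pod-name"

// errNotFound stands in for a 404 from the API server. Real controller code
// would test apierrors.IsNotFound(err) instead.
var errNotFound = errors.New("not found")

// Sandbox is a hypothetical, simplified stand-in for the Sandbox object.
type Sandbox struct {
	Name        string
	Annotations map[string]string
}

// cluster is a hypothetical in-memory replacement for the API server.
type cluster struct{ pods map[string]bool }

// getPod returns an error wrapping errNotFound when the pod does not exist.
func (c *cluster) getPod(name string) error {
	if !c.pods[name] {
		return fmt.Errorf("pod %q: %w", name, errNotFound)
	}
	return nil
}

// reconcilePod ensures a pod exists for the sandbox and returns its name.
func reconcilePod(c *cluster, sb *Sandbox) (string, error) {
	if name, ok := sb.Annotations[podNameAnnotation]; ok {
		err := c.getPod(name)
		switch {
		case err == nil:
			return name, nil // tracked pod still exists
		case errors.Is(err, errNotFound):
			// Pod was deleted externally: clear the stale annotation
			// and fall through to creation.
			delete(sb.Annotations, podNameAnnotation)
		default:
			return "", fmt.Errorf("pod in annotation get failed: %w", err)
		}
	}
	// Creation path: reached when the annotation is absent or was just cleared.
	c.pods[sb.Name] = true
	if sb.Annotations == nil {
		sb.Annotations = map[string]string{}
	}
	sb.Annotations[podNameAnnotation] = sb.Name
	return sb.Name, nil
}

func main() {
	c := &cluster{pods: map[string]bool{}}
	sb := &Sandbox{Name: "my-agent"}

	if _, err := reconcilePod(c, sb); err != nil {
		panic(err)
	}
	delete(c.pods, "my-agent") // simulate kubectl delete pod

	name, err := reconcilePod(c, sb)
	fmt.Println(name, err)
}
```

In this model only a not-found result clears the annotation; any other GET failure is still returned, so transient API errors do not trigger a duplicate pod.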

Environment

Controller version: v0.3.10 (regression vs v0.2.1)
Kubernetes: v1.32 (Gardener shoot cluster)
Trigger scenarios: cluster hibernation/wake, manual kubectl delete pod, node failure
