Skip to content

fix: Allow large image pulls to succeed by decoupling workflow deadline from lock TTL via heartbeat#234

Open
Jonathan Jamroga (jjamroga) wants to merge 1 commit into
agent-substrate:mainfrom
jjamroga:jjamroga/decouple-workflow-deadline
Open

fix: Allow large image pulls to succeed by decoupling workflow deadline from lock TTL via heartbeat#234
Jonathan Jamroga (jjamroga) wants to merge 1 commit into
agent-substrate:mainfrom
jjamroga:jjamroga/decouple-workflow-deadline

Conversation

@jjamroga

@jjamroga Jonathan Jamroga (jjamroga) commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Fixes #233.

Customer Impact: Prior to #230, large images (specifically our test image) would eventually finish pulling in ~20 min, at the cost of RSS increasing in atelet i.e. worker successfully has actor scheduled to it at the cost of overall system stability. This wasn't an acceptable tradeoff. Orphaned image pulls resulted in atelet consuming almost 10 GiB of memory before being garbage collected.

Now, long running image pulls have a process of reacquiring a lock, preventing it from expiring before the image pull has completed. Introduced a high level, user configurable timeout ensuring that there is a window in which the lock can be lost and be lost in the event of image pull hanging (instead of a similar bug where a process is left hanging despite the caller having abandoned it.)

Summary

Decouples the per-actor workflow deadline from the Redis lock TTL via a heartbeat, so large image pulls can finish without raising the lock TTL (which also bounds peer failover after a crashed ateapi).

  • Lock TTL stays at 30s (internal constant). Bounds how long a peer waits to retry an actor after this process crashes.
  • Workflow deadline is now --actor-workflow-deadline (default 5m), independent of the lock TTL. Bounds a single Resume/Suspend.
  • Heartbeat goroutine refreshes the lock every lockTTL/3 (~10s) for the full workflow. On RefreshLock=false or any Redis error, cancels the workflow ctx with errLostActorLock as the cause so in-flight steps unwind cleanly and the mutual-exclusion invariant is preserved.

Before this PR, acquireActorLock(ctx, id, 30*time.Second, 2*time.Second) made the workflow ctx expire at lockTTL − padding = 28s, conflating "how long can a workflow run" with "how long until a peer can retry a crashed actor" into a single knob. Raising it to fix large image pulls made peer failover unacceptably slow; lowering it to keep failover fast made image pulls death-loop forever. See the linked issue for the user-facing impact (STATUS_RESUMING forever, PhaseResumeGoldenActor stuck, DeadlineExceeded every ~30s).

Why a server-side deadline at all (not just heartbeat)

Considered keeping only the heartbeat and letting the workflow ctx inherit from the caller. Rejected on defense-in-depth grounds:

  • Substrate does not configure gRPC keepalive, so a network partition or a hung-but-not-killed caller would leave the workflow running indefinitely on ateapi.
  • A workflow step that forgets to propagate ctx would have no upstream timeout to fall back on.

The 5-minute server-side cap covers both cases at the cost of one config knob. Operators can raise it for slow registries; the lock TTL is independent so peer failover stays bounded at ≤30s regardless.

Test plan

Automated:

  • go test ./cmd/ateapi/internal/store/ateredis/... -run 'TestRefreshLock|TestAcquireLock' — 7 lock tests pass (3 new + 4 existing).
  • go test ./cmd/ateapi/internal/controlapi/ -run TestAcquireActorLock — 4 new heartbeat tests pass:
    • Heartbeat keeps lock alive past TTL.
    • Lost lock cancels workflow ctx with errLostActorLock cause.
    • Release callback removes the lock.
    • Second concurrent acquire returns codes.Aborted.
  • go test ./... -count=1 — full repo suite green (419 tests / 65 packages).
  • go vet ./... — clean.

Manual:

# Setup cluster
./hack/create-kind-cluster.sh && ./hack/install-ate-kind.sh --deploy-ate-system
export KO_DOCKER_REPO=localhost:5001
export KO_DEFAULTPLATFORMS=linux/$(go env GOARCH)
./hack/run-tool.sh ko build -B ./cmd/ateom-gvisor

# Install metrics server to keep tabs on memory
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/high-availability-1.21+.yaml
# Tweak values for cluster
kubectl patch -n kube-system deploy metrics-server --type=json -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
kubectl scale -n kube-system deploy/metrics-server --replicas=1
kubectl rollout status deploy/metrics-server -n kube-system

# Watch memory consumption
while true; do date; kubectl top pod -n ate-system -l app=atelet --containers >> bin/output-fix.log; sleep 5; done

# Check atelet, image pull starts and eventually completes in a reasonable amount of time
kubectl logs daemonset/atelet -n ate-system --follow
{"time":"2026-06-12T17:33:16.609666504Z","level":"INFO","msg":"Using S3 storage backend"}
{"time":"2026-06-12T17:33:16.61079017Z","level":"INFO","msg":"Starting Prometheus metrics server on :9090"}
{"time":"2026-06-12T17:33:16.626096171Z","level":"INFO","msg":"WorkersManagerService listening","address":{"IP":"::","Port":8085,"Zone":""}}
{"time":"2026-06-12T17:36:40.508440418Z","level":"INFO","msg":"Ref includes digest, checking for cache hit","ref":"ghcr.io/kagent-dev/nemoclaw/sandbox-base@sha256:d52bee415dc4c0dba7164f9eabe727574c056d4f211781f20af249707883a3b4","digest":"sha256:d52bee415dc4c0dba7164f9eabe727574c056d4f211781f20af249707883a3b4","ate.dev/trace-id":"f2bd930bd82771821f8d5fcabcb895a0"}
{"time":"2026-06-12T17:36:40.508567918Z","level":"INFO","msg":"Ref includes digest, checking for cache hit","ref":"registry.k8s.io/pause:3.10.2@sha256:f548e0e8e3dc1896ca956272154dde3314e8cc4fde0a57577ee9fa1c63f5baf4","digest":"sha256:f548e0e8e3dc1896ca956272154dde3314e8cc4fde0a57577ee9fa1c63f5baf4","ate.dev/trace-id":"f2bd930bd82771821f8d5fcabcb895a0"}
{"time":"2026-06-12T17:36:40.508644543Z","level":"INFO","msg":"Cache miss","ref":"registry.k8s.io/pause:3.10.2@sha256:f548e0e8e3dc1896ca956272154dde3314e8cc4fde0a57577ee9fa1c63f5baf4","ate.dev/trace-id":"f2bd930bd82771821f8d5fcabcb895a0"}
{"time":"2026-06-12T17:36:40.508595126Z","level":"INFO","msg":"Cache miss","ref":"ghcr.io/kagent-dev/nemoclaw/sandbox-base@sha256:d52bee415dc4c0dba7164f9eabe727574c056d4f211781f20af249707883a3b4","ate.dev/trace-id":"f2bd930bd82771821f8d5fcabcb895a0"}
{"time":"2026-06-12T17:38:35.082124179Z","level":"INFO","msg":"Handle RPC","method":"/atelet.AteomHerder/Run","req":{"target_ateom_uid":"739d5e60-905f-4681-bac4-171190ab7cba","actor_template_namespace":"ate-repro","actor_template_name":"repro-fat-image","actor_id":"69bcf578-73af-48da-8cb8-c73f4b09622b","runsc":{"amd64":{"sha256_hash":"a397be1abc2420d26bce6c70e6e2ff96c73aaaab929756c56f5e2089ea842b63","url":"gs://gvisor/releases/nightly/2026-05-19/x86_64/runsc"},"arm64":{"sha256_hash":"1ba2366ae2efceba166046f51a4104f9261c9cb72c6db8f5b3fe2dc57dea86b9","url":"gs://gvisor/releases/nightly/2026-05-19/aarch64/runsc"}},"spec":{"containers":[{"name":"fat","image":"ghcr.io/kagent-dev/nemoclaw/sandbox-base@sha256:d52bee415dc4c0dba7164f9eabe727574c056d4f211781f20af249707883a3b4","command":["/bin/sh","-c","sleep 3600"]}],"pause_image":"registry.k8s.io/pause:3.10.2@sha256:f548e0e8e3dc1896ca956272154dde3314e8cc4fde0a57577ee9fa1c63f5baf4"}},"resp":{},"err":null,"elapsed-time":"2m0.322655054s","ate.dev/trace-id":"f2bd930bd82771821f8d5fcabcb895a0"}
{"time":"2026-06-12T17:38:55.689723883Z","level":"INFO","msg":"Handle RPC","method":"/atelet.AteomHerder/Checkpoint","req":{"target_ateom_uid":"739d5e60-905f-4681-bac4-171190ab7cba","actor_template_namespace":"ate-repro","actor_template_name":"repro-fat-image","actor_id":"69bcf578-73af-48da-8cb8-c73f4b09622b","runsc":{"amd64":{"sha256_hash":"a397be1abc2420d26bce6c70e6e2ff96c73aaaab929756c56f5e2089ea842b63","url":"gs://gvisor/releases/nightly/2026-05-19/x86_64/runsc"},"arm64":{"sha256_hash":"1ba2366ae2efceba166046f51a4104f9261c9cb72c6db8f5b3fe2dc57dea86b9","url":"gs://gvisor/releases/nightly/2026-05-19/aarch64/runsc"}},"spec":{"containers":[{"name":"fat","image":"ghcr.io/kagent-dev/nemoclaw/sandbox-base@sha256:d52bee415dc4c0dba7164f9eabe727574c056d4f211781f20af249707883a3b4","command":["/bin/sh","-c","sleep 3600"]}],"pause_image":"registry.k8s.io/pause:3.10.2@sha256:f548e0e8e3dc1896ca956272154dde3314e8cc4fde0a57577ee9fa1c63f5baf4"},"snapshot_uri_prefix":"gs://ate-snapshots/repro/69bcf578-73af-48da-8cb8-c73f4b09622b/2026-06-12T17:38:55Z-XHXXXIU65H5YZCKRHLGC3XH6F4"},"resp":{},"err":null,"elapsed-time":"685.691709ms","ate.dev/trace-id":"41774e0b1b60502ad62d6351bafd4a1b"}

Out of scope / follow-ups

  • Moving image pulls off the workflow path entirely (background pull worker + status-polling RPC) is the long-term fix per internal/memorypullcache/memorypullcache.go:47-58. Should follow the on-disk / shared layer cache work via ategcs.ObjectStorage. The heartbeat shape introduced here does not preclude that redesign.
  • gRPC keepalive on ateapi server + client sides would tighten the partition-detection story. Left as a separate change.

ActorWorkflow.ResumeActor and SuspendActor used to derive their workflow
ctx from the Redis lock TTL via acquireActorLock(ctx, id, 30s, 2s) — the
workflow deadline and the lock TTL were a single 28s knob. That meant
image pulls / restores that legitimately need more than 28s death-looped
forever, while raising the knob also raised how long peers wait to retry
an actor after a crashed ateapi replica.

Split the two concerns:

- Lock TTL stays short (30s constant, internal). Bounds peer failover.
- Workflow deadline is a separate operator-configurable knob via the
  new --actor-workflow-deadline pflag (default 5m). Bounds a single
  Resume/Suspend.
- A heartbeat goroutine refreshes the lock every lockTTL/3 (~10s) for
  the full workflow duration. On RefreshLock=false or any Redis error
  (peer stole the lock, Redis blip), the workflow ctx is cancelled with
  errLostActorLock as the cause so in-flight steps unwind cleanly and
  the mutual-exclusion invariant is preserved.
- The release function stops the heartbeat (waits for goroutine exit)
  before best-effort ReleaseLock.

Adds store.Interface.RefreshLock with a Redis CAS Lua script mirroring
the existing ReleaseLock script.
@jjamroga Jonathan Jamroga (jjamroga) marked this pull request as ready for review June 12, 2026 18:14
@jjamroga Jonathan Jamroga (jjamroga) changed the title Decouple actor lock TTL from workflow deadline via heartbeat fix: Pull large images by decoupling actor lock TTL from workflow deadline via heartbeat Jun 12, 2026
@jjamroga Jonathan Jamroga (jjamroga) changed the title fix: Pull large images by decoupling actor lock TTL from workflow deadline via heartbeat fix: Allow large image pulls to succeed by decoupling workflow deadline from lock TTL via heartbeat Jun 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow large image pulls to succeed by decoupling workflow deadline from lock TTL

1 participant