fix: Allow large image pulls to succeed by decoupling workflow deadline from lock TTL via heartbeat#234
Open
Jonathan Jamroga (jjamroga) wants to merge 1 commit into
Conversation
ActorWorkflow.ResumeActor and SuspendActor used to derive their workflow ctx from the Redis lock TTL via acquireActorLock(ctx, id, 30s, 2s) — the workflow deadline and the lock TTL were a single 28s knob. That meant image pulls / restores that legitimately need more than 28s death-looped forever, while raising the knob also raised how long peers wait to retry an actor after a crashed ateapi replica. Split the two concerns: - Lock TTL stays short (30s constant, internal). Bounds peer failover. - Workflow deadline is a separate operator-configurable knob via the new --actor-workflow-deadline pflag (default 5m). Bounds a single Resume/Suspend. - A heartbeat goroutine refreshes the lock every lockTTL/3 (~10s) for the full workflow duration. On RefreshLock=false or any Redis error (peer stole the lock, Redis blip), the workflow ctx is cancelled with errLostActorLock as the cause so in-flight steps unwind cleanly and the mutual-exclusion invariant is preserved. - The release function stops the heartbeat (waits for goroutine exit) before best-effort ReleaseLock. Adds store.Interface.RefreshLock with a Redis CAS Lua script mirroring the existing ReleaseLock script.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #233.
Customer Impact: Prior to #230, large images (specifically our test image) would eventually finish pulling in ~20 min, at the cost of RSS increasing in atelet i.e. worker successfully has actor scheduled to it at the cost of overall system stability. This wasn't an acceptable tradeoff. Orphaned image pulls resulted in atelet consuming almost 10 GiB of memory before being garbage collected.
Now, long running image pulls have a process of reacquiring a lock, preventing it from expiring before the image pull has completed. Introduced a high level, user configurable timeout ensuring that there is a window in which the lock can be lost and be lost in the event of image pull hanging (instead of a similar bug where a process is left hanging despite the caller having abandoned it.)
Summary
Decouples the per-actor workflow deadline from the Redis lock TTL via a heartbeat, so large image pulls can finish without raising the lock TTL (which also bounds peer failover after a crashed
ateapi).--actor-workflow-deadline(default 5m), independent of the lock TTL. Bounds a single Resume/Suspend.lockTTL/3(~10s) for the full workflow. OnRefreshLock=falseor any Redis error, cancels the workflow ctx witherrLostActorLockas the cause so in-flight steps unwind cleanly and the mutual-exclusion invariant is preserved.Before this PR,
acquireActorLock(ctx, id, 30*time.Second, 2*time.Second)made the workflow ctx expire atlockTTL − padding = 28s, conflating "how long can a workflow run" with "how long until a peer can retry a crashed actor" into a single knob. Raising it to fix large image pulls made peer failover unacceptably slow; lowering it to keep failover fast made image pulls death-loop forever. See the linked issue for the user-facing impact (STATUS_RESUMINGforever,PhaseResumeGoldenActorstuck,DeadlineExceededevery ~30s).Why a server-side deadline at all (not just heartbeat)
Considered keeping only the heartbeat and letting the workflow ctx inherit from the caller. Rejected on defense-in-depth grounds:
ateapi.The 5-minute server-side cap covers both cases at the cost of one config knob. Operators can raise it for slow registries; the lock TTL is independent so peer failover stays bounded at ≤30s regardless.
Test plan
Automated:
go test ./cmd/ateapi/internal/store/ateredis/... -run 'TestRefreshLock|TestAcquireLock'— 7 lock tests pass (3 new + 4 existing).go test ./cmd/ateapi/internal/controlapi/ -run TestAcquireActorLock— 4 new heartbeat tests pass:errLostActorLockcause.codes.Aborted.go test ./... -count=1— full repo suite green (419 tests / 65 packages).go vet ./...— clean.Manual:
Out of scope / follow-ups
internal/memorypullcache/memorypullcache.go:47-58. Should follow the on-disk / shared layer cache work viaategcs.ObjectStorage. The heartbeat shape introduced here does not preclude that redesign.ateapiserver + client sides would tighten the partition-detection story. Left as a separate change.