fix(longhorn): stop instance-manager OOM by exempting longhorn-system from generated LimitRange #2180
Open
devantler wants to merge 1 commit into
The add-ns-quota Kyverno policy generates a LimitRange into every namespace that stamps a default memory limit (512Mi) onto any container that omits one. longhorn-manager creates instance-manager pods without a memory limit by design; the injected 512Mi cap OOM-kills them during volume rebuilds (idle RSS already sits at ~70% of the cap).

When an instance-manager dies, longhorn-manager deletes+recreates the pod, which faults every replica engine on that node (DetachedUnexpectedly) and cascades into CNPG primary failover and Postgres timeline divergence across coroot-db / umami-db / wedding-db, wedging the infrastructure and apps Flux Kustomizations cluster-wide (observed 2026-06-20).

Exempt longhorn-system from both the LimitRange rule (drops the OOM-inducing limit) and the ResourceQuota rule (pods that no longer receive the LimitRange-supplied requests would otherwise be rejected by the requests.memory quota), treating it as platform infra like flux-system. Longhorn's components are sized by their own VPAs/operator.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
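A rough sketch of what the exemption might look like in the LimitRange rule of the policy. The rule name, the generated object name, and the exact match/exclude layout are assumptions for illustration, since the actual file is not reproduced here. Only the policy name, the excluded namespaces, the 512Mi default, and synchronize: true come from this PR. The ResourceQuota rule would carry the same exclude block.

```yaml
# Sketch only: rule name, object name, and layout are assumptions, not the real file.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-ns-quota
spec:
  rules:
    - name: generate-limitrange              # hypothetical rule name
      match:
        any:
          - resources:
              kinds:
                - Namespace
      exclude:
        any:
          - resources:
              names:                         # for Namespace resources these are namespace names
                - flux-system                # existing exemption
                - longhorn-system            # exemption added by this PR
      generate:
        apiVersion: v1
        kind: LimitRange
        name: default-limitrange             # hypothetical object name
        namespace: "{{request.object.metadata.name}}"
        synchronize: true
        data:
          spec:
            limits:
              - type: Container
                default:
                  memory: 512Mi              # default limit quoted in this PR
```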
Problem (active prod incident, 2026-06-20)
All three CNPG Postgres clusters (coroot-db, umami-db, wedding-db) went degraded with replicas stuck in a timeline-divergence / pg_rewind loop, wedging the infrastructure and apps Flux Kustomizations and taking Coroot offline (its metadata DB crashlooped). Root-caused to the Longhorn storage data plane, not Postgres.

Root cause
The add-ns-quota Kyverno policy (patched in this file) generates a LimitRange into every namespace that stamps a default memory limit of 512Mi onto any container that doesn't set its own. longhorn-system was not exempt, so it applied to the Longhorn instance-manager pods.
longhorn-manager creates instance-manager pods without a memory limit by design (Longhorn's documented recommendation). The injected 512Mi cap is too low: idle RSS already sits at ~335–355Mi (~70% of the cap), so a volume rebuild pushes it over and the kubelet OOM-kills it. The chain: instance-manager OOM-kill → longhorn-manager deletes and recreates the pod → every replica engine on that node faults (DetachedUnexpectedly) → CNPG primary failover and Postgres timeline divergence → the infrastructure and apps Flux Kustomizations wedge.
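To make the mechanism concrete, here is a minimal sketch of the kind of LimitRange the policy would generate in a namespace. Only the 512Mi default memory limit comes from this PR; the object name is hypothetical, and the request defaults the real policy supplies are not shown.

```yaml
# Sketch only: the object name is hypothetical; 512Mi is the value quoted in this PR.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limitrange          # hypothetical name
  namespace: longhorn-system        # generated here before this fix
spec:
  limits:
    - type: Container
      default:
        memory: 512Mi               # becomes limits.memory for containers that set none
```

With an object like this in place, a container created without resources.limits.memory is admitted with a 512Mi limit. At the observed idle RSS of ~335–355Mi, that leaves roughly 157–177Mi of headroom before the limit is hit.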
Evidence: instance-manager pods recreated node-by-node (worker-1/2 ~4h apart, worker-3 minutes before investigation); kubectl top shows IM at ~70% of the 512Mi limit at idle; longhorn-manager logs show the recreate→remount cascade; sync replication is already correctly configured (so the divergence is a symptom of the storage faults, not a Postgres misconfig).

Fix
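Exempt longhorn-system from both generated rules:
- LimitRange rule: drops the OOM-inducing default memory limit.
- ResourceQuota rule: pods that no longer receive the LimitRange-supplied requests would otherwise be rejected by the requests.memory quota. Treated like the existing flux-system exemption.
Longhorn's components are sized by their own VPAs (longhorn-manager, csi-*, longhorn-ui) / the operator, not by the generic tenant quota.

Validation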
- kubectl kustomize k8s/bases/infrastructure/cluster-policies/ builds; longhorn-system renders in both add-ns-quota exclude rules.
- kubectl kustomize k8s/clusters/{prod,local}/ build.

Operational follow-up (not in this PR)
This prevents recurrence. The already-diverged replicas (e.g. coroot-db-3) won't self-heal from the pg_rewind loop: once this merges and Longhorn stabilises, an operator should re-clone them (delete the stuck instance's PVC so CNPG rebuilds from the primary). Kyverno should remove the now-orphaned LimitRange/ResourceQuota in longhorn-system via synchronize: true; delete manually if it doesn't.

🤖 Generated with Claude Code