fix(longhorn): stop instance-manager OOM by exempting longhorn-system from generated LimitRange by devantler · Pull Request #2180 · devantler-tech/platform

devantler · 2026-06-20T07:25:38Z

🤖 Generated by Claude Code (live investigation of the prod platform)

Problem (active prod incident, 2026-06-20)

All three CNPG Postgres clusters (coroot-db, umami-db, wedding-db) went degraded with replicas stuck in a timeline-divergence / pg_rewind loop, wedging the infrastructure and apps Flux Kustomizations and taking Coroot offline (its metadata DB crashlooped). Root-caused to the Longhorn storage data plane, not Postgres.

Root cause

The add-ns-quota Kyverno policy (patched in this file) generates a LimitRange into every namespace that stamps a default memory limit of 512Mi onto any container that doesn't set its own. longhorn-system was not exempt, so it applied to the Longhorn instance-manager pods.
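
For orientation, the generated object described above would look roughly like the sketch below. Only the 512Mi default memory limit comes from this PR; the object name and the defaultRequest value are illustrative placeholders, not the policy's actual values.

```yaml
# Hypothetical sketch of the LimitRange that add-ns-quota generates per namespace.
# Only the 512Mi default memory limit is taken from this PR; every other
# name and value here is an assumed placeholder.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limitrange       # assumed name
  namespace: longhorn-system     # the affected namespace
spec:
  limits:
    - type: Container
      default:
        memory: 512Mi            # becomes the limit of any container that sets none
      defaultRequest:
        memory: 256Mi            # assumed value; becomes the request when none is set
```

Because instance-manager containers declare no memory limit of their own, admission fills in the 512Mi default from an object of this shape.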

longhorn-manager creates instance-manager pods without a memory limit by design (Longhorn's documented recommendation). The injected 512Mi cap is too low: idle RSS already sits at ~335–355Mi (roughly 65–70% of the cap), so a volume rebuild pushes it over the limit and the kernel's cgroup OOM killer terminates it. The chain:

LimitRange 512Mi on instance-manager
  → OOMKill during rebuild (evidence destroyed with the pod)
  → longhorn-manager: "Instance manager pod ... is deleted or not running, recreating"
  → every replica engine on that node faults (DetachedUnexpectedly / Faulted)
  → CNPG primary loses its volume → failover
  → a lagging standby is promoted → timeline fork → old replica stuck needing pg_rewind
  → CNPG cluster never reaches quorum → Flux health checks time out

Evidence: instance-manager pods recreated node-by-node (worker-1/2 ~4h apart, worker-3 minutes before investigation); kubectl top shows IM at ~70% of the 512Mi limit at idle; longhorn-manager logs show the recreate→remount cascade; sync replication is already correctly configured (so the divergence is a symptom of the storage faults, not a Postgres misconfig).

Fix

Exempt longhorn-system from both generated rules:

rule 1 (LimitRange) — removes the OOM-inducing memory limit so instance-managers run unconstrained, per Longhorn guidance.
rule 0 (ResourceQuota) — required in tandem: without the LimitRange-supplied default requests, pods would be rejected by the requests.memory quota. Treated like the existing flux-system exemption.

Longhorn's components are sized by their own VPAs (longhorn-manager, csi-*, longhorn-ui) / the operator, not by the generic tenant quota.
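
A minimal sketch of what the two exclusions could look like in the policy, assuming add-ns-quota is a ClusterPolicy that matches on Namespace and follows the usual Kyverno generate-rule layout. The rule names, generated object names, and all quota and limit values other than 512Mi are assumptions for illustration; the rule order, the excluded namespaces, and synchronize: true follow this description.

```yaml
# Hypothetical excerpt of add-ns-quota after the change (not the actual diff).
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-ns-quota
spec:
  rules:
    # rule 0: ResourceQuota
    - name: generate-resourcequota            # assumed rule name
      match:
        any:
          - resources:
              kinds:
                - Namespace
      exclude:
        any:
          - resources:
              kinds:
                - Namespace
              names:
                - flux-system                 # existing exemption
                - longhorn-system             # added by this PR
      generate:
        apiVersion: v1
        kind: ResourceQuota
        name: default-resourcequota           # assumed name
        namespace: "{{request.object.metadata.name}}"
        synchronize: true
        data:
          spec:
            hard:
              requests.memory: 16Gi           # assumed value
    # rule 1: LimitRange
    - name: generate-limitrange               # assumed rule name
      match:
        any:
          - resources:
              kinds:
                - Namespace
      exclude:
        any:
          - resources:
              kinds:
                - Namespace
              names:
                - longhorn-system             # added by this PR; other existing exclusions omitted
      generate:
        apiVersion: v1
        kind: LimitRange
        name: default-limitrange              # assumed name
        namespace: "{{request.object.metadata.name}}"
        synchronize: true
        data:
          spec:
            limits:
              - type: Container
                default:
                  memory: 512Mi               # the cap that was OOM-killing instance-manager
                defaultRequest:
                  memory: 256Mi               # assumed value
```

Excluding the namespace from only rule 1 would leave the requests.memory quota in force with no defaulted requests, which is why both rules carry the exclusion.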

Validation

kubectl kustomize k8s/bases/infrastructure/cluster-policies/ builds; longhorn-system renders in both add-ns-quota exclude rules.
kubectl kustomize k8s/clusters/{prod,local}/ build.

Operational follow-up (not in this PR)

This prevents recurrence. The already-diverged replicas (e.g. coroot-db-3) won't self-heal from the pg_rewind loop — once this merges and Longhorn stabilises, an operator should re-clone them (delete the stuck instance's PVC so CNPG rebuilds from the primary). Kyverno should remove the now-orphaned LimitRange/ResourceQuota in longhorn-system via synchronize: true; delete manually if it doesn't.

🤖 Generated with Claude Code

The add-ns-quota Kyverno policy generates a LimitRange into every namespace that stamps a default memory limit (512Mi) onto any container that omits one. longhorn-manager creates instance-manager pods without a memory limit by design; the injected 512Mi cap OOM-kills them during volume rebuilds (idle RSS already sits at ~70% of the cap).

When an instance-manager dies, longhorn-manager deletes+recreates the pod, which faults every replica engine on that node (DetachedUnexpectedly) and cascades into CNPG primary failover and Postgres timeline divergence across coroot-db / umami-db / wedding-db, wedging the infrastructure and apps Flux Kustomizations cluster-wide (observed 2026-06-20).

Exempt longhorn-system from both the LimitRange rule (drops the OOM-inducing limit) and the ResourceQuota rule (pods that no longer receive the LimitRange-supplied requests would otherwise be rejected by the requests.memory quota), treating it as platform infra like flux-system. Longhorn's components are sized by their own VPAs/operator.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

github-project-automation Bot added this to 🌊 Project Board Jun 20, 2026

github-project-automation Bot moved this to 🫴 Ready in 🌊 Project Board Jun 20, 2026

devantler had a problem deploying to ci June 20, 2026 07:25 — with GitHub Actions Failure

devantler marked this pull request as ready for review June 20, 2026 09:48

devantler enabled auto-merge June 20, 2026 09:48

devantler had a problem deploying to ci June 20, 2026 19:01 — with GitHub Actions Failure

devantler mentioned this pull request Jun 20, 2026

fix(ci): pin ksail to v7.65.0 in system-test to avoid the 7.66.0 validate race #2184

Open

devantler wants to merge 1 commit into main from claude/fix-longhorn-limitrange-oom
