fix(kyverno): pin admission-controller memory to stop OOM-induced pod-creation outage #2181

Draft
devantler wants to merge 1 commit into main from claude/fix-kyverno-admission-oom

🤖 Generated by Claude Code (live investigation of the prod platform)

Problem

The ksail-operator HelmRelease was stuck mid-upgrade, with its ReplicaSet emitting:

FailedCreate: Internal error occurred: failed calling webhook
"ivpol.mutate.kyverno.svc-fail": ... context deadline exceeded

This is a cluster-wide pod-creation outage symptom, not a ksail-operator bug.

Root cause

The Kyverno admission controller was running on the chart-default limits.memory: 384Mi and was OOMKilled (exit 137) under image-verification (cosign) and policy-evaluation load. While the admission pods were dead or restarting, the verify-image-signatures webhooks, which are registered with failurePolicy: Fail, could not be served, so the API server rejected every Pod/Deployment/Job CREATE in non-excluded namespaces. That is exactly the FailedCreate seen on ksail-operator, and it is why its upgrade wedged.

The existing comment claimed auto-vpa.yaml would right-size this Deployment, so "a hard limit here would fight VPA." That assumption is false: Kyverno deliberately excludes its own namespace from its admission/generate policies (to avoid an eviction deadlock), so auto-vpa never generates a VPA for the Kyverno controllers (verified: zero VPAs in ns/kyverno). The admission controller was therefore stranded permanently on the too-low chart default, with nothing to raise it.

Fix

Set an explicit admissionController.resources (requests: 100m CPU / 256Mi memory; limits: memory 1Gi) and correct the misleading comment. The 1Gi limit leaves generous headroom (live idle usage is ~133–149Mi; the OOM happened at the 384Mi limit) while staying far under auto-vpa's 6Gi maxAllowed ceiling.
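
As a rough sketch, the values described above might look like the fragment below. The key path follows this PR's description (admissionController.resources); the exact nesting and where it sits within the HelmRelease values are assumptions, so check them against the values.yaml of the pinned Kyverno chart version.

```yaml
# Illustrative fragment only; the surrounding HelmRelease structure is omitted.
# Key path is taken from the PR description and should be verified against
# the chart's values.yaml before use.
admissionController:
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      # Raised from the chart-default 384Mi that was being OOMKilled.
      # No CPU limit is listed in the PR description, so none is shown.
      memory: 1Gi
```

Only a memory limit appears here because that is the value the outage traced back to; the requests keep scheduling behavior predictable without capping CPU.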

Security is unchanged: the webhooks stay failurePolicy: Fail. The reliability problem was the OOM that made fail-closed dangerous; removing the OOM is the right lever. Node-layer enforcement (Talos ImageVerificationConfig) remains in place regardless.

Notes / scope

  • The reports/background controllers also run unmanaged by VPA, but their memory growth is already mitigated by skipResourceFilters + the report-kind filters in this file; only the admission controller caused the outage, so this PR is scoped to it.
  • The admission controller had self-recovered by investigation time (2/2 admission replicas Ready), but the 384Mi limit was unchanged; this PR prevents recurrence.

Validation

kubectl kustomize k8s/bases/infrastructure/controllers/kyverno/ builds; resources renders correctly nested under admissionController.

🤖 Generated with Claude Code

…-creation outage

The Kyverno admission controller ran on the chart-default limits.memory:
384Mi and was OOMKilled (exit 137) under image-verification + policy
evaluation load. Because the verify-image-signatures webhooks are
failurePolicy: Fail, the dead admission controller rejected every
Pod/Deployment/Job CREATE cluster-wide (e.g. ksail-operator FailedCreate
and a wedged HelmRelease upgrade) until it recovered.

The previous comment assumed auto-vpa.yaml would size this Deployment,
but Kyverno excludes its own namespace from admission/generate policies
to avoid an eviction deadlock, so no VPA is ever created for kyverno
controllers (verified: zero VPAs in ns/kyverno). The controller was thus
left permanently on the too-low chart default.

Set an explicit admissionController.resources (requests 100m/256Mi,
limits memory 1Gi) -- generous headroom, still far under auto-vpa's 6Gi
maxAllowed ceiling -- and correct the now-inaccurate comment. Keeps the
webhooks failurePolicy: Fail (no weakening of supply-chain enforcement);
the fix removes the OOM that made fail-closed dangerous.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>