During the 2026-06-06 Talos 1.12.4 → 1.13.3 node roll, every Longhorn volume went faulted/detached simultaneously (spire, openbao, umami, wedding-app, headlamp, actual-budget, …). This is root cause RC-2 of the SEV-1 reconciliation wedge (see #1818 and upstream cilium/cilium#46392).
It is decoupled from #1818: that PR removes SPIRE from Longhorn so the mTLS deadlock can't recur. RC-2 is the separate, more general problem — all stateful workloads lose their storage at once during a node roll.
Why it happened
default-replica-count = 3 and only 3 static workers (prod-worker-1/2/3) → Longhorn anti-affinity places one replica per worker.
The Talos OS roll cordoned/drained all three workers near-simultaneously (replica failedAt 20:12 / 20:18 / 20:24) → every volume lost all 3 replicas at once → faulted.
The roll did not pace itself to keep at least one healthy replica online per volume (no wait-for-Longhorn-rebuild between nodes).
Volumes auto-recovered this time only because the old workers' disks (and their Longhorn node objects) came back; had those nodes been destroyed, recovery would have required salvage-from-stale-replica or a Velero restore.
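The failure mode above can be illustrated with a small model. This is only a sketch: the placement map is hypothetical (three of the affected volumes, each with one replica on each static worker), and `faulted_volumes` is an illustrative helper, not part of Longhorn or ksail.

```python
from typing import Dict, Set

# Hypothetical placement: volume name -> nodes holding one replica each.
# With default-replica-count = 3 and exactly three workers, every volume
# ends up on the same three nodes.
WORKERS = {"prod-worker-1", "prod-worker-2", "prod-worker-3"}
PLACEMENT: Dict[str, Set[str]] = {
    name: set(WORKERS) for name in ("spire", "openbao", "umami")
}


def faulted_volumes(placement: Dict[str, Set[str]], offline: Set[str]) -> Set[str]:
    """Return the volumes whose replicas all sit on offline nodes."""
    return {vol for vol, nodes in placement.items() if nodes <= offline}


# One worker down: every volume is degraded but none is faulted.
print(faulted_volumes(PLACEMENT, {"prod-worker-1"}))  # set()
# All three workers down together: every volume is faulted.
print(sorted(faulted_volumes(PLACEMENT, WORKERS)))  # ['openbao', 'spire', 'umami']
```

Because every volume shares the same three nodes, taking that set offline faults all volumes at once; no subset of two workers does.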
Options to evaluate
Roll pacing (operational + possible ksail change). Upgrade/replace one replica-bearing node at a time and gate progression on Longhorn replica health (and PDBs) so at least one healthy replica of every volume stays online throughout. The Talos OS upgrade path here is driven by ksail — a ksail feature to drain one node at a time and wait on Longhorn/PDB health would make this safe by default. → spin off a devantler-tech/ksail issue if we want it enforced rather than documented. (Related: [talos-upgrade-ksail-strips-extensions] / ksail#5077.)
Replica placement (Longhorn HelmRelease config). Let replicas spread beyond the 3 static workers (e.g. onto autoscale nodes) and/or tune replica-soft-anti-affinity / zone awareness so a single pool roll can't take all replicas of a volume. Caveat: Hetzner nodes may share a location, limiting zone anti-affinity.
Backups as the floor. Velero CSI snapshot + data-mover for Longhorn PVCs is already in place (feat(velero): CSI snapshot + data-mover backups for Longhorn PVCs #1790) — confirm it would actually allow a restore if replica data did not survive on the rolled nodes (i.e. validate a restore drill), since that's the real safety net for a worst-case roll.
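The roll-pacing option could be gated roughly as follows. This is a minimal sketch under stated assumptions, not ksail's behavior: `get_healthy`, `upgrade`, and `wait_for_rebuild` are hypothetical callbacks that a real implementation would back with Longhorn replica status and the Talos upgrade call, and PDB checks are left out.

```python
from typing import Callable, Dict, Iterable, List, Set

Placement = Dict[str, Set[str]]  # volume -> nodes currently holding a healthy replica


def safe_to_drain(node: str, healthy: Placement) -> bool:
    """True if every volume keeps at least one healthy replica off `node`."""
    return all(nodes - {node} for nodes in healthy.values())


def paced_roll(
    nodes: Iterable[str],
    get_healthy: Callable[[], Placement],
    upgrade: Callable[[str], None],
    wait_for_rebuild: Callable[[], None],
) -> List[str]:
    """Upgrade nodes one at a time; stop at the first node that is unsafe to drain."""
    done: List[str] = []
    for node in nodes:
        if not safe_to_drain(node, get_healthy()):
            break  # refuse to take the last healthy replica of any volume offline
        upgrade(node)       # hypothetical: cordon, drain, upgrade, uncordon
        wait_for_rebuild()  # hypothetical: block until replicas report healthy again
        done.append(node)
    return done
```

The function returns the nodes it actually upgraded, so a caller can tell a completed roll from one that stopped early. If `wait_for_rebuild` does nothing, the loop stops before the last replica-bearing node instead of faulting the volumes.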
Acceptance
A documented and (where feasible) enforced procedure such that a Talos OS roll never faults all replicas of a volume simultaneously — stateful workloads stay available (degraded at worst) through a roll.
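One way to make the acceptance criterion checkable is to replay the node events of a roll and record the lowest healthy replica count each volume reached. The sketch below is illustrative only: the event format and helper names are hypothetical, and it assumes for simplicity that a replica is healthy again as soon as its node is back, which ignores rebuild time.

```python
from typing import Dict, Iterable, Set, Tuple

Event = Tuple[str, str]  # (node, "down" or "up"); hypothetical event format


def min_healthy_replicas(
    placement: Dict[str, Set[str]], events: Iterable[Event]
) -> Dict[str, int]:
    """Replay node events and return the lowest healthy replica count per volume."""
    offline: Set[str] = set()
    lowest = {vol: len(nodes) for vol, nodes in placement.items()}
    for node, state in events:
        if state == "down":
            offline.add(node)
        else:
            offline.discard(node)
        for vol, nodes in placement.items():
            lowest[vol] = min(lowest[vol], len(nodes - offline))
    return lowest


def roll_is_acceptable(placement: Dict[str, Set[str]], events: Iterable[Event]) -> bool:
    """Acceptance: no volume ever drops to zero healthy replicas during the roll."""
    return all(n >= 1 for n in min_healthy_replicas(placement, events).values())
```

A roll that takes all three workers down before any returns fails this check, while a one-node-at-a-time sequence passes with every volume degraded at worst.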