Give the prod k8s API a stable endpoint (VIP/LB) so CP-node recreates don't break kubeconfig #2120

@devantler

🤖 Generated by the Daily AI Assistant

Problem

The prod k8s API has no stable endpoint: the Talos cluster.controlPlane.endpoint is pinned to a single control-plane node's public IP (was https://178.105.204.76:6443, the original CP-1). When that node is recreated or deleted (routine under the cattle-CP model in the DR runbook; see the [[prod-cd-frozen-by-down-cp-node]] incidents), the endpoint points at a dead host.

Impact (recurs on every CP-node recreate):

  • The prod environment KUBE_CONFIG secret (and local ~/.kube/config) point at a dead IP → the deploy pipeline's 🩺 Verify prod cluster is reachable preflight fails with i/o timeout, evicting PRs from the merge queue and freezing all GitOps delivery.
  • talosctl kubeconfig regenerates the same dead endpoint (it copies controlPlane.endpoint verbatim), so a plain regen doesn't fix it. Every recovery requires a manual repoint to a live CP node IP (DR runbook Scenario 9), which is pure toil and a sharp edge during an incident.
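
For illustration, a kubeconfig regenerated by talosctl kubeconfig would carry the pinned address in its server field, roughly as in the fragment below. This is a sketch, not the real file: the cluster and user names are assumed from the admin@prod context name, and the credential fields are omitted.

```yaml
# Illustrative kubeconfig fragment; names are assumptions, credentials omitted.
apiVersion: v1
kind: Config
clusters:
  - name: prod                # assumed from the admin@prod context name
    cluster:
      # Copied verbatim from cluster.controlPlane.endpoint, so it keeps
      # pointing at the original CP-1 even after that node is gone.
      server: https://178.105.204.76:6443
contexts:
  - name: admin@prod
    context:
      cluster: prod
      user: admin@prod        # assumed user entry name
current-context: admin@prod
```

Repointing by hand means editing only the server value, which is why the fix has to be repeated after every CP-node recreate.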

Latest occurrence: 2026-06-17 (CP recreates after the nftables ENOBUFS boot-hang); secrets manually refreshed, but the root fragility remains.

Proposed direction: give the API server a stable, failover-aware endpoint

Two viable options on Hetzner Cloud; both make the endpoint independent of any individual CP node's lifecycle. Recommend Option A (native, no load-balancer cost, fits the cattle-CP philosophy):

Option A: Talos-managed Hetzner floating-IP VIP (recommended). Talos natively supports a shared control-plane virtual IP backed by a Hetzner Cloud floating IP (machine.network.interfaces[].vip.hcloud.*); Talos elects a leader among the CP nodes via etcd and reassigns the floating IP to it on failover. Point controlPlane.endpoint at the floating IP or its DNS name. No load balancer, no monthly LB cost. (Confirm the exact vip.hcloud schema against current Talos + ksail docs before implementing.)
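
A minimal sketch of what the Option A patch might look like, assuming the machine.network.interfaces[].vip form named above. The interface name eth0, the address 203.0.113.10 (a documentation-range placeholder, not a reserved floating IP), and the token placeholder are all assumptions. The schema should still be checked against the Talos version in use, and the patch would apply to control-plane nodes only.

```yaml
# Hypothetical control-plane patch for a Hetzner floating-IP VIP.
machine:
  network:
    interfaces:
      - interface: eth0          # assumed name of the public NIC
        dhcp: true
        vip:
          ip: 203.0.113.10       # placeholder for the reserved floating IP
          hcloud:
            # Placeholder: Talos needs a Hetzner Cloud API token to
            # reassign the floating IP; treat it as a secret.
            apiToken: "REPLACE_WITH_HCLOUD_API_TOKEN"
```

Because the token ends up in the machine config, it would need the same handling as the other prod secrets rather than being committed in plain text.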

Option B: Hetzner Cloud Load Balancer for :6443. Provision an LB targeting the 3 CP nodes with a TCP health check on 6443; point controlPlane.endpoint at the LB. It is health-aware, so dead nodes are drained automatically. Downside: ~€5/mo recurring + one more managed resource. (Note: the existing cilium-gateway-platform LB is for ingress, not the API.)

Implementation sketch (whichever option)

  1. Reserve the stable address (Hetzner floating IP for A, or LB IP for B); ideally give it a DNS name so the endpoint never changes again.
  2. Express it through ksail (ksail.prod.yaml) and/or a Talos machine-config patch under platform/talos/ (where install-image.yaml/apparmor.yaml already live) so ksail cluster create/update renders cluster.controlPlane.endpoint = the stable address.
  3. Add the stable host to the apiserver cert SANs (cluster.apiServer.certSANs, and machine.certSANs) so TLS validates against the new endpoint. (Today these only contain 127.0.0.1; Talos auto-adds node IPs, which is why repointing at a live node IP currently works.)
  4. Re-run ksail cluster update to roll the config; regenerate KUBE_CONFIG/TALOS_CONFIG once against the stable endpoint and push to the prod env secrets.
  5. Simplify DR runbook Scenario 9: once the endpoint is stable, kubeconfig no longer goes stale on CP recreate; the secret-refresh toil largely disappears.
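
Steps 2 and 3 could be expressed as a single machine-config patch along these lines. This is a sketch under stated assumptions: the DNS name api.prod.example.com and the address 203.0.113.10 are placeholders, not real values, and how ksail wires the patch in is not shown.

```yaml
# Hypothetical patch: stable endpoint plus matching cert SANs.
cluster:
  controlPlane:
    # Placeholder DNS name for the stable address (floating IP or LB).
    endpoint: https://api.prod.example.com:6443
  apiServer:
    certSANs:
      - 127.0.0.1                # the only entry present today
      - api.prod.example.com     # placeholder stable host
      - 203.0.113.10             # placeholder stable IP
machine:
  certSANs:
    - 127.0.0.1
    - api.prod.example.com
    - 203.0.113.10
```

Listing both the DNS name and the IP keeps TLS valid whichever form ends up in the kubeconfig server field.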

Acceptance criteria

  • cluster.controlPlane.endpoint resolves to a stable address that survives any single CP node being deleted/recreated.
  • Deleting and recreating any one CP node leaves the deploy preflight and kubectl --context admin@prod working without a manual kubeconfig repoint or secret refresh.
  • The stable endpoint host is in the apiserver cert SANs (TLS validates).
  • DR runbook Scenario 9 updated to reflect the new (much smaller) refresh story.

Size: M (one design decision A-vs-B + a Talos config/patch + cert SAN change + a controlled cluster update roll on prod). Candidate child of the platform roadmap (#2043). Background + manual refresh recipe: platform/docs/dr/runbook.md Scenario 9.
