🤖 Generated by the Daily AI Assistant
Problem
The prod k8s API has no stable endpoint: the Talos cluster.controlPlane.endpoint is pinned to a single control-plane node's public IP (was https://178.105.204.76:6443, the original CP-1). When that node is recreated or deleted (routine, per the cattle-CP model in the DR runbook; see [[prod-cd-frozen-by-down-cp-node]] incidents), the endpoint points at a dead host.
Impact (recurs on every CP-node recreate):
- The prod environment KUBE_CONFIG secret (and local ~/.kube/config) point at a dead IP, so the deploy pipeline's 🩺 Verify prod cluster is reachable preflight fails with i/o timeout, evicting PRs from the merge queue and freezing all GitOps delivery.
- talosctl kubeconfig regenerates the same dead endpoint (it copies controlPlane.endpoint verbatim), so a plain regen doesn't fix it; every recovery requires a manual repoint to a live CP node IP (DR runbook Scenario 9), which is pure toil and a sharp edge during an incident.
Latest occurrence: 2026-06-17 (CP recreates after the nftables ENOBUFS boot-hang); secrets manually refreshed, but the root fragility remains.
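To make the failure mode concrete, a regenerated kubeconfig would look roughly like the fragment below. This is an illustrative sketch, not a dump of the real file: the cluster name prod is inferred from the admin@prod context, and credentials are omitted.

```yaml
# Illustrative kubeconfig fragment (credentials omitted).
apiVersion: v1
kind: Config
clusters:
  - name: prod                              # assumed cluster name
    cluster:
      # Copied from cluster.controlPlane.endpoint, so it still names the
      # original CP-1 address even after that node is gone.
      server: https://178.105.204.76:6443
contexts:
  - name: admin@prod
    context:
      cluster: prod
      user: admin@prod
current-context: admin@prod
```

Any client using this file times out until the server field is repointed by hand at a live CP node.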
Proposed direction: give the API server a stable, failover-aware endpoint
Two viable options on Hetzner Cloud; both make the endpoint independent of any individual CP node's lifecycle. Recommend Option A (native, no load balancer cost, fits the cattle-CP philosophy):
Option A: Talos-managed Hetzner floating-IP VIP (recommended). Talos natively supports a shared control-plane virtual IP backed by a Hetzner Cloud floating IP (machine.network.interfaces[].vip.hcloud.*); Talos elects a leader among the CP nodes via etcd and reassigns the floating IP to it on failover. Point controlPlane.endpoint at the floating IP / its DNS name. No load balancer, no monthly LB cost. (Confirm the exact vip.hcloud schema against current Talos + ksail docs before implementing.)
Option B: Hetzner Cloud Load Balancer for :6443. Provision an LB targeting the 3 CP nodes with a TCP health check on 6443; point controlPlane.endpoint at the LB. Health-aware, dead nodes drained automatically. Downside: ~€5/mo recurring + one more managed resource. (Note: the existing cilium-gateway-platform LB is for ingress, not the API.)
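For Option A, a control-plane machine-config patch along the following lines is one plausible shape. It is a sketch under assumptions: the interface name eth0, the address 203.0.113.10 (a documentation placeholder standing in for the reserved floating IP), and the token placeholder are all illustrative, and the vip.hcloud layout still needs checking against the Talos version in use.

```yaml
# Hypothetical patch for control-plane nodes only (for example, a new file
# under platform/talos/). All values below are placeholders.
machine:
  network:
    interfaces:
      - interface: eth0                  # assumed name of the public interface
        dhcp: true
        vip:
          ip: 203.0.113.10               # placeholder for the reserved floating IP
          hcloud:
            apiToken: "<hcloud-api-token>"   # placeholder; used to reassign the floating IP
```

The API token is a credential, so it would presumably be injected from the existing secret handling rather than committed alongside the other patches.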
Implementation sketch (whichever option)
- Reserve the stable address (Hetzner floating IP for A, or LB IP for B); ideally give it a DNS name so the endpoint never changes again.
- Express it through ksail (ksail.prod.yaml) and/or a Talos machine-config patch under platform/talos/ (where install-image.yaml/apparmor.yaml already live) so ksail cluster create/update renders cluster.controlPlane.endpoint = the stable address.
- Add the stable host to the apiserver cert SANs (cluster.apiServer.certSANs, and machine certSANs) so TLS validates against the new endpoint. (Today these only contain 127.0.0.1; Talos auto-adds node IPs, which is why repointing at a live node IP currently works.)
- Re-run ksail cluster update to roll the config; regenerate KUBE_CONFIG/TALOS_CONFIG once against the stable endpoint and push to the prod env secrets.
- Simplify DR runbook Scenario 9: once the endpoint is stable, kubeconfig no longer goes stale on CP recreate; the secret-refresh toil largely disappears.
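The endpoint and cert SAN steps above could be expressed as a single patch roughly like this. It is a minimal sketch: the hostname k8s-api.example.com and the address 203.0.113.10 are placeholders for whatever stable name and IP get reserved, and how ksail layers this patch over its own rendered config is not confirmed here.

```yaml
# Hypothetical patch (for example, a new file under platform/talos/).
# Hostname and IP are placeholders, not real prod values.
cluster:
  controlPlane:
    endpoint: https://k8s-api.example.com:6443   # stable name instead of a node IP
  apiServer:
    certSANs:
      - 127.0.0.1                # existing entry mentioned in the issue
      - k8s-api.example.com      # placeholder stable DNS name
      - 203.0.113.10             # placeholder stable IP (floating IP or LB IP)
machine:
  certSANs:
    - 127.0.0.1                  # existing entry mentioned in the issue
    - k8s-api.example.com
    - 203.0.113.10
```

Listing both the name and the IP keeps TLS valid whether clients connect by DNS name or directly by address.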
Acceptance criteria
- cluster.controlPlane.endpoint resolves to a stable address that survives any single CP node being deleted/recreated.
- kubectl --context admin@prod keeps working after a CP-node recreate, without a manual kubeconfig repoint or secret refresh.
Size: M (one design decision A-vs-B + a Talos config/patch + cert SAN change + a controlled cluster update roll on prod). Candidate child of the platform roadmap (#2043). Background + manual refresh recipe: platform/docs/dr/runbook.md Scenario 9.