Give the prod k8s API a stable endpoint (VIP/LB) so CP-node recreates don't break kubeconfig #2120

> 🤖 Generated by the Daily AI Assistant

## Problem

The prod k8s API has **no stable endpoint**: the Talos `cluster.controlPlane.endpoint` is pinned to a **single control-plane node's public IP** (previously `https://178.105.204.76:6443`, the original CP-1). Recreating or deleting that node is routine under the cattle-CP model in the DR runbook (see the [[prod-cd-frozen-by-down-cp-node]] incidents), and each time it leaves the endpoint pointing at a dead host.

**Impact (recurs on every CP-node recreate):**
- The `prod` environment `KUBE_CONFIG` secret (and local `~/.kube/config`) point at a dead IP, so the deploy pipeline's `🩺 Verify prod cluster is reachable` preflight fails with `i/o timeout`. That evicts PRs from the merge queue and freezes all GitOps delivery.
- `talosctl kubeconfig` **regenerates the same dead endpoint** (it copies `cluster.controlPlane.endpoint` verbatim), so a plain regen doesn't fix it. Every recovery requires a manual *repoint to a live CP node IP* (DR runbook **Scenario 9**), which is pure toil and a sharp edge during an incident.
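
For illustration, this is roughly where the dead endpoint ends up in a generated kubeconfig. It is a trimmed sketch, not a dump of the real file: the cluster name `prod` is inferred from the `admin@prod` context mentioned below, and all other fields are left out.

```yaml
# Trimmed, illustrative kubeconfig excerpt (cluster name is an assumption).
apiVersion: v1
kind: Config
clusters:
  - name: prod
    cluster:
      # Copied verbatim from cluster.controlPlane.endpoint at generation time,
      # so it still names the deleted CP-1 host after a recreate.
      server: https://178.105.204.76:6443
```

The Scenario 9 repoint amounts to editing this one `server:` value by hand, which is why regenerating the file without changing the endpoint does not help.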

Latest occurrence: 2026-06-17 (CP recreates after the nftables ENOBUFS boot-hang). The secrets were refreshed manually, but the underlying fragility remains.

## Proposed direction — give the API server a stable, failover-aware endpoint

Two viable options on Hetzner Cloud; both decouple the endpoint from any individual CP node's lifecycle. Recommend **Option A** (native to Talos, no load balancer to pay for or operate, fits the cattle-CP philosophy):

**Option A: Talos-managed Hetzner floating-IP VIP (recommended).** Talos natively supports a shared control-plane virtual IP backed by a Hetzner Cloud floating IP (`machine.network.interfaces[].vip.hcloud.*`). Talos elects a leader among the CP nodes via etcd and reassigns the floating IP to it on failover. Point `cluster.controlPlane.endpoint` at the floating IP or its DNS name. No load balancer, no monthly LB cost. *(Confirm the exact `vip.hcloud` schema against current Talos + ksail docs before implementing.)*
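
A minimal sketch of what the Option A patch might look like, assuming the `machine.network.interfaces[]` schema named above. The interface name `eth0`, the address `203.0.113.10` (a documentation-range placeholder) and the token value are assumptions for illustration, not values from this cluster; the field names should be verified against the Talos version in use.

```yaml
# Hypothetical patch for control-plane nodes only (workers must not claim the VIP).
machine:
  network:
    interfaces:
      - interface: eth0              # assumption: public NIC name on the CP nodes
        dhcp: true
        vip:
          ip: 203.0.113.10           # placeholder for the reserved Hetzner floating IP
          hcloud:
            apiToken: REPLACE_ME     # placeholder; inject from a secret, never commit
```

Because the election runs through etcd, the VIP only comes up once the cluster is bootstrapped. The Talos documentation also advises against using the VIP as the Talos API endpoint in `talosconfig`, so that file would likely keep pointing at node addresses.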

**Option B: Hetzner Cloud Load Balancer for `:6443`.** Provision an LB targeting the 3 CP nodes with a TCP health check on 6443, and point `cluster.controlPlane.endpoint` at the LB. The LB is health-aware, so dead nodes are drained automatically. Downside: roughly €5/mo recurring, plus one more managed resource. (Note: the existing `cilium-gateway-platform` LB is for ingress, not the API.)

## Implementation sketch (whichever option)

1. Reserve the stable address (Hetzner floating IP for A, or LB IP for B). Ideally give it a DNS name so the endpoint never has to change again.
2. Express it through **ksail** (`ksail.prod.yaml`) and/or a Talos machine-config patch under `platform/talos/` (where `install-image.yaml`/`apparmor.yaml` already live) so that `ksail cluster create/update` renders `cluster.controlPlane.endpoint` as the stable address.
3. **Add the stable host to the apiserver cert SANs** (`cluster.apiServer.certSANs` and `machine.certSANs`) so TLS validates against the new endpoint. *(Today these only contain `127.0.0.1`; Talos auto-adds node IPs, which is why repointing at a live node IP currently works.)*
4. Re-run `ksail cluster update` to roll the config, then regenerate `KUBE_CONFIG`/`TALOS_CONFIG` once against the stable endpoint and push them to the `prod` env secrets.
5. **Simplify DR runbook Scenario 9.** Once the endpoint is stable, kubeconfig no longer goes stale on CP recreate, and the secret-refresh toil largely disappears.
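
Steps 2 and 3 could be sketched as a single machine-config patch along these lines. The host name `k8s-api.example.com` and the address `203.0.113.10` are placeholders, and whether this lives in its own file under `platform/talos/` or is rendered by ksail is an open choice, not something the repo confirms.

```yaml
# Hypothetical patch; host name and IP are placeholders for the stable address.
cluster:
  controlPlane:
    endpoint: https://k8s-api.example.com:6443
  apiServer:
    certSANs:
      - k8s-api.example.com    # stable DNS name, so kubectl TLS validation passes
      - 203.0.113.10           # stable IP, in case clients connect by address
machine:
  certSANs:                    # same names on the Talos API certificate
    - k8s-api.example.com
    - 203.0.113.10
```

Listing both the DNS name and the IP keeps either form of the endpoint valid, at the cost of a cert rotation if the IP ever changes.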

## Acceptance criteria

- [ ] `cluster.controlPlane.endpoint` resolves to a stable address that survives any single CP node being deleted/recreated.
- [ ] Deleting and recreating any one CP node leaves the deploy preflight and `kubectl --context admin@prod` working **without** a manual kubeconfig repoint or secret refresh.
- [ ] The stable endpoint host is in the apiserver cert SANs (TLS validates).
- [ ] DR runbook Scenario 9 updated to reflect the new (much smaller) refresh story.

**Size:** M (one design decision A-vs-B + a Talos config/patch + cert SAN change + a controlled `cluster update` roll on prod). Candidate child of the platform roadmap (#2043). Background + manual refresh recipe: `platform/docs/dr/runbook.md` Scenario 9.

