ADR-0105: F2 cell-0 high availability (remove the single-VPS SPOF)¶

Status: Proposed — executable plan; not yet applied. Supersedes the Wave-1 single-node posture in ADR-0053 for the cell-0 control plane.
Date: 2026-06-13
Driver: F2 auto-provisioning is LIVE on a single Hetzner VPS (cell-0). One node hosts K3s + Traefik + cert-manager + Temporal + Postgres — a hard single-point-of-failure for the provisioning control plane. Node loss = no new tenant onboarding + loss of the cell substrate DB (tenants/slots/quota).
Relates to: ADR-0053 (cell-based hyperscale), the F2 wiring + Hetzner cell architecture (docs/F2_HETZNER_CELL0_ARCHITECTURE.md), and the Wave-2 slot admission + ledger-quota reconciliation that read/write the cell Postgres.

Context¶

cell-0 today (docs/F2_HETZNER_CELL0_ARCHITECTURE.md):

Component	Today (Wave-1)	Failure impact
K3s	single server (SQLite-backed)	API server down → no scheduling, no reconcile
Postgres	single pod, local PVC	data-loss risk — tenants/slots/quota gone
Temporal	single, Postgres-backed	onboarding workflows stall + lose history
Traefik / ingress	single replica	public `f2.tlsstress.art` trigger unreachable
cert-manager	single	cert renewal stalls (LE certs ~60-day buffer)
Public IP	one VPS IP	node loss = endpoint gone until manual rebuild

This is acceptable for the dormant / pre-revenue posture (the memory records F2 as DORMANT in prod), but it is not acceptable once F2 carries live customer onboarding. The substrate Postgres is the critical asset: it is the source of truth the data plane (connect-art / stun-coord admission, ADR-0053) and billing reconciliation depend on.

This ADR is infrastructure — it cannot be delivered as application code in a single change; it is the executable plan + runbook to flip cell-0 from single-node to HA. The application code is already HA-ready (Temporal scales horizontally; the cell Manager + admission.Checker + quota Reconciler are stateless over Postgres; provisioning activities are idempotent — see ADR-0053).

Decision¶

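Adopt a 3-node K3s HA cluster for cell-0 with replicated Postgres, behind a Hetzner load balancer + floating IP. Three nodes (not two) because K3s embedded etcd stays available only while a majority of its members (the quorum) is up: with 3 members the quorum is 2, so one node can be lost without losing the API server, whereas with 2 members the quorum is also 2 and any node loss halts the control plane.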
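The quorum arithmetic behind the three-node choice can be checked with a few lines of shell. This is a minimal sketch: the function names `quorum` and `tolerated_failures` are illustrative and are not part of K3s or etcd.

```shell
# Majority quorum for an etcd cluster of N members.
quorum() { echo $(( $1 / 2 + 1 )); }

# Members that can fail while a majority is still reachable.
tolerated_failures() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 4 5; do
  printf '%d members: quorum=%d, tolerates %d failure(s)\n' \
    "$n" "$(quorum "$n")" "$(tolerated_failures "$n")"
done
```

Two members tolerate no failures and four tolerate only one, which is why the cluster goes from 1 node straight to 3.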

Target topology¶

                 Hetzner LB (f2.tlsstress.art)  +  Floating IP
                              │  (health-checked :443)
        ┌─────────────────────┼─────────────────────┐
   ┌────▼────┐           ┌────▼────┐           ┌────▼────┐
   │ node-1  │           │ node-2  │           │ node-3  │   cx33 each
   │ k3s srv │           │ k3s srv │           │ k3s srv │   (embedded etcd
   │ +etcd   │◄─ raft ──►│ +etcd   │◄─ raft ──►│ +etcd   │    3 members, quorum = 2)
   └────┬────┘           └────┬────┘           └────┬────┘
        │  Traefik (DaemonSet/2+) · cert-manager (leader-elected)
        │  Temporal frontend/history/matching (≥2 replicas each)
        ▼
   Postgres HA  ── Patroni + Spilo (1 primary + 2 sync/async replicas)
                   OR Hetzner-managed Postgres if/when offered in the region

Component decisions¶

K3s control plane: 3 servers on embedded etcd (raft); the first is started with --cluster-init and the other two join it with --server. Tolerates 1 node loss. No external DB needed (drops the SQLite SPOF).
Postgres: Patroni + Spilo StatefulSet (3 replicas, 1 primary, sync standby) with a postgres Service that always points at the leader. WAL archiving to Hetzner Object Storage (S3-compatible) or AWS S3 for PITR — this is the disaster-recovery backstop independent of node count. Patroni handles automatic failover + leader election.
Temporal: scale frontend/history/matching to ≥2 replicas each; Temporal is already designed for HA over a shared (now-replicated) Postgres.
Ingress / public endpoint: Hetzner LB with a TCP/HTTPS health check on the Traefik node-port; a floating IP as the DNS A-record target so a node swap doesn't change f2.tlsstress.art. Traefik runs as a DaemonSet (or ≥2 replicas) so every node can terminate ingress.
cert-manager: single logical instance with leader election (default) — already HA-safe across 3 nodes; LE certs survive a ~60-day window regardless.
DNS: f2.tlsstress.art → floating IP (Cloudflare, proxied as today). The LB health check + floating-IP reassignment is the failover path; DNS TTL stays low (60s) as a secondary path.

Non-goals¶

Multi-region cell-0 (this is single-region HA; cross-region is a Wave-N cell fan-out per ADR-0053, not this ADR).
Zero-RPO. Sync standby gives near-zero RPO for committed txns; WAL-to-S3 is the catastrophic-loss backstop (RPO = archive interval).

Executable plan¶

Phase 0 — backstop FIRST (do before any topology change)¶

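Even while single-node, the substrate DB must be recoverable:

1. Enable WAL archiving from the current Postgres to S3 (Hetzner Object Storage or the existing AWS tlsstress-* bucket), take a base backup (e.g. pg_basebackup) for the archived WAL to replay onto, and add a nightly pg_dump to the same bucket. This alone removes the data-loss risk (the worst failure) at ~$1/mo.
2. Verify a restore into a throwaway Postgres, covering both paths: pg_restore of the nightly dump, and base backup + WAL replay for point-in-time recovery. A pg_dump cannot be rolled forward with WAL, so the two paths are tested separately. A backup is not a backup until a restore is proven.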
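The Phase 0 steps could look roughly like the helpers below. This is a sketch under stated assumptions: the bucket name `s3://example-cell0-backups` and all function names are hypothetical, `OCTOPUS_CELL_DATABASE_URL` is assumed to hold a libpq-compatible connection URI, and plain `aws s3 cp` stands in for whatever archiving tool is actually chosen (Hetzner Object Storage would additionally need `--endpoint-url`).

```shell
# Hypothetical bucket; override via the environment.
BACKUP_BUCKET="${BACKUP_BUCKET:-s3://example-cell0-backups}"

# postgresql.conf settings for continuous WAL archiving
# (%p = path of the WAL segment, %f = its file name).
# WAL replay also needs a physical base backup (pg_basebackup), not shown here.
wal_archive_settings() {
  cat <<EOF
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p ${BACKUP_BUCKET}/wal/%f'
EOF
}

# Object key for a nightly logical dump, e.g. dumps/cell0-2026-06-13.dump
dump_key() { printf 'dumps/cell0-%s.dump\n' "$1"; }

# Stream a custom-format dump straight to the bucket (needs pg_dump + aws CLI).
nightly_dump() {
  pg_dump --format=custom --dbname="$OCTOPUS_CELL_DATABASE_URL" |
    aws s3 cp - "${BACKUP_BUCKET}/$(dump_key "$(date -u +%F)")"
}

# Restore a downloaded dump into a throwaway database to prove the backup.
verify_restore() {  # $1 = local dump file, $2 = DSN of the scratch database
  pg_restore --clean --if-exists --no-owner --dbname="$2" "$1"
}
```

Nothing runs on load; `wal_archive_settings` prints the lines to add to postgresql.conf, `nightly_dump` is meant for a cron job or CronJob, and `verify_restore` is the restore proof.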

Phase 1 — provision 2 more nodes (Terraform)¶

Extend the existing hcloud Terraform (the F2-H1 firewall module):

# 3 servers, same type/image/firewall as cell-0 node-1.
resource "hcloud_server" "cell0" {
  count        = 3
  name         = "cell0-node-${count.index + 1}"
  server_type  = "cx33"
  image        = "ubuntu-24.04"
  location     = "hel1"
  firewall_ids = [hcloud_firewall.cell0.id]
  # cloud-init installs k3s in server mode (see Phase 2).
}

resource "hcloud_load_balancer" "cell0" {
  name               = "cell0-lb"
  load_balancer_type = "lb11"
  location           = "hel1"
}

resource "hcloud_floating_ip" "cell0" {
  type          = "ipv4"
  home_location = "hel1"
}

# hcloud_load_balancer_target / _service: HTTPS :443 → Traefik node-port, health-check /healthz.

Phase 2 — K3s embedded-etcd HA¶

node-1 (existing): re-init with --cluster-init (etcd). This requires a rebuild of the K3s datastore — do it AFTER Phase 0 backups + a maintenance window, OR stand up a fresh 3-node cluster and migrate (preferred: less risk).
node-2/3: k3s server --server https://<node-1>:6443 --token <T>.
Verify that kubectl get nodes shows three Ready nodes with roles control-plane,etcd,master; then kill one node and confirm the API server stays up.
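The bootstrap and verification steps above can be sketched as two small helpers. The function names are illustrative, `K3S_TOKEN` is a placeholder for the shared cluster token, and the node-1 address is whatever the first server is reachable on; only the `--cluster-init`, `--server`, and `--token` flags come from the steps above.

```shell
# Print the k3s command for a node: node 1 bootstraps etcd, the rest join it.
k3s_server_cmd() {  # $1 = node index (1-based), $2 = node-1 address
  if [ "$1" -eq 1 ]; then
    printf '%s\n' 'k3s server --cluster-init --token "$K3S_TOKEN"'
  else
    printf 'k3s server --server https://%s:6443 --token "$K3S_TOKEN"\n' "$2"
  fi
}

# Count Ready nodes from `kubectl get nodes --no-headers` output on stdin.
count_ready() { awk '$2 == "Ready" { n++ } END { print n + 0 }'; }
```

For example, `kubectl get nodes --no-headers | count_ready` should print 3 on a healthy cluster, and 2 once a killed node has been marked NotReady.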

Phase 3 — Postgres HA (Patroni/Spilo)¶

Deploy the Spilo StatefulSet (3 replicas) + a postgres-primary Service selecting the Patroni leader.
Migrate data: pg_dump from the current single Postgres → restore into the new primary (the substrate is small: tenants/slots/quota). Keep WAL archiving on.
Repoint every consumer DSN (OCTOPUS_CELL_DATABASE_URL, Temporal, provisioner) at the postgres-primary Service. The data plane (connect-art/stun-coord) + reconciler pick it up via the same env (no code change — ADR-0053 / Wave-2).
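The migration and the DSN repoint can be sketched as follows. Both function names are hypothetical, the DSNs are assumed to be postgres:// URLs whose password contains no '@', and the dump is piped directly into the new primary rather than staged on disk.

```shell
# Swap the host part of a postgres:// URL, keeping user, port, database, options.
# Assumes the password contains no '@'.
repoint_dsn() {  # $1 = current DSN, $2 = new host
  printf '%s\n' "$1" | sed -E "s#@[^:/?]+#@$2#"
}

# One-shot logical migration (needs pg_dump and pg_restore on PATH).
# pg_restore reads the archive from stdin when no file name is given.
migrate_substrate() {  # $1 = old DSN, $2 = new primary DSN
  pg_dump --format=custom --dbname="$1" |
    pg_restore --no-owner --dbname="$2"
}
```

A call such as `repoint_dsn "$OCTOPUS_CELL_DATABASE_URL" postgres-primary` yields the value to set on each consumer; writes to the old database should be paused between the dump and the repoint so nothing is lost in the gap.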

Phase 4 — Temporal + ingress scale-out¶

Scale Temporal frontend/history/matching to 2 replicas each (kubectl scale); switch Traefik to a DaemonSet; leave cert-manager unchanged (leader-elected).
Point the Hetzner LB at the Traefik node-port; move the f2.tlsstress.art A-record to the floating IP; set the LB health check to the Traefik /healthz.
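The Temporal scale-out could be generated and reviewed before it is applied, as in this sketch. The namespace `temporal` and the deployment names `temporal-frontend`, `temporal-history`, and `temporal-matching` are assumptions; the real names depend on how Temporal was installed and should be checked with `kubectl get deploy -A` first.

```shell
# Hypothetical namespace; override via the environment.
TEMPORAL_NS="${TEMPORAL_NS:-temporal}"

# Print one `kubectl scale` command per Temporal service.
scale_cmds() {  # $1 = replica count
  for svc in frontend history matching; do
    printf 'kubectl -n %s scale deployment/temporal-%s --replicas=%s\n' \
      "$TEMPORAL_NS" "$svc" "$1"
  done
}

scale_cmds 2            # review the generated commands
# scale_cmds 2 | sh     # then apply them
```

Printing first keeps the step auditable in a maintenance window; a replica count set by hand like this will be overwritten if the deployments are managed by Helm or GitOps, in which case the values belong there instead.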

Phase 5 — failure drill (acceptance)¶

Kill node-1 (the original): confirm (a) kubectl still works via node-2/3, (b) Postgres fails over (Patroni promotes a standby), (c) the public POST /trigger/onboarding still succeeds (LB reroutes), (d) an onboarding workflow started before the kill resumes (Temporal HA).
Restore node-1; confirm it rejoins etcd + Postgres re-replicates.
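Check (c) of the drill is easier to judge with numbers than by eye. The sketch below probes a URL once per second while the node is being killed and reports how many probes succeeded. The function names are illustrative, and which URL to probe is an assumption: a read-only health path is safer than POST /trigger/onboarding, which would start real workflows.

```shell
# Probe a URL once and print the HTTP status (000 if the connection failed).
probe() {  # $1 = URL
  curl -s -o /dev/null -m 5 -w '%{http_code}\n' "$1" || true
}

# Summarize a stream of status codes: 2xx/3xx count as success.
summarize() {
  awk '{ total++; if ($1 >= 200 && $1 < 400) ok++ }
       END { printf "ok=%d failed=%d\n", ok, total - ok }'
}

# Probe once per second for N seconds, then print the summary.
drill() {  # $1 = URL, $2 = duration in seconds
  i=0
  while [ "$i" -lt "$2" ]; do
    probe "$1"
    sleep 1
    i=$((i + 1))
  done | summarize
}
```

Running something like `drill https://f2.tlsstress.art/healthz 120` in one terminal while killing node-1 in another gives a rough measure of the failover gap: the `failed` count approximates the seconds of unavailability.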

Consequences¶

Cost: 1 → 3 × cx33 (~$12 → ~$36/mo) + LB (~$6/mo) + floating IP + object storage for WAL (~$1) ≈ $45/mo. Justified only once F2 carries live onboarding; until then, Phase 0 alone (backups, ~$1/mo) removes the data-loss risk and is the recommended immediate action.
Operational: Patroni adds a moving part (failover logic) — but it is the standard K8s Postgres-HA pattern and well-trodden.
No app changes: the stateless-over-Postgres design (ADR-0053) means HA is purely an infra/topology change; the provisioner, cell Manager, admission Checker, and quota Reconciler are unmodified.
Reversible: each phase is independently revertible; Phase 0 is pure upside with no topology change.
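As a sanity check on the cost line, the itemized figures can be added up. The prices are the approximate ones quoted in this ADR; the floating-IP price is not itemized, so the sketch only derives what the roughly $45/mo total leaves for it.

```shell
# Approximate monthly figures quoted in the cost line (USD).
nodes=3
node_cost=12
lb=6
wal_storage=1

known=$(( nodes * node_cost + lb + wal_storage ))
echo "itemized: ~\$${known}/mo"
echo "left for the floating IP within ~\$45/mo: ~\$$(( 45 - known ))/mo"
```

The itemized parts come to about $43/mo, so the quoted total implies a floating IP on the order of $2/mo; current Hetzner pricing should be checked before budgeting.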

Recommendation¶

Do Phase 0 now (backups + proven restore — removes the catastrophic risk at ~$1/mo, no downtime). Defer Phases 1–5 until F2 onboarding goes from dormant to live; trigger them on the first paying tenant. The 3-node build is a ~1-day operator task following this runbook.