ADR-0105: F2 cell-0 high availability (remove the single-VPS SPOF)
- Status: Proposed — executable plan; not yet applied. Supersedes the Wave-1 single-node posture in ADR-0053 for the cell-0 control plane.
- Date: 2026-06-13
- Driver: F2 auto-provisioning is LIVE on a single Hetzner VPS (cell-0). One node hosts K3s + Traefik + cert-manager + Temporal + Postgres — a hard single-point-of-failure for the provisioning control plane. Node loss = no new tenant onboarding + loss of the cell substrate DB (tenants/slots/quota).
- Relates to: ADR-0053 (cell-based hyperscale), the F2 wiring + Hetzner cell architecture (docs/F2_HETZNER_CELL0_ARCHITECTURE.md), and the Wave-2 slot admission + ledger-quota reconciliation that read/write the cell Postgres.
Context
cell-0 today (docs/F2_HETZNER_CELL0_ARCHITECTURE.md):
| Component | Today (Wave-1) | Failure impact |
|---|---|---|
| K3s | single server (SQLite-backed) | API server down → no scheduling, no reconcile |
| Postgres | single pod, local PVC | data-loss risk — tenants/slots/quota gone |
| Temporal | single, Postgres-backed | onboarding workflows stall + lose history |
| Traefik / ingress | single replica | public f2.tlsstress.art trigger unreachable |
| cert-manager | single | cert renewal stalls (LE certs ~60-day buffer) |
| Public IP | one VPS IP | node loss = endpoint gone until manual rebuild |
This is acceptable for the dormant / pre-revenue posture (the memory records F2 as DORMANT in prod), but it is not acceptable once F2 carries live customer onboarding. The substrate Postgres is the critical asset: it is the source of truth the data plane (connect-art / stun-coord admission, ADR-0053) and billing reconciliation depend on.
This ADR is infrastructure — it cannot be delivered as application code in a
single change; it is the executable plan + runbook to flip cell-0 from
single-node to HA. The application code is already HA-ready (Temporal scales
horizontally; the cell Manager + admission.Checker + quota Reconciler are
stateless over Postgres; provisioning activities are idempotent — see ADR-0053).
Decision
Adopt a 3-node K3s HA cluster for cell-0 with replicated Postgres, behind a Hetzner load balancer + floating IP. Three nodes (not two) because K3s embedded etcd stays available only while a majority of its members is up: three members tolerate the loss of one (quorum 2), whereas two members tolerate none.
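The majority arithmetic behind the three-node choice can be checked with a short shell sketch. The function names are illustrative helpers, not K3s or etcd commands; the formula is the usual Raft majority rule, floor(n/2) + 1.

```shell
#!/bin/sh
# Raft majority arithmetic for an etcd cluster of N members.
# Function names are hypothetical helpers, not K3s or etcd commands.

# Members that must be healthy for the cluster to accept writes.
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Members that can be lost while a quorum is still reachable.
etcd_fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3; do
  echo "members=$n quorum=$(etcd_quorum "$n") tolerates=$(etcd_fault_tolerance "$n")"
done
```

A second node raises the quorum to 2 without adding any tolerance, which is why the step from one node goes directly to three.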
Target topology
          Hetzner LB (f2.tlsstress.art) + Floating IP
                               │  (health-checked :443)
         ┌─────────────────────┼─────────────────────┐
    ┌────▼────┐           ┌────▼────┐           ┌────▼────┐
    │ node-1  │           │ node-2  │           │ node-3  │   cx33 each
    │ k3s srv │           │ k3s srv │           │ k3s srv │   (embedded etcd,
    │  +etcd  │◄─ raft ──►│  +etcd  │◄─ raft ──►│  +etcd  │    3 members, quorum 2)
    └────┬────┘           └────┬────┘           └────┬────┘
         │  Traefik (DaemonSet/2+) · cert-manager (leader-elected)
         │  Temporal frontend/history/matching (≥2 replicas, leader-elected)
         ▼
    Postgres HA ── Patroni + Spilo (1 primary + 2 sync/async replicas)
                   OR Hetzner-managed Postgres if/when offered in the region
Component decisions
- K3s control plane: 3 servers with --cluster-init (embedded etcd, raft). Tolerates 1 node loss. No external DB needed (drops the SQLite SPOF).
- Postgres: Patroni + Spilo StatefulSet (3 replicas: 1 primary, sync standby) with a postgres Service that always points at the leader. WAL archiving to Hetzner Object Storage (S3-compatible) or AWS S3 for PITR; this is the disaster-recovery backstop independent of node count. Patroni handles automatic failover + leader election.
- Temporal: scale frontend/history/matching to ≥2 replicas each; Temporal is already designed for HA over a shared (now-replicated) Postgres.
- Ingress / public endpoint: Hetzner LB with a TCP/HTTPS health check on the Traefik node-port; a floating IP as the DNS A-record target so a node swap doesn't change f2.tlsstress.art. Traefik runs as a DaemonSet (or ≥2 replicas) so every node can terminate ingress.
- cert-manager: single logical instance with leader election (default), already HA-safe across 3 nodes; LE certs survive a ~60-day window regardless.
- DNS: f2.tlsstress.art → floating IP (Cloudflare, proxied as today). The LB health check + floating-IP reassignment is the failover path; DNS TTL stays low (60s) as a secondary path.
Non-goals
- Multi-region cell-0 (this is single-region HA; cross-region is a Wave-N cell fan-out per ADR-0053, not this ADR).
- Zero-RPO. Sync standby gives near-zero RPO for committed txns; WAL-to-S3 is the catastrophic-loss backstop (RPO = archive interval).
Executable plan
Phase 0: backstop FIRST (do before any topology change)
Even while single-node, the substrate DB must be recoverable:
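1. Enable WAL archiving from the current Postgres to S3 (Hetzner Object Storage or the existing AWS tlsstress-* bucket), take a periodic physical base backup (pg_basebackup) for the archived WAL to replay onto, and add a nightly pg_dump to the same bucket. This alone bounds data loss (the worst failure) to the archive interval at ~$1/mo.
2. Verify both restore paths into a throwaway Postgres: pg_restore of the nightly dump, and base backup + WAL replay. WAL replay applies only to a physical base backup, not to a pg_restore'd database. A backup is not a backup until a restore is proven.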
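A minimal sketch of the Phase 0 pieces, assuming the aws CLI is installed where Postgres runs and has credentials for the bucket. The bucket name, the wal/ and dumps/ prefixes, and the 300-second archive_timeout are illustrative assumptions, not values from this ADR. The sketch includes a physical base backup because archived WAL is replayed onto a base backup rather than onto a pg_dump restore.

```shell
#!/bin/sh
# Phase 0 sketch. BUCKET, the wal/ and dumps/ prefixes, and the 300 s
# archive_timeout are illustrative assumptions, not values from this ADR.
BUCKET="${BUCKET:-s3://example-cell0-backups}"

# Print the postgresql.conf settings that turn on continuous WAL archiving.
# Postgres expands %p to the segment path and %f to its file name.
wal_archive_conf() {
  cat <<EOF
wal_level = replica
archive_mode = on
archive_timeout = 300
archive_command = 'aws s3 cp %p $BUCKET/wal/%f'
EOF
}

# Nightly logical dump in custom format, streamed straight to the bucket.
# Expects DATABASE_URL to be set by the caller.
nightly_dump() {
  pg_dump --format=custom "$DATABASE_URL" \
    | aws s3 cp - "$BUCKET/dumps/cell-$(date +%F).dump"
}

# Physical base backup into directory $1; archived WAL is replayed onto this.
base_backup() {
  pg_basebackup --dbname="$DATABASE_URL" --pgdata="$1" \
    --format=tar --gzip --wal-method=fetch
}
```

With archive_timeout at 300, a segment is archived at least every five minutes under write load, so the worst-case loss from the archive alone is roughly that interval. For Hetzner Object Storage the aws commands would also need an --endpoint-url argument pointing at the storage endpoint.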
Phase 1: provision 2 more nodes (Terraform)
Extend the existing hcloud Terraform (the F2-H1 firewall module):
    # 3 servers, same type/image/firewall as cell-0 node-1.
    resource "hcloud_server" "cell0" {
      count        = 3
      name         = "cell0-node-${count.index + 1}"
      server_type  = "cx33"
      image        = "ubuntu-24.04"
      location     = "hel1"
      firewall_ids = [hcloud_firewall.cell0.id]
      # cloud-init installs k3s in server mode (see Phase 2).
    }

    # Load balancer in front of the three nodes.
    resource "hcloud_load_balancer" "cell0" {
      name               = "cell0-lb"
      load_balancer_type = "lb11"
      location           = "hel1"
    }

    # Stable public IPv4 that survives a node swap.
    resource "hcloud_floating_ip" "cell0" {
      type          = "ipv4"
      home_location = "hel1"
    }

    # hcloud_load_balancer_target / _service: HTTPS :443 → Traefik node-port, health-check /healthz.
Phase 2: K3s embedded-etcd HA
- node-1 (existing): re-init with --cluster-init (etcd). K3s can convert an existing SQLite datastore to etcd when the server is restarted with this flag, but treat it as a datastore rebuild: do it AFTER Phase 0 backups + a maintenance window, OR stand up a fresh 3-node cluster and migrate (preferred: less risk).
- node-2/3: k3s server --server https://<node-1>:6443 --token <T>.
- Verify kubectl get nodes shows 3 control-plane,etcd,master Ready; kill one node and confirm the API server stays up.
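The verification step can be scripted roughly as follows. This is a sketch that assumes the default column layout of kubectl get nodes --no-headers (NAME STATUS ROLES AGE VERSION); both function names are hypothetical.

```shell
#!/bin/sh
# Count nodes that are Ready and carry the etcd role, reading the output of
# `kubectl get nodes --no-headers` on stdin (NAME STATUS ROLES AGE VERSION).
count_ready_etcd_nodes() {
  awk '$2 == "Ready" && $3 ~ /etcd/ { n++ } END { print n + 0 }'
}

# Succeeds only when all three servers are Ready etcd members.
verify_cell0_control_plane() {
  ready=$(kubectl get nodes --no-headers | count_ready_etcd_nodes)
  [ "$ready" -eq 3 ]
}
```

Running verify_cell0_control_plane before the kill test gives a clean baseline; after stopping one node, the same pipeline should report 2 while kubectl itself keeps answering.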
Phase 3: Postgres HA (Patroni/Spilo)
- Deploy the Spilo StatefulSet (3 replicas) + a postgres-primary Service selecting the Patroni leader.
- Migrate data: pg_dump from the current single Postgres → restore into the new primary (the substrate is small: tenants/slots/quota). Keep WAL archiving on.
- Repoint every consumer DSN (OCTOPUS_CELL_DATABASE_URL, Temporal, provisioner) at the postgres-primary Service. The data plane (connect-art/stun-coord) + reconciler pick it up via the same env (no code change; see ADR-0053 / Wave-2).
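The DSN repoint in the last step is a host swap that leaves credentials, port, database and options untouched. A small sketch, assuming URL-style DSNs that contain user info and no "@" inside the password; repoint_dsn is a hypothetical helper.

```shell
#!/bin/sh
# Rewrite the host part of a postgres:// URL, keeping credentials, port,
# database and query string. Assumes the URL contains user info ("@") and
# that the password itself contains no "@".
# $1 = original DSN, $2 = new host
repoint_dsn() {
  printf '%s\n' "$1" | sed -E "s#@[^:/?]+#@$2#"
}

# Example (the deployment name "provisioner" is an assumption):
#   kubectl set env deployment/provisioner \
#     OCTOPUS_CELL_DATABASE_URL="$(repoint_dsn "$OLD_DSN" postgres-primary)"
```

If the DSN is stored in a Secret rather than a plain env value, the same rewritten string would go into the Secret instead.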
Phase 4: Temporal + ingress scale-out
- kubectl scale Temporal frontend/history/matching to 2; Traefik to a DaemonSet; cert-manager unchanged (leader-elected).
- Point the Hetzner LB at the Traefik node-port; move the f2.tlsstress.art A-record to the floating IP; set the LB health check to the Traefik /healthz.
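The Temporal scale-out can be staged as a reviewable plan before anything is applied. In this sketch the namespace and the temporal-<service> Deployment names are assumptions; they depend on how Temporal was installed in cell-0.

```shell
#!/bin/sh
# Namespace and Deployment names are assumptions (they depend on how
# Temporal was installed); adjust them to the real cluster.
TEMPORAL_NS="${TEMPORAL_NS:-temporal}"

# Print, rather than run, the scale-out commands so they can be reviewed first.
temporal_scale_plan() {
  for svc in frontend history matching; do
    echo "kubectl -n $TEMPORAL_NS scale deployment/temporal-$svc --replicas=2"
  done
}
```

Piping the output to sh (temporal_scale_plan | sh) applies it. If Temporal is managed by a chart or GitOps tool, the replica count likely belongs in those values instead, since a later sync could reset a manual scale.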
Phase 5: failure drill (acceptance)
- Kill node-1 (the original): confirm (a) kubectl still works via node-2/3, (b) Postgres fails over (Patroni promotes a standby), (c) the public POST /trigger/onboarding still succeeds (LB reroutes), (d) an onboarding workflow started before the kill resumes (Temporal HA).
- Restore node-1; confirm it rejoins etcd + Postgres re-replicates.
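Each drill check is a "wait until this works again" probe, so a small retry helper keeps the acceptance run repeatable. retry_until is a hypothetical helper; the attempt counts and the curl invocation in the comments are assumptions, and the request body or auth that the trigger endpoint needs is not specified here.

```shell
#!/bin/sh
# Retry a command until it succeeds or the attempt budget runs out.
# $1 = max attempts, $2 = seconds between attempts, rest = command to run.
retry_until() {
  attempts=$1; pause=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    [ "$i" -le "$attempts" ] && sleep "$pause"
  done
  return 1
}

# Example drill probes after node-1 is killed (run from an operator machine):
#   (a) retry_until 30 2 kubectl get nodes
#   (c) retry_until 30 2 curl -fsS -o /dev/null -X POST \
#         https://f2.tlsstress.art/trigger/onboarding
```

The number of attempts a probe needs before it passes is a rough measure of the failover time for that layer, which is worth recording alongside the pass/fail result.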
Consequences
- Cost: 1 → 3 × cx33 (~$12 → ~$36/mo) + LB (~$6/mo) + floating IP + object storage for WAL (~$1) ≈ ~$45/mo. Justified only once F2 carries live onboarding; until then, Phase 0 alone (backups, ~$1/mo) removes the data-loss risk and is the recommended immediate action.
- Operational: Patroni adds a moving part (failover logic) — but it is the standard K8s Postgres-HA pattern and well-trodden.
- No app changes: the stateless-over-Postgres design (ADR-0053) means HA is purely an infra/topology change; the provisioner, cell Manager, admission Checker, and quota Reconciler are unmodified.
- Reversible: each phase is independently revertible; Phase 0 is pure upside with no topology change.
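The monthly total in the Cost bullet can be reproduced from its priced items. The sketch below uses only the estimates stated there; the floating IP has no price in this ADR, so it is shown as the remainder up to the ~$45 total rather than as a quoted price.

```shell
#!/bin/sh
# Monthly cost in whole USD, using the per-item estimates from the Cost bullet.
NODE=12; NODES=3; LB=6; WAL_STORAGE=1

known=$(( NODE * NODES + LB + WAL_STORAGE ))   # 3 x 12 + 6 + 1
echo "priced items: \$${known}/mo"
echo "left for floating IP: \$$(( 45 - known ))/mo of the ~\$45 total"
```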
Recommendation
Do Phase 0 now (backups + proven restore — removes the catastrophic risk at ~$1/mo, no downtime). Defer Phases 1–5 until F2 onboarding goes from dormant to live; trigger them on the first paying tenant. The 3-node build is a ~1-day operator task following this runbook.