RELAY.Art Failover Drill

Runbook for the quarterly RELAY HA failover drill. It confirms that the standby (.241) takes over correctly when the primary (.240) fails.

Goal

Validate that failover completes in under 5 s with zero telemetry loss across the drill.

kubectl get pods -n relay-art
curl -fsS https://dashboard.test-bench/api/relay/status

Record the starting state: that the primary is leader, that the standby is ready, and the current FLOW ingest rate.

# Baseline: sample the ingest rate every 10 s for about 1 minute
for i in {1..6}; do
  curl -fsS https://dashboard.test-bench/api/relay/metrics | jq .ingest_bps
  sleep 10
done
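
The six baseline readings are easier to compare later as a single number. A minimal sketch, assuming the samples have been saved one per line; `baseline_mean` is a hypothetical helper, not part of the RELAY tooling, and the values shown are made up:

```shell
# Hypothetical helper: average a column of ingest_bps samples read from
# stdin, one number per line. Exits non-zero if no samples were given.
baseline_mean() {
  awk 'NF { sum += $1; n++ }
       END { if (n) printf "%.0f\n", sum / n; else exit 1 }'
}

# Example with invented sample values:
printf '%s\n' 1200 1180 1220 1210 1190 1200 | baseline_mean   # prints 1200
```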

kubectl exec -it relay-240 -n relay-art -- \
  kill 1   # kills the primary container's PID 1; Kubernetes restarts it per the pod's restartPolicy

watch -n 0.5 'curl -fsS https://dashboard.test-bench/api/relay/status | jq .leader'
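
Watching the leader field shows the switch but does not record how long it took. One way to time it is sketched below, under stated assumptions: `time_to_leader` is a hypothetical helper that polls any command until it prints the expected leader name and then reports the elapsed whole seconds. It assumes the `leader` field holds the pod name (`relay-241` after failover).

```shell
# Hypothetical helper, not part of the RELAY tooling.
# Usage: time_to_leader EXPECTED TIMEOUT_S INTERVAL_S CMD [ARGS...]
# Polls CMD until it prints EXPECTED, then prints the elapsed whole seconds.
# Returns 1 if TIMEOUT_S passes without a match.
time_to_leader() {
  local expected="$1" timeout_s="$2" interval_s="$3"
  local start leader elapsed
  shift 3
  start=$(date +%s)
  while :; do
    # A failed poll (e.g. dashboard briefly unreachable) counts as "no leader".
    leader=$("$@" 2>/dev/null) || leader=""
    elapsed=$(( $(date +%s) - start ))
    if [ "$leader" = "$expected" ]; then
      echo "$elapsed"
      return 0
    fi
    if [ "$elapsed" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep "$interval_s"
  done
}

# Assumed usage against the dashboard (jq -r strips the JSON quotes):
# get_leader() { curl -fsS https://dashboard.test-bench/api/relay/status | jq -r .leader; }
# time_to_leader relay-241 30 0.5 get_leader
```

Start it in a second terminal immediately before the kill so the elapsed time approximates the failover time. Because it relies on `date +%s`, the result is only accurate to about one second.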

# Post-failover: repeat the same sampling for comparison with the baseline
for i in {1..6}; do
  curl -fsS https://dashboard.test-bench/api/relay/metrics | jq .ingest_bps
  sleep 10
done

Compare against the baseline. Any period of zero ingest should last less than 5 s; a longer gap fails the drill.
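
Samples taken 10 s apart cannot resolve a gap shorter than 10 s, so checking the 5 s limit needs finer-grained, timestamped samples. The sketch below assumes such samples have been captured as `epoch_seconds ingest_bps` pairs, for example by polling the metrics endpoint every 0.5 s and prepending a timestamp. `longest_zero_gap` and `check_gap` are hypothetical helpers, and the sample values are invented for illustration:

```shell
# Hypothetical helper: read "epoch_seconds ingest_bps" pairs from stdin and
# print the longest run of zero ingest, in seconds. A run starts at the first
# zero sample and ends at the next non-zero sample (or at the last sample if
# ingest never recovers). Lines with fewer than two fields are skipped.
longest_zero_gap() {
  awk '
    BEGIN { max = 0; in_gap = 0 }
    NF < 2 { next }
    {
      t = $1; bps = $2
      if (bps == 0) {
        if (!in_gap) { in_gap = 1; start = t }
      } else if (in_gap) {
        if (t - start > max) max = t - start
        in_gap = 0
      }
      last = t
    }
    END {
      if (in_gap && last - start > max) max = last - start
      printf "%.1f\n", max
    }'
}

# Hypothetical helper: succeed only if the gap ($1) is below the limit ($2).
check_gap() {
  awk -v gap="$1" -v limit="$2" 'BEGIN { exit !(gap < limit) }'
}

# Example with invented samples: ingest is zero from t=100.5 until t=103.5.
gap=$(printf '%s\n' '100.0 1200' '100.5 0' '101.0 0' '103.5 1150' | longest_zero_gap)
if check_gap "$gap" 5; then echo "PASS: ${gap}s gap"; else echo "FAIL: ${gap}s gap"; fi
```

The measured gap is an upper bound: the true outage ended somewhere between the last zero sample and the first non-zero one, so a shorter polling interval gives a tighter figure.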

# Expected: .241 is now leader; .240 reports "isolated"
kubectl get pods -n relay-art -o wide

After the drill, allow the primary to come back online:

kubectl delete pod relay-240 -n relay-art   # forces fresh start
kubectl wait --for=condition=ready pod relay-240 -n relay-art --timeout=120s
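
`kubectl delete` normally returns once the old pod is gone, and `kubectl wait` on a named pod can fail immediately with NotFound if the replacement has not been created yet. If that race shows up, retrying the wait is one workaround. A sketch, assuming relay-240 is recreated by a controller such as a StatefulSet (this runbook does not confirm that); `retry` is a hypothetical helper:

```shell
# Hypothetical helper, not part of the RELAY tooling.
# Usage: retry ATTEMPTS DELAY_S CMD [ARGS...]
# Runs CMD until it succeeds, at most ATTEMPTS times, sleeping DELAY_S
# seconds between attempts. Returns 1 if every attempt fails.
retry() {
  local attempts="$1" delay_s="$2" i=1
  shift 2
  while ! "$@"; do
    if [ "$i" -ge "$attempts" ]; then
      return 1
    fi
    i=$((i + 1))
    sleep "$delay_s"
  done
}

# Assumed usage for the failback step:
# kubectl delete pod relay-240 -n relay-art
# retry 10 3 kubectl wait --for=condition=Ready pod/relay-240 -n relay-art --timeout=120s
```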

Wait 30 s. The standby should then yield the primary role back to .240 (asymmetric handoff). Verify:

curl -fsS https://dashboard.test-bench/api/relay/status | jq .leader
# expected: relay-240

If failback fails or telemetry stays broken:

1. Force .241 back to primary:

kubectl exec -it relay-241 -n relay-art -- relay-cli promote

2. File a P1 ticket for the bench team.