RELAY.Art Failover Drill
Runbook for the quarterly RELAY HA failover drill — confirms that standby
.241 correctly takes over when primary .240 fails.
Goal
Validate that failover completes in under 5 s, with no telemetry loss beyond the failover window.
Prerequisites
- Test bench (not production)
- RELAY HA pair healthy (.240 primary + .241 standby); a quick check is sketched after this list
- Active telemetry source (NetFlow or Syslog from a test DUT)
- Operator role with chaos permissions
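To confirm the HA pair is healthy before starting, a minimal pre-check might look like the following — it only uses the pod listing and status endpoint already used in Step 1, and assumes the two pods are named relay-240 and relay-241 as in the rest of this runbook:

# pre-flight: both relay pods Ready, status endpoint reachable
kubectl get pods -n relay-art | grep -E 'relay-24[01]'
curl -fsS https://dashboard.test-bench/api/relay/status >/dev/null || echo "ABORT: status endpoint unreachable"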
Procedure
Step 1 — Baseline
kubectl get pods -n relay-art   # both relay pods should be Running and Ready
curl -fsS https://dashboard.test-bench/api/relay/status
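Before triggering anything, confirm the expected starting state. This reuses the .leader field queried in Steps 4 and 7; anything else in the status payload is an assumption about your deployment:

# sanity check: .240 must be leader before the drill starts
curl -fsS https://dashboard.test-bench/api/relay/status | jq -e '.leader == "relay-240"' \
  || echo "ABORT: relay-240 is not leader; fix HA state before proceeding"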
Step 2 — Capture ingest rate baseline (60s)
for i in {1..6}; do
curl -fsS https://dashboard.test-bench/api/relay/metrics | jq .ingest_bps
sleep 10
done
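If you want a record to compare against in Step 5, the same loop can be timestamped and teed to a file (the baseline.log name is just a suggestion):

# variant of the loop above that keeps a timestamped record
for i in {1..6}; do
  echo "$(date +%T) $(curl -fsS https://dashboard.test-bench/api/relay/metrics | jq .ingest_bps)"
  sleep 10
done | tee baseline.log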
Step 3 — Trigger primary failure
kubectl exec relay-240 -n relay-art -- \
  kill 1  # kills the primary relay process (PID 1 in the container); the pod is recovered in Step 7
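Confirm the fault actually landed before starting the stopwatch; depending on the pod's restart policy this shows up as a restart-count bump or a non-Ready state:

kubectl get pod relay-240 -n relay-art   # RESTARTS incremented, or READY dropped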
Step 4 — Watch failover (target < 5s)
watch -n 0.5 'curl -fsS https://dashboard.test-bench/api/relay/status | jq .leader'
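The watch shows the flip but not how long it took. A rough way to timestamp the transition, sampling well inside the 5 s budget (sub-second timestamps assume GNU date):

# log leader changes with timestamps; Ctrl-C once the flip is recorded
prev=""
while true; do
  cur=$(curl -fsS https://dashboard.test-bench/api/relay/status | jq -r .leader)
  [ "$cur" != "$prev" ] && echo "$(date +%T.%3N) leader=$cur"
  prev="$cur"
  sleep 0.2
done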
Step 5 — Capture ingest rate during failover
# sample every second during the failover; a 10 s interval cannot resolve a < 5 s gap
for i in {1..60}; do
  curl -fsS https://dashboard.test-bench/api/relay/metrics | jq .ingest_bps
  sleep 1
done
Compare against the baseline: the window of zero ingest must stay under 5 s. Any larger gap means the drill fails.
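If the Step 5 samples are teed to a file (say failover.log, mirroring the baseline recording suggested in Step 2), the gap length in seconds is simply the count of zero samples at 1 s resolution:

grep -c '^0$' failover.log   # seconds of zero ingest; must be < 5 to pass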
Step 6 — Confirm standby took over + leader election fenced
# Expected: .241 is now leader; .240 reports "isolated"
kubectl get pods -n relay-art -o wide
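The pod listing shows both pods are up, but the authoritative check is the status API. The .leader field is the one used throughout this runbook; the exact field carrying .240's fenced ("isolated") state is an assumption, so adjust to the real payload:

curl -fsS https://dashboard.test-bench/api/relay/status | jq .
# expected: .leader == "relay-241" and relay-240 reporting "isolated"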
Step 7 — Failback (recover primary)
After the drill, allow primary to come back online:
kubectl delete pod relay-240 -n relay-art # forces fresh start
kubectl wait --for=condition=ready pod relay-240 -n relay-art --timeout=120s
Wait 30 s — the standby should yield the primary role back to .240
(asymmetric handoff). Verify:
curl -fsS https://dashboard.test-bench/api/relay/status | jq .leader
# expected: relay-240
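After failback, spot-check that ingest has returned to the baseline level captured in Step 2:

curl -fsS https://dashboard.test-bench/api/relay/metrics | jq .ingest_bps   # should match the Step 2 baseline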
Rollback
If failback fails or telemetry stays broken:
1. Force .241 back to primary:
kubectl exec -it relay-241 -n relay-art -- relay-cli promote
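After forcing the promotion, verify it took effect before escalating further; this reuses the status check from the earlier steps:

curl -fsS https://dashboard.test-bench/api/relay/status | jq .leader   # expected: relay-241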
Success criteria
- Failover < 5s
- Telemetry gap < 5s
- Failback works after primary recovery
- Audit log captured both the failover and the failback events
- No alert flapping during or after the drill
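The runbook does not say where the audit log is exposed; assuming the dashboard serves it next to the other relay endpoints (the /api/relay/audit path below is hypothetical), the last check could look like:

curl -fsS https://dashboard.test-bench/api/relay/audit | jq '.[-2:]'
# hypothetical endpoint: the last two events should be the failover and the failback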