RELAY.Art Failover Drill

Runbook for the quarterly RELAY HA failover drill. It confirms that standby .241 takes over correctly when primary .240 fails.

Goal

Validate that failover completes in under 5s and that the telemetry ingest gap stays under 5s across the drill.

Prerequisites

  • Test bench (not production)
  • RELAY HA pair healthy (.240 primary + .241 standby)
  • Active telemetry source (NetFlow or Syslog from a test DUT)
  • Operator role with chaos permissions

Procedure

Step 1 — Baseline

kubectl get pods -n relay-art
curl -fsS https://dashboard.test-bench/api/relay/status
Record which node is leader (expect the .240 primary), confirm the standby reports ready, and note the current flow ingest rate.

Step 2 — Capture ingest rate baseline (60s)

for i in {1..6}; do
  curl -fsS https://dashboard.test-bench/api/relay/metrics | jq .ingest_bps
  sleep 10
done
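
The six samples above are easier to compare later as a single number. A minimal sketch, assuming ingest_bps is a plain number per line; the helper name mean_ingest is hypothetical and not part of any RELAY tooling:

```shell
# mean_ingest: read one numeric sample per line on stdin and print the mean.
# Blank lines and the literal "null" (jq output for a missing field) are skipped.
mean_ingest() {
  awk '
    $1 != "" && $1 != "null" { sum += $1; n++ }
    END {
      if (n == 0) { print "no samples" > "/dev/stderr"; exit 1 }
      printf "%.1f\n", sum / n
    }
  '
}
```

To use it, pipe the sampling loop into the function by ending the loop with "done | mean_ingest".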

Step 3 — Trigger primary failure

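# Sends SIGTERM to PID 1 in the primary's container. PID 1 only acts on it if the
# process installs a SIGTERM handler. Kubernetes then restarts the container
# according to the pod's restartPolicy.
kubectl exec relay-240 -n relay-art -- kill 1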

Step 4 — Watch failover (target < 5s)

watch -n 0.5 'curl -fsS https://dashboard.test-bench/api/relay/status | jq .leader'
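
Reading a 5s threshold off a watch screen by eye is imprecise. The sketch below times the leader change instead. It assumes GNU date (for %3N milliseconds) and that the status endpoint returns the leader as a string in .leader; the function names get_leader and time_failover are hypothetical.

```shell
# get_leader: print the current leader name from the status endpoint.
get_leader() {
  curl -fsS https://dashboard.test-bench/api/relay/status | jq -r .leader
}

# time_failover OLD_LEADER [TIMEOUT_S] [INTERVAL_S]
# Polls get_leader until it returns a non-empty name other than OLD_LEADER.
# Prints "<new_leader> <elapsed_ms>"; returns 1 if TIMEOUT_S (integer) passes.
time_failover() {
  local old=$1 timeout_s=${2:-30} interval_s=${3:-0.2}
  local start_ms now_ms leader
  start_ms=$(date +%s%3N)   # assumption: GNU date
  while :; do
    leader=$(get_leader 2>/dev/null) || leader=""
    now_ms=$(date +%s%3N)
    if [ -n "$leader" ] && [ "$leader" != "null" ] && [ "$leader" != "$old" ]; then
      echo "$leader $((now_ms - start_ms))"
      return 0
    fi
    if [ $((now_ms - start_ms)) -ge $((timeout_s * 1000)) ]; then
      return 1
    fi
    sleep "$interval_s"
  done
}
```

Start "time_failover relay-240" in a second terminal immediately before Step 3. The clock starts when the function is called, so the reported time includes any delay before the kill and is accurate only to roughly one polling interval plus request latency.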

Step 5 — Capture ingest rate during failover

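# Run in a second terminal, started just before Step 3.
# Sample every 1s: a 10s interval cannot resolve a gap shorter than 5s.
for i in {1..60}; do
  curl -fsS https://dashboard.test-bench/api/relay/metrics | jq .ingest_bps
  sleep 1
done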

Compare against the baseline: ingest may drop to zero for less than 5s. Any longer gap fails the drill.
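
The gap check can be scripted rather than read off the terminal. A minimal sketch, assuming each sample is logged as "<epoch_seconds> <ingest_bps>" and that a failed request produces no line at all; the helper name longest_zero_gap is hypothetical:

```shell
# longest_zero_gap: read "<epoch_seconds> <ingest_bps>" lines on stdin and print
# the longest stretch of zero ingest in seconds. A stretch runs from the first
# zero sample to the next non-zero sample (or the last zero sample at end of input).
longest_zero_gap() {
  awk '
    NF < 2 { next }                      # skip blank or incomplete lines
    $2 == 0 {
      if (!in_gap) { in_gap = 1; gap_start = $1 }
      last_zero = $1
      next
    }
    {
      if (in_gap) {
        d = $1 - gap_start
        if (d > max) max = d
        in_gap = 0
      }
    }
    END {
      if (in_gap) {
        d = last_zero - gap_start
        if (d > max) max = d
      }
      print max + 0
    }
  '
}
```

To produce the expected input, prefix each sample with a timestamp inside the loop, for example echo "$(date +%s) $(curl -fsS https://dashboard.test-bench/api/relay/metrics | jq .ingest_bps)", and pipe the loop into longest_zero_gap. The result is only as fine-grained as the sampling interval.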

Step 6 — Confirm standby took over + leader election fenced

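# Expected: .241 is now leader; .240 reports "isolated"
# kubectl shows pod state only; read the leader from the status API.
curl -fsS https://dashboard.test-bench/api/relay/status | jq .leader
kubectl get pods -n relay-art -o wide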

Step 7 — Failback (recover primary)

After the drill, allow primary to come back online:

kubectl delete pod relay-240 -n relay-art   # forces a fresh start
# If the wait reports NotFound, the pod has not been recreated yet; rerun it.
kubectl wait --for=condition=ready pod relay-240 -n relay-art --timeout=120s

Wait 30s. The standby should yield the primary role back to .240 (asymmetric handoff). Verify:

curl -fsS https://dashboard.test-bench/api/relay/status | jq .leader
# expected: relay-240
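
A fixed 30s wait can be replaced by polling with a deadline, which gives a clear pass or fail result for the handoff. A sketch under the same assumption that .leader holds the leader name; get_leader and wait_for_leader are hypothetical helper names:

```shell
# get_leader: print the current leader name from the status endpoint.
get_leader() {
  curl -fsS https://dashboard.test-bench/api/relay/status | jq -r .leader
}

# wait_for_leader EXPECTED [TIMEOUT_S]
# Polls once per second until the leader equals EXPECTED.
# Returns 0 on success, 1 if TIMEOUT_S (default 60) elapses first.
wait_for_leader() {
  local expected=$1 timeout_s=${2:-60} leader=""
  local deadline=$((SECONDS + timeout_s))
  while [ "$SECONDS" -lt "$deadline" ]; do
    leader=$(get_leader 2>/dev/null) || leader=""
    [ "$leader" = "$expected" ] && return 0
    sleep 1
  done
  echo "leader is '${leader:-unknown}', expected '$expected'" >&2
  return 1
}
```

For example, "wait_for_leader relay-240 60" returns non-zero if .240 has not reclaimed the primary role within a minute, which is the cue to move to Rollback.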

Rollback

If failback fails or telemetry stays broken:

1. Force .241 back to primary:

kubectl exec -it relay-241 -n relay-art -- relay-cli promote

2. File a P1 ticket for the bench team.

Success criteria

  • Failover < 5s
  • Telemetry gap < 5s
  • Failback works after primary recovery
  • Audit log captured both events
  • No alert flap
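
The two numeric criteria can be checked mechanically once the failover time and telemetry gap have been measured. A minimal sketch with a hypothetical helper name, check_drill, taking integer inputs; failback, audit log, and alert checks still need manual confirmation:

```shell
# check_drill FAILOVER_MS GAP_S
# Compares measured values against the 5s thresholds in this runbook.
# Prints one PASS/FAIL line per criterion; returns 1 if either fails.
check_drill() {
  local failover_ms=$1 gap_s=$2 rc=0
  if [ "$failover_ms" -lt 5000 ]; then
    echo "PASS failover ${failover_ms}ms"
  else
    echo "FAIL failover ${failover_ms}ms (limit: under 5000ms)"
    rc=1
  fi
  if [ "$gap_s" -lt 5 ]; then
    echo "PASS telemetry gap ${gap_s}s"
  else
    echo "FAIL telemetry gap ${gap_s}s (limit: under 5s)"
    rc=1
  fi
  return "$rc"
}
```

For example, "check_drill 3200 2" passes both checks, while a gap of exactly 5s fails because the criterion is strictly less than 5s.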