Chaos — RELAY Disconnect Mid-Test
Failure injection runbook: drop the customer-side MGMT NIC on the primary RELAY mid-test and validate telemetry recovery.
Goal
Confirm that disconnecting the RELAY customer-side MGMT interface does not lose in-flight telemetry data and that HA failover engages cleanly.
Prerequisites
- Test bench (not production)
- RELAY HA pair deployed (.240 primary + .241 standby)
- Active test run with telemetry flowing (NetFlow or Syslog)
- Operator role with chaos permissions
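The prerequisites above can be checked before any fault is injected. The following is a minimal preflight sketch, not an official tool. It assumes that production kubectl contexts contain `prod` in their name (a hypothetical naming convention) and that the standby pod is named `relay-241` by analogy with `relay-240` (only `relay-240` appears in this runbook). Adjust both to your environment.

```shell
#!/usr/bin/env bash
# Preflight sketch. Defines functions only; nothing runs until you call them.

# Refuse to continue when the context name looks like production.
# ASSUMPTION: production contexts contain "prod" in their name.
assert_not_production() {
  local ctx=$1
  case "$ctx" in
    *prod*|*PROD*)
      echo "refusing to run chaos test against context '$ctx'" >&2
      return 1
      ;;
  esac
  return 0
}

# Check that both RELAY pods report phase Running.
# ASSUMPTION: the standby pod is named relay-241.
relay_pair_running() {
  local pod phase
  for pod in relay-240 relay-241; do
    phase=$(kubectl get pod "$pod" -n relay-art -o jsonpath='{.status.phase}') || return 1
    if [ "$phase" != "Running" ]; then
      echo "$pod is in phase '$phase', not Running" >&2
      return 1
    fi
  done
  return 0
}

# Run both checks against the currently selected kubectl context.
preflight() {
  assert_not_production "$(kubectl config current-context)" && relay_pair_running
}
```

Calling `preflight` returns non-zero if either check fails, so it can gate the rest of the procedure. It does not verify the operator role or that telemetry is flowing; those still need a manual check.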
Procedure
- Note current state:
    kubectl get pods -n relay-art
    curl -fsS https://dashboard.test-bench/api/relay/status
- Drop the customer-side interface on the primary RELAY:
    kubectl exec -it relay-240 -n relay-art -- \
      ip link set eth1 down
- Watch for HA failover (target: < 5s):
    watch -n 1 'curl -fsS https://dashboard.test-bench/api/relay/status'
- Expected behavior:
  - Primary RELAY reports unhealthy within 5s
  - Standby .241 takes over (leader election)
  - Telemetry gap ≤ 5s in FLOW.Art TSDB
  - Alert fires (RELAYFailoverEvent)
  - Test run continues without abort
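Watching the status endpoint by eye makes the 5s target hard to judge, so the failover time can instead be measured with a small polling helper. This is a sketch under stated assumptions: `seconds_until` and `standby_is_active` are hypothetical helper names, and the probe assumes the status response contains a field such as `"active": "relay-241"` once the standby has taken over. The real response schema is not documented here, so adapt the pattern to the actual payload.

```shell
#!/usr/bin/env bash
# Failover timing sketch. Defines functions only.

STATUS_URL="https://dashboard.test-bench/api/relay/status"

# Poll a probe command once per second until it succeeds.
# Prints the elapsed whole seconds and returns 0 on success;
# returns 1 if the probe has not succeeded after TIMEOUT seconds.
seconds_until() {
  local timeout=$1
  shift
  local start now
  start=$(date +%s)
  while :; do
    if "$@" >/dev/null; then
      now=$(date +%s)
      echo $(( now - start ))
      return 0
    fi
    now=$(date +%s)
    if [ $(( now - start )) -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
  done
}

# ASSUMPTION: the status JSON names the current leader in an "active"
# field and the standby pod is called relay-241. Both are hypothetical.
standby_is_active() {
  curl -fsS "$STATUS_URL" |
    grep -Eq '"active"[[:space:]]*:[[:space:]]*"relay-241"'
}
```

Right after dropping the interface, `seconds_until 10 standby_is_active` prints the observed failover time in seconds, or returns non-zero if the standby has not taken over within 10s. One-second polling gives whole-second resolution, the same cadence as `watch -n 1`.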
Rollback
kubectl exec -it relay-240 -n relay-art -- \
ip link set eth1 up
Watch the primary recover. The standby releases the primary role within 30s; the handoff back is deliberately slower than the failover (asymmetric) to avoid a flap loop.
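To keep a record of the handback, the rollback can be paired with timestamped status snapshots covering the 30s window. This is a minimal sketch: `restore_primary_link`, `log_status`, and `rollback` are hypothetical helper names, and the status body is logged verbatim because its schema is not documented here.

```shell
#!/usr/bin/env bash
# Rollback sketch. Defines functions only.

STATUS_URL="https://dashboard.test-bench/api/relay/status"

# Bring the customer-side interface back up on the primary.
# -it is omitted: a one-shot command needs no TTY or stdin.
restore_primary_link() {
  kubectl exec relay-240 -n relay-art -- ip link set eth1 up
}

# Print COUNT status snapshots, INTERVAL seconds apart, each prefixed
# with a UTC timestamp so the handback moment can be read off the log.
log_status() {
  local count=$1 interval=$2 i=0 body
  while [ "$i" -lt "$count" ]; do
    body=$(curl -fsS "$STATUS_URL") || body="status endpoint unreachable"
    printf '%s %s\n' "$(date -u +%H:%M:%S)" "$body"
    i=$(( i + 1 ))
    if [ "$i" -lt "$count" ]; then
      sleep "$interval"
    fi
  done
  return 0
}

# Restore the link, then record roughly 30 seconds of status output.
rollback() {
  restore_primary_link && log_status 30 1
}
```

Running `rollback | tee rollback.log` keeps the snapshots for the audit trail. The 30 samples at 1s intervals mirror the 30s handback window stated above.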
Success criteria
- HA failover within 5s
- Telemetry gap < 5s
- No test abort
- Alert fired + audit logged
- Recovery returns to primary within 30s
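The criteria above can be applied mechanically once the values have been measured. The sketch below only encodes the thresholds from this list and does not collect the measurements itself. `evaluate_run` and its argument order are hypothetical, "within" is read as inclusive (an assumption), and the combined alert and audit criterion is split into two checks.

```shell
#!/usr/bin/env bash
# Success-criteria sketch. Inputs are values measured by the operator;
# this only applies the thresholds from the list above.

# Numeric comparisons via awk so fractional seconds work.
num_le() { awk -v a="$1" -v b="$2" 'BEGIN { exit !(a + 0 <= b + 0) }'; }
num_lt() { awk -v a="$1" -v b="$2" 'BEGIN { exit !(a + 0 <  b + 0) }'; }
is_yes() { [ "$1" = "yes" ]; }
is_no()  { [ "$1" = "no" ]; }

# check LABEL CMD...: run CMD, print PASS or FAIL with the label,
# and return non-zero on FAIL.
check() {
  local label=$1
  shift
  if "$@"; then
    echo "PASS  $label"
  else
    echo "FAIL  $label"
    return 1
  fi
}

# Arguments: failover_s gap_s aborted(yes/no) alert_fired(yes/no)
#            audit_logged(yes/no) recovery_s
evaluate_run() {
  local failed=0
  check "HA failover within 5s"            num_le "$1" 5  || failed=1
  check "Telemetry gap < 5s"               num_lt "$2" 5  || failed=1
  check "No test abort"                    is_no  "$3"    || failed=1
  check "Alert fired"                      is_yes "$4"    || failed=1
  check "Audit logged"                     is_yes "$5"    || failed=1
  check "Recovery to primary within 30s"   num_le "$6" 30 || failed=1
  return "$failed"
}
```

For example, `evaluate_run 3 4.5 no yes yes 20` prints six PASS lines and returns 0, while a telemetry gap of exactly 5 fails the strict `< 5s` check.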