Skip to content

Chaos — RELAY Disconnect Mid-Test

Failure injection runbook: drop the customer-side MGMT NIC on RELAY mid-test + validate telemetry recovery.

Goal

Confirm that a RELAY-customer-MGMT disconnection doesn't lose in-flight telemetry data and HA failover engages cleanly.

Prerequisites

  • Test bench (not production)
  • RELAY HA pair deployed (.240 primary + .241 standby)
  • Active test run with telemetry flowing (NetFlow or Syslog)
  • Operator role with chaos permissions

Procedure

  1. Note current state:

    kubectl get pods -n relay-art
    curl -fsS https://dashboard.test-bench/api/relay/status
    

  2. Drop customer-side iface on primary RELAY:

    kubectl exec -it relay-240 -n relay-art -- \
      ip link set eth1 down
    

  3. Watch for HA failover (target: < 5s):

    watch -n 1 'curl -fsS https://dashboard.test-bench/api/relay/status'
    

  4. Expected behavior:

  5. Primary RELAY reports unhealthy within 5s
  6. Standby .241 takes over (leader election)
  7. Telemetry gap ≤ 5s in FLOW.Art TSDB
  8. Alert fires (RELAYFailoverEvent)
  9. Test run continues without abort

Rollback

kubectl exec -it relay-240 -n relay-art -- \
  ip link set eth1 up

Watch primary recover. Standby releases primary role within 30s (asymmetric handoff to avoid flap loop).

Success criteria

  • HA failover within 5s
  • Telemetry gap < 5s
  • No test abort
  • Alert fired + audit logged
  • Recovery returns to primary within 30s