Skip to content

Quarterly DR Drill

Runbook for the quarterly disaster recovery drill. Pairs with ADR 0017.

Goal

Validate that the bench can recover from total loss with RTO < 30 min + RPO < 5 min. Runs once per quarter; results signed off by team lead + audit-logged.

Prerequisites

  • Scheduled maintenance window (notify operators 48h in advance)
  • Recent backup confirmed (velero backup get shows last 24h success)
  • Off-site DR replication green (velero schedule get | grep dr-replica)
  • Test bench available to receive restore (not production bench)
  • Team lead signoff to start

Procedure

Phase 1 — Pre-flight (15 min)

  1. Confirm backup state:

    velero backup get --output table | tail -5
    
    Verify the latest backup shows Completed + age < 24h.

  2. Confirm WAL-G PostgreSQL WAL archive:

    kubectl exec -it postgres-0 -n platform -- wal-g backup-list
    
    Latest WAL backup age < 5min (RPO target).

  3. Snapshot the test bench's current state for rollback safety:

    velero backup create pre-drill-snapshot-$(date +%Y%m%d) \
      --include-namespaces=test-bench --wait
    

  4. Notify operators in #ops Slack channel:

    :warning: DR drill starting on test-bench. ETA 30 min.
    Production bench unaffected. Audit trail in /admin/audit?drill=true.
    

Phase 2 — Simulated disaster (5 min)

  1. Wipe the test bench cluster state:

    kubectl delete namespace test-bench --wait
    kubectl delete pv -l app.kubernetes.io/part-of=tlsstress-art --wait
    

  2. Confirm namespace + PVs are gone:

    kubectl get ns test-bench 2>&1 | grep "not found"
    kubectl get pv | grep tlsstress-art | wc -l   # expect 0
    

Phase 3 — Restore (15 min, target RTO)

Start a stopwatch. Note start time in the audit log.

  1. Restore from latest backup:

    velero restore create test-bench-restore-$(date +%Y%m%d) \
      --from-backup $(velero backup get -o json | jq -r '.items[0].metadata.name') \
      --wait
    

  2. Replay PostgreSQL WAL to latest committed transaction:

    kubectl exec -it postgres-0 -n platform -- wal-g wal-fetch ...
    

  3. Validate all MÓDULOs come up:

    kubectl get pods -n test-bench --no-headers | grep -v Running
    
    Expect: zero non-Running pods.

  4. Run smoke tests:

    curl -fsS https://dashboard.test-bench.tlsstress.art/api/health
    # expect: {"status":"ok"}
    

  5. Note restore complete time. Verify elapsed time < 30 min (RTO target).

Phase 4 — Validation (10 min)

  1. Run a known-good test plan against the restored bench:
  2. Open dashboard → Test Plans → "DR Drill Smoke v2"
  3. Verify the plan completes within 5 min
  4. Verify report is identical to last quarter's run

  5. Check audit log:

    curl -fsS https://dashboard.test-bench.tlsstress.art/api/audit?since=1h | jq '.records[] | select(.event_type=="restore")'
    
    Verify restore events captured.

  6. Check telemetry sanity:

  7. Prometheus targets all green
  8. Grafana dashboards load without errors
  9. Alerts not firing falsely

Phase 5 — Documentation (10 min)

  1. Record drill results in /admin/audit with drill=true tag:
  2. Start time / end time
  3. Elapsed RTO
  4. Measured RPO (last WAL backup age at disaster time)
  5. Issues encountered
  6. Team lead signoff (PGP signature)

  7. Submit DR drill report to compliance archive:

    gh issue create --repo nollagluiz/AI_forSE \
      --title "DR Drill Q$(date +%q) $(date +%Y) — results" \
      --label "compliance,dr-drill" \
      --body "$(cat drill-report.md)"
    

Rollback

If the drill fails (RTO breach, missing data, smoke tests fail):

  1. Do not retry on the test bench — investigate first.
  2. Restore the test bench from the pre-drill snapshot (Phase 1 step 3):
    velero restore create test-bench-rollback-$(date +%Y%m%d) \
      --from-backup pre-drill-snapshot-$(date +%Y%m%d) --wait
    
  3. Write up the root cause + file a P1 issue for the bench team.

Success criteria

  • RTO < 30 min (start of Phase 3 to end of Phase 4 step 2)
  • RPO < 5 min (WAL backup age at disaster time)
  • All MÓDULOs running post-restore
  • Audit log captured every step
  • Smoke test passes
  • Team lead signoff completed
  • DR drill report filed in compliance archive