Quarterly DR Drill¶
Runbook for the quarterly disaster recovery drill. Pairs with ADR 0017.
Goal¶
Validate that the bench can recover from total loss with RTO < 30 min + RPO < 5 min. Runs once per quarter; results signed off by team lead + audit-logged.
Prerequisites¶
- Scheduled maintenance window (notify operators 48h in advance)
- Recent backup confirmed (
velero backup getshows last 24h success) - Off-site DR replication green (
velero schedule get | grep dr-replica) - Test bench available to receive restore (not production bench)
- Team lead signoff to start
Procedure¶
Phase 1 — Pre-flight (15 min)¶
-
Confirm backup state:
Verify the latest backup showsvelero backup get --output table | tail -5Completed+ age < 24h. -
Confirm WAL-G PostgreSQL WAL archive:
Latest WAL backup age < 5min (RPO target).kubectl exec -it postgres-0 -n platform -- wal-g backup-list -
Snapshot the test bench's current state for rollback safety:
velero backup create pre-drill-snapshot-$(date +%Y%m%d) \ --include-namespaces=test-bench --wait -
Notify operators in #ops Slack channel:
:warning: DR drill starting on test-bench. ETA 30 min. Production bench unaffected. Audit trail in /admin/audit?drill=true.
Phase 2 — Simulated disaster (5 min)¶
-
Wipe the test bench cluster state:
kubectl delete namespace test-bench --wait kubectl delete pv -l app.kubernetes.io/part-of=tlsstress-art --wait -
Confirm namespace + PVs are gone:
kubectl get ns test-bench 2>&1 | grep "not found" kubectl get pv | grep tlsstress-art | wc -l # expect 0
Phase 3 — Restore (15 min, target RTO)¶
Start a stopwatch. Note start time in the audit log.
-
Restore from latest backup:
velero restore create test-bench-restore-$(date +%Y%m%d) \ --from-backup $(velero backup get -o json | jq -r '.items[0].metadata.name') \ --wait -
Replay PostgreSQL WAL to latest committed transaction:
kubectl exec -it postgres-0 -n platform -- wal-g wal-fetch ... -
Validate all MÓDULOs come up:
Expect: zero non-Running pods.kubectl get pods -n test-bench --no-headers | grep -v Running -
Run smoke tests:
curl -fsS https://dashboard.test-bench.tlsstress.art/api/health # expect: {"status":"ok"} -
Note restore complete time. Verify elapsed time < 30 min (RTO target).
Phase 4 — Validation (10 min)¶
- Run a known-good test plan against the restored bench:
- Open dashboard → Test Plans → "DR Drill Smoke v2"
- Verify the plan completes within 5 min
-
Verify report is identical to last quarter's run
-
Check audit log:
Verify restore events captured.curl -fsS https://dashboard.test-bench.tlsstress.art/api/audit?since=1h | jq '.records[] | select(.event_type=="restore")' -
Check telemetry sanity:
- Prometheus targets all green
- Grafana dashboards load without errors
- Alerts not firing falsely
Phase 5 — Documentation (10 min)¶
- Record drill results in
/admin/auditwithdrill=truetag: - Start time / end time
- Elapsed RTO
- Measured RPO (last WAL backup age at disaster time)
- Issues encountered
-
Team lead signoff (PGP signature)
-
Submit DR drill report to compliance archive:
gh issue create --repo nollagluiz/AI_forSE \ --title "DR Drill Q$(date +%q) $(date +%Y) — results" \ --label "compliance,dr-drill" \ --body "$(cat drill-report.md)"
Rollback¶
If the drill fails (RTO breach, missing data, smoke tests fail):
- Do not retry on the test bench — investigate first.
- Restore the test bench from the pre-drill snapshot (Phase 1 step 3):
velero restore create test-bench-rollback-$(date +%Y%m%d) \ --from-backup pre-drill-snapshot-$(date +%Y%m%d) --wait - Write up the root cause + file a P1 issue for the bench team.
Success criteria¶
- RTO < 30 min (start of Phase 3 to end of Phase 4 step 2)
- RPO < 5 min (WAL backup age at disaster time)
- All MÓDULOs running post-restore
- Audit log captured every step
- Smoke test passes
- Team lead signoff completed
- DR drill report filed in compliance archive
Related¶
- ADR 0017
- Backup/DR operator guide
- Memory:
project_backup_dr_strategy_2026_05_08