Quarterly DR Drill¶

Runbook for the quarterly disaster recovery drill. Pairs with ADR 0017.

Goal¶

Validate that the bench can recover from total loss with RTO < 30 min + RPO < 5 min. Runs once per quarter; results signed off by team lead + audit-logged.

Prerequisites¶

Scheduled maintenance window (notify operators 48h in advance)
Recent backup confirmed (velero backup get shows last 24h success)
Off-site DR replication green (velero schedule get | grep dr-replica)
Test bench available to receive restore (not production bench)
Team lead signoff to start

Procedure¶

Phase 1 — Pre-flight (15 min)¶

Confirm backup state:
```
velero backup get --output table | tail -5
```
Verify the latest backup shows Completed + age < 24h.
Confirm WAL-G PostgreSQL WAL archive:
```
kubectl exec -it postgres-0 -n platform -- wal-g backup-list
```
Latest WAL backup age < 5min (RPO target).

Snapshot the test bench's current state for rollback safety:

velero backup create pre-drill-snapshot-$(date +%Y%m%d) \
  --include-namespaces=test-bench --wait

Notify operators in #ops Slack channel:

:warning: DR drill starting on test-bench. ETA 30 min.
Production bench unaffected. Audit trail in /admin/audit?drill=true.

Phase 2 — Simulated disaster (5 min)¶

Wipe the test bench cluster state:

kubectl delete namespace test-bench --wait
kubectl delete pv -l app.kubernetes.io/part-of=tlsstress-art --wait

Confirm namespace + PVs are gone:

kubectl get ns test-bench 2>&1 | grep "not found"
kubectl get pv | grep tlsstress-art | wc -l   # expect 0

Phase 3 — Restore (15 min, target RTO)¶

Start a stopwatch. Note start time in the audit log.

Restore from latest backup:

velero restore create test-bench-restore-$(date +%Y%m%d) \
  --from-backup $(velero backup get -o json | jq -r '.items[0].metadata.name') \
  --wait

Replay PostgreSQL WAL to latest committed transaction:

kubectl exec -it postgres-0 -n platform -- wal-g wal-fetch ...

Validate all MÓDULOs come up:

kubectl get pods -n test-bench --no-headers | grep -v Running

Expect: zero non-Running pods.

Run smoke tests:

curl -fsS https://dashboard.test-bench.tlsstress.art/api/health
# expect: {"status":"ok"}

Note restore complete time. Verify elapsed time < 30 min (RTO target).

Phase 4 — Validation (10 min)¶

Run a known-good test plan against the restored bench:
Open dashboard → Test Plans → "DR Drill Smoke v2"
Verify the plan completes within 5 min
Verify report is identical to last quarter's run

Check audit log:

curl -fsS https://dashboard.test-bench.tlsstress.art/api/audit?since=1h | jq '.records[] | select(.event_type=="restore")'

Verify restore events captured.

Check telemetry sanity:
Prometheus targets all green
Grafana dashboards load without errors
Alerts not firing falsely

Phase 5 — Documentation (10 min)¶

Record drill results in /admin/audit with drill=true tag:
Start time / end time
Elapsed RTO
Measured RPO (last WAL backup age at disaster time)
Issues encountered
Team lead signoff (PGP signature)

Submit DR drill report to compliance archive:

gh issue create --repo nollagluiz/AI_forSE \
  --title "DR Drill Q$(date +%q) $(date +%Y) — results" \
  --label "compliance,dr-drill" \
  --body "$(cat drill-report.md)"

Rollback¶

If the drill fails (RTO breach, missing data, smoke tests fail):

Do not retry on the test bench — investigate first.

Restore the test bench from the pre-drill snapshot (Phase 1 step 3):

velero restore create test-bench-rollback-$(date +%Y%m%d) \
  --from-backup pre-drill-snapshot-$(date +%Y%m%d) --wait

Write up the root cause + file a P1 issue for the bench team.

Success criteria¶

RTO < 30 min (start of Phase 3 to end of Phase 4 step 2)
RPO < 5 min (WAL backup age at disaster time)
All MÓDULOs running post-restore
Audit log captured every step
Smoke test passes
Team lead signoff completed
DR drill report filed in compliance archive

ADR 0017
Backup/DR operator guide
Memory: project_backup_dr_strategy_2026_05_08