Restore from Backup: Standard Operation

Runbook for restoring a single namespace or the full bench from backup. Pairs with ADR 0017.

Goal

Restore bench state to a previous known-good point. Two scenarios:

- Partial: single namespace (e.g. web-agents, validator)
- Full: entire bench (DR-grade; use the DR drill runbook)

Prerequisites

- Admin role
- Recent backup available (velero backup get shows a recent backup in phase Completed)
- Maintenance window appropriate to the restore scope

Partial restore (single namespace)

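Step 1: List available backups for the namespace

# -c prints one backup per line, so head does not cut a JSON object in half.
# Exact match on the namespace (or "*"); a missing includedNamespaces list is treated as empty.
velero backup get --output json | \
  jq -c --arg ns TARGET_NS '.items[] | select((.spec.includedNamespaces // []) | any(. == $ns or . == "*")) | {name: .metadata.name, started: .status.startTimestamp}' | \
  head -10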

Step 2: Pick a backup and preview it

velero backup describe BACKUP_NAME --details | head -50

Step 3: Stop active workloads in the target namespace

kubectl scale --all deployment -n TARGET_NS --replicas=0
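
Scaling everything to zero discards the replica counts that were in effect, so it can help to record them before running the command above. The following is a minimal sketch under that assumption; the function names save_replicas and restore_replicas and the file path in the usage comment are illustrative and not part of the original runbook.

```shell
# Sketch: record Deployment replica counts before scaling to zero, then
# replay them after the restore. Function and file names are illustrative.

# Print "<deployment><TAB><replicas>" for every Deployment in namespace $1.
save_replicas() {
  kubectl get deployment -n "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.replicas}{"\n"}{end}'
}

# Read "<deployment><TAB><replicas>" lines from file $2 and scale each
# Deployment in namespace $1 back to its recorded count.
restore_replicas() {
  while IFS="$(printf '\t')" read -r name count; do
    [ -n "$name" ] || continue
    kubectl scale deployment "$name" -n "$1" --replicas="$count" </dev/null
  done < "$2"
}

# Usage (run the first line before the scale-down, the second in Step 5):
#   save_replicas TARGET_NS > /tmp/TARGET_NS-replicas.tsv
#   restore_replicas TARGET_NS /tmp/TARGET_NS-replicas.tsv
```

Keeping the counts in a file rather than in shell variables means they survive a lost terminal session during the maintenance window.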

Step 4: Restore

velero restore create restore-$(date +%Y%m%d-%H%M) \
  --from-backup BACKUP_NAME \
  --include-namespaces TARGET_NS \
  --wait
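
The --wait flag returns once the restore finishes, but a restore can finish in a phase other than Completed (for example PartiallyFailed). A small check along these lines can make that explicit. It is a sketch that assumes the "Phase:" line printed by velero restore describe; the function names are illustrative, and RESTORE_NAME stands for the name generated in the command above.

```shell
# Sketch: read the phase of a Velero restore and fail unless it is Completed.

# Print the value of the "Phase:" line from `velero restore describe`.
restore_phase() {
  velero restore describe "$1" | awk '$1 == "Phase:" { print $2; exit }'
}

# Return 0 only when the restore finished in phase "Completed".
restore_ok() {
  phase="$(restore_phase "$1")"
  echo "restore $1: phase=${phase:-unknown}"
  [ "$phase" = "Completed" ]
}

# Usage:
#   restore_ok RESTORE_NAME || velero restore logs RESTORE_NAME
```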

Step 5: Verify and scale back up

kubectl get pods -n TARGET_NS
# Sets every deployment to 1 replica; adjust deployments that normally run more.
kubectl scale --all deployment -n TARGET_NS --replicas=1
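
After scaling back up, reading the pod list by eye can be replaced with a scripted check. The sketch below lists pods whose phase is neither Running nor Succeeded; the function names are illustrative. Note that phase Running does not imply the readiness probes pass, so this complements the smoke tests and does not replace them.

```shell
# Sketch: flag pods that have not reached a healthy phase after the restore.

# List pods in namespace $1 whose phase is neither Running nor Succeeded.
unhealthy_pods() {
  kubectl get pods -n "$1" \
    --field-selector=status.phase!=Running,status.phase!=Succeeded \
    -o name
}

# Return 0 when no such pod exists; otherwise print the offenders and return 1.
pods_healthy() {
  bad="$(unhealthy_pods "$1")"
  if [ -n "$bad" ]; then
    echo "pods not yet healthy in $1:"
    echo "$bad"
    return 1
  fi
  echo "all pods in $1 are Running or Succeeded"
}

# Usage:
#   pods_healthy TARGET_NS
```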

Full bench restore

See the DR drill runbook. It uses the same procedure, but in a controlled maintenance window with operator notification and a DR drill audit trail.

PostgreSQL point-in-time recovery (PITR)

For a database-specific restore (e.g. an operator dropped a table):

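kubectl exec -it postgres-0 -n platform -- bash -c "
  pg_ctl stop -m fast
  rm -rf /var/lib/postgresql/data/*
  # The base backup must predate the recovery target; tail -1 picks the newest one.
  wal-g backup-fetch /var/lib/postgresql/data \$(wal-g backup-list | tail -1 | awk '{print \$1}')
  # PostgreSQL 12+: recovery settings go in the server config; recovery.signal is an empty marker file.
  # The restore_command line assumes WAL is archived with wal-g; drop it if one is already configured.
  cat >> /var/lib/postgresql/data/postgresql.auto.conf <<EOF
restore_command = 'wal-g wal-fetch %f %p'
recovery_target_time = '2026-05-11 12:00:00'
recovery_target_action = 'pause'
EOF
  touch /var/lib/postgresql/data/recovery.signal
  pg_ctl start
"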

After verifying the data state, promote:

kubectl exec -it postgres-0 -n platform -- pg_ctl promote
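
Before running the promote command, it can be worth confirming that the server is still in recovery and seeing how far replay has progressed. The sketch below does this with two built-in PostgreSQL functions. It assumes a superuser named postgres and that the paused server accepts read-only connections (hot_standby on, which is the default); the function names are illustrative.

```shell
# Sketch: inspect the paused server before promotion.

# Print "<in_recovery>|<last_replayed_commit_timestamp>" from the server.
# Assumes a local superuser named "postgres"; pod and namespace match the runbook.
replay_state() {
  kubectl exec postgres-0 -n platform -- \
    psql -U postgres -At -c \
    "SELECT pg_is_in_recovery(), pg_last_xact_replay_timestamp()"
}

# Return 0 while the server is still in recovery, i.e. not yet promoted.
still_in_recovery() {
  state="$(replay_state)"
  echo "replay state: $state"
  [ "${state%%|*}" = "t" ]
}

# Usage:
#   still_in_recovery && echo "safe to inspect tables, then promote"
```

The timestamp should sit at or just before the recovery target; if it falls short, the required WAL may not have been available.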

Rollback

If the restore turns out to be incorrect:

1. Take a snapshot of the post-restore state (for forensics)
2. Restore from a different backup (an earlier point)

Never run kubectl apply over a partially restored namespace. Wipe the namespace between restore attempts to avoid mixing state.
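
One way to wipe a namespace between attempts is to delete it and let the next restore recreate it. The sketch below wraps that in a guard so that a typo cannot remove a system namespace. The protected list and the function name are illustrative assumptions and should be adjusted to the bench.

```shell
# Sketch: delete a namespace before re-running the restore, with a safety guard.

# Namespaces this sketch refuses to delete (an illustrative list, adjust locally).
PROTECTED_NS="default kube-system kube-public kube-node-lease velero platform"

# Delete namespace $1 and block until it is gone, unless it is empty or protected.
wipe_namespace() {
  if [ -z "$1" ]; then
    echo "usage: wipe_namespace NAMESPACE" >&2
    return 1
  fi
  for ns in $PROTECTED_NS; do
    if [ "$1" = "$ns" ]; then
      echo "refusing to delete protected namespace: $1" >&2
      return 1
    fi
  done
  kubectl delete namespace "$1" --wait=true
}

# Usage:
#   wipe_namespace TARGET_NS   # then repeat Step 4 with the earlier backup
```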

Success criteria

- Target namespace pods running
- Smoke tests pass
- Audit log captures restore event
- No alert flapping