
Restore from Backup — Standard Operation

Runbook for restoring a single namespace or the full bench from backup. Pairs with ADR 0017.

Goal

Restore bench state to a previous known-good point. Two scenarios:

  • Partial: single namespace (e.g. web-agents, validator)
  • Full: entire bench (DR-grade; use the DR drill runbook)

Prerequisites

  • Admin role
  • Recent backup available (velero backup get shows a recent backup with status Completed)
  • Maintenance window appropriate to restore scope
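
The prerequisites do not say how old a "recent" backup may be. A minimal sketch, assuming GNU date and a maximum age chosen by the operator, that checks a completion timestamp against a limit in hours. The helper name backup_age_ok is hypothetical:

```shell
# backup_age_ok COMPLETION_TIMESTAMP MAX_AGE_HOURS
# Hypothetical helper: succeeds when the timestamp is younger than the limit.
# Requires GNU date (-d parses timestamps such as 2026-05-11T02:00:12Z).
backup_age_ok() {
  completed_s=$(date -u -d "$1" +%s) || return 1
  now_s=$(date -u +%s)
  age_h=$(( (now_s - completed_s) / 3600 ))
  echo "backup age: ${age_h}h (limit ${2}h)"
  [ "$age_h" -lt "$2" ]
}
```

The timestamp is the value of .status.completionTimestamp in the output of velero backup get --output json, for example: backup_age_ok 2026-05-11T02:00:12Z 24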

Partial restore (single namespace)

Step 1 — List available backups for the namespace

velero backup get --output json | \
  jq -c '.items[] | select((.spec.includedNamespaces // ["*"]) | any(. == "TARGET_NS" or . == "*")) | {name: .metadata.name, started: .status.startTimestamp}' | \
  head -10

Step 2 — Pick a backup + preview

velero backup describe BACKUP_NAME --details | head -50

Step 3 — Stop active workloads in the target namespace

kubectl scale --all deployment -n TARGET_NS --replicas=0

Step 4 — Restore

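# Velero skips resources that already exist in the cluster. Delete what must be
# replaced first, or add --existing-resource-policy update (Velero 1.9+).
velero restore create restore-$(date +%Y%m%d-%H%M) \
  --from-backup BACKUP_NAME \
  --include-namespaces TARGET_NS \
  --wait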
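
The --wait flag blocks until the restore finishes, but the resulting phase still has to be checked. A minimal sketch that reads the phase from the Restore custom resource, assuming Velero runs in the velero namespace (override with VELERO_NS). The helper name restore_succeeded is hypothetical:

```shell
# restore_succeeded RESTORE_NAME
# Hypothetical helper: succeeds only when the restore phase is Completed.
# Assumes Velero is installed in the "velero" namespace unless VELERO_NS is set.
restore_succeeded() {
  phase=$(kubectl get restores.velero.io "$1" \
    -n "${VELERO_NS:-velero}" -o jsonpath='{.status.phase}')
  echo "restore $1 phase: ${phase:-unknown}"
  [ "$phase" = "Completed" ]
}
```

For any other phase, velero restore describe RESTORE_NAME --details and velero restore logs RESTORE_NAME show which resources were skipped or failed.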

Step 5 — Verify + scale back up

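kubectl get pods -n TARGET_NS
# --replicas=1 applies to every deployment; reapply the original counts where they differ
kubectl scale --all deployment -n TARGET_NS --replicas=1
kubectl wait --for=condition=Available deployment --all -n TARGET_NS --timeout=300s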
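
Scaling every deployment to one replica loses the counts that were in place before Step 3. A minimal sketch that records them before the scale-down and reapplies them afterwards. The helper names and the file path are hypothetical:

```shell
# save_replicas NAMESPACE FILE
# Run before Step 3: writes one "name replicas" line per deployment.
save_replicas() {
  kubectl get deployment -n "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.replicas}{"\n"}{end}' > "$2"
}

# restore_replicas NAMESPACE FILE
# Run in Step 5: scales each recorded deployment back to its saved count.
restore_replicas() {
  while read -r name replicas; do
    [ -n "$name" ] || continue
    kubectl scale deployment "$name" -n "$1" --replicas="$replicas"
  done < "$2"
}
```

Example: save_replicas TARGET_NS /tmp/TARGET_NS.replicas before scaling down, then restore_replicas TARGET_NS /tmp/TARGET_NS.replicas after the restore.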

Full bench restore

See DR drill runbook — uses the same procedure but in a controlled maintenance window with operator notification + DR drill audit trail.

PostgreSQL point-in-time recovery (PITR)

For database-specific restore (e.g. operator dropped a table):

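# Assumes the container keeps running after pg_ctl stop (postgres is not its main process).
kubectl exec -it postgres-0 -n platform -- bash -c "
  pg_ctl stop -m fast
  rm -rf /var/lib/postgresql/data/*
  # LATEST must predate the recovery target; otherwise pass an older name from wal-g backup-list
  wal-g backup-fetch /var/lib/postgresql/data LATEST
  # PostgreSQL 12+: recovery settings go in the server configuration; recovery.signal stays empty
  cat >> /var/lib/postgresql/data/postgresql.auto.conf <<EOF
restore_command = 'wal-g wal-fetch %f %p'
recovery_target_time = '2026-05-11 12:00:00'
recovery_target_action = 'pause'
EOF
  touch /var/lib/postgresql/data/recovery.signal
  pg_ctl start
"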

After verifying the data state, promote:

kubectl exec -it postgres-0 -n platform -- pg_ctl promote
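
The verification that precedes promotion can be sketched as follows; run it before the pg_ctl promote command above. It assumes psql is available in the container and that the superuser is named postgres. The helper name pitr_status is hypothetical:

```shell
# pitr_status
# Hypothetical helper: prints whether the server is still in recovery and the
# commit timestamp of the last replayed transaction. Succeeds only while the
# server is in recovery, i.e. before promotion.
pitr_status() {
  out=$(kubectl exec postgres-0 -n platform -- \
    psql -U postgres -At \
    -c "SELECT pg_is_in_recovery(), pg_last_xact_replay_timestamp()") || return 1
  echo "in_recovery|last_replayed_commit: $out"
  case "$out" in
    t\|*) return 0 ;;
    *)    return 1 ;;
  esac
}
```

If the last replayed commit is later than intended, promoting makes that state permanent; repeat the recovery with an earlier recovery_target_time instead.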

Rollback

If the restore turns out incorrect:

  1. Take a snapshot of the post-restore state (for forensics)
  2. Restore from a different backup (an earlier point)

Never run kubectl apply over a partially-restored namespace. Wipe the namespace between restore attempts so that state from different backups does not mix.
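
The rollback sequence can be sketched as a single function: snapshot the current state, wipe the namespace, then restore from the earlier backup. The function name and the forensic- and restore- name prefixes are hypothetical, and deleting the namespace is destructive, so treat this as an outline to adapt:

```shell
# retry_restore NAMESPACE EARLIER_BACKUP
# Hypothetical helper: stops at the first failing step.
retry_restore() {
  ns=$1
  earlier_backup=$2
  stamp=$(date +%Y%m%d-%H%M)
  # 1. Keep the post-restore state for forensics.
  velero backup create "forensic-$ns-$stamp" --include-namespaces "$ns" --wait || return 1
  # 2. Wipe the namespace so the next restore starts from a clean slate.
  kubectl delete namespace "$ns" --wait=true || return 1
  # 3. Restore from the earlier backup.
  velero restore create "restore-$ns-$stamp" \
    --from-backup "$earlier_backup" --include-namespaces "$ns" --wait
}
```

Example: retry_restore TARGET_NS EARLIER_BACKUP_NAME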

Success criteria

  • Target namespace pods running
  • Smoke tests pass
  • Audit log captures restore event
  • No alert flap
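
The first criterion can be checked mechanically. A minimal sketch that lists pods whose phase is neither Running nor Succeeded; the helper name pods_healthy is hypothetical, and the other criteria (smoke tests, audit log, alerts) depend on tooling this runbook does not describe:

```shell
# pods_healthy NAMESPACE
# Hypothetical helper: fails and lists the offenders when any pod is in a
# phase other than Running or Succeeded.
pods_healthy() {
  bad=$(kubectl get pods -n "$1" \
    --field-selector='status.phase!=Running,status.phase!=Succeeded' -o name)
  if [ -n "$bad" ]; then
    echo "pods not running in $1:"
    echo "$bad"
    return 1
  fi
  echo "all pods in $1 are Running or Succeeded"
}
```

A pod in phase Running can still have containers that are not ready, so kubectl get pods -n TARGET_NS remains worth a manual look.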