Backup / Restore / DR — Operator Guide¶
Implements ADR 0017. This guide is the operator-facing how-to for the foundation manifests shipped in
k8s/backup/. The full restore-wizard UI lands in Backup PR-2; this guide documents the CLI fallback path that works today.

Scope status (post-Scope-Freeze 2026-05-10) — operational runbooks for this guide live at:

- runbooks/dr-drill.md — quarterly DR drill (RTO < 30 min, RPO < 5 min)
- runbooks/restore-from-backup.md — partial namespace or full bench restore
- runbooks/soc2-evidence-collection.md — compliance evidence collection

Cross-refs: ADR 0018 (compliance), VALIDATOR.Art rebuild runbook.
4-layer architecture (recap)¶
Layer 1 — Continuous backup
WAL-G (Postgres) every 5 min
Restic (catalog YAML) every 5 min
→ in-cluster MinIO
Retention: 24h continuous + 30 daily snapshots
Layer 2 — Scheduled snapshots
Velero (full namespace) nightly 02:00 UTC + weekly Sunday 03:00 UTC
→ in-cluster MinIO
Retention: 30 days nightly, 90 days weekly
Layer 3 — Restore wizard UI (lands in Backup PR-2)
Browse + select restore point + impact preview + one-click restore
Layer 4 — Off-site DR replication (lands in Backup PR-4)
Continuous WAL replication to a second MinIO instance (separate
failure domain)
Quarterly DR drill — automated
Bootstrap (one-time, ~10 minutes)¶
1. Create the backup namespace + MinIO target¶
kubectl apply -f k8s/backup/00-namespace.yaml
2. Set MinIO root credentials¶
The shipped manifest has placeholder credentials (REPLACE_BEFORE_APPLY_*).
Replace them with real values:
ACCESS_KEY=$(openssl rand -hex 16)
SECRET_KEY=$(openssl rand -hex 32)
kubectl -n web-agents-backup create secret generic minio-root-credentials \
--from-literal=accesskey="$ACCESS_KEY" \
--from-literal=secretkey="$SECRET_KEY" \
--dry-run=client -o yaml | kubectl apply -f -
Save the credentials in a secrets manager (1Password, Bitwarden, HashiCorp Vault) — they're needed for the Velero credential below.
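Optional: a quick local sanity check on key shape before you store them. This just re-runs the same openssl generation from above; no cluster access is needed.

```shell
# openssl rand -hex N emits 2*N hex characters, so the lengths
# below should be 32 and 64 respectively.
ACCESS_KEY=$(openssl rand -hex 16)
SECRET_KEY=$(openssl rand -hex 32)
[ "${#ACCESS_KEY}" -eq 32 ] && echo "access key: OK"
[ "${#SECRET_KEY}" -eq 64 ] && echo "secret key: OK"
```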
3. Apply the MinIO deployment + bucket init¶
kubectl apply -f k8s/backup/10-minio-target.yaml
kubectl -n web-agents-backup wait --for=condition=Ready pod -l app=minio --timeout=300s
kubectl apply -f k8s/backup/20-bucket-init.yaml
The bucket-init Job creates the three buckets (velero, restic,
postgres-wal) with versioning enabled on the two that benefit
from it. Idempotent — re-running is safe.
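To double-check the Job's result, you can list each bucket through the MinIO pod. Like the verify step later in this guide, this assumes the MinIO image ships the mc client with a `minio` alias configured.

```shell
# All three buckets should list without error after bucket-init.
for b in velero restic postgres-wal; do
  kubectl -n web-agents-backup exec deploy/minio -- mc ls "minio/$b" >/dev/null \
    && echo "bucket $b: OK"
done
```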
4. Install Velero¶
Velero itself is a separate install (its CRDs aren't shipped in this repo to avoid coupling our release cadence to theirs). Use the upstream CLI:
# Replace ACCESS_KEY + SECRET_KEY with the values from step 2.
cat > /tmp/credentials-velero <<EOF
[default]
aws_access_key_id = $ACCESS_KEY
aws_secret_access_key = $SECRET_KEY
EOF
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.10.0 \
--bucket velero \
--secret-file /tmp/credentials-velero \
--use-volume-snapshots=false \
--use-node-agent \
--backup-location-config region=us-east-1,s3ForcePathStyle=true,s3Url=http://minio.web-agents-backup.svc.cluster.local:9000
# Wait for Velero pods to be ready.
kubectl -n velero wait --for=condition=Ready pod -l deploy=velero --timeout=300s
shred -u /tmp/credentials-velero
5. Apply the project-specific Velero resources¶
# Edit 30-velero-config.yaml first to replace REPLACE_BEFORE_APPLY_*
# placeholders with the same ACCESS_KEY + SECRET_KEY from step 2.
kubectl apply -f k8s/backup/30-velero-config.yaml
This installs:
- BackupStorageLocation/default — points Velero at the MinIO bucket
- VolumeSnapshotLocation/default — uses the same target for snapshots
- Schedule/web-agents-nightly — nightly backup at 02:00 UTC, 30-day retention
- Schedule/web-agents-weekly-full — weekly full backup at Sunday 03:00 UTC, 90-day retention
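A quick check that the schedules registered. Velero expresses retention as a TTL in hours (30 days = 720h, 90 days = 2160h), and the cron expressions shown should match the times above.

```shell
# Both schedules should appear, with these crons and TTLs:
#   web-agents-nightly      0 2 * * *   ttl 720h0m0s  (30 days)
#   web-agents-weekly-full  0 3 * * 0   ttl 2160h0m0s (90 days)
velero schedule get
# The storage location should report "Available" once Velero has
# validated access to the MinIO bucket.
kubectl -n velero get backupstoragelocation default -o jsonpath='{.status.phase}'
```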
6. Verify¶
# Trigger a manual backup
velero backup create test-bootstrap-$(date +%s) \
--include-namespaces web-agents \
--wait
# List backups
velero backup get
# Check the bucket directly
kubectl -n web-agents-backup exec -it deploy/minio -- \
mc ls minio/velero/
You should see at least one .tar.gz tarball and a .json metadata
file under velero/backups/.
Restore (CLI fallback — wizard UI ships in Backup PR-2)¶
# List restore points
velero backup get
# Restore the most recent backup in place (into its original namespace)
velero restore create restore-$(date +%s) \
--from-backup web-agents-nightly-20260509-020000 \
--wait
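To validate a restore side-by-side instead of restoring in place, Velero can remap the namespace on the way in; the target namespace name here is illustrative.

```shell
# Restore into web-agents-restored rather than the original namespace.
velero restore create restore-$(date +%s) \
  --from-backup web-agents-nightly-20260509-020000 \
  --namespace-mappings web-agents:web-agents-restored \
  --wait
```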
For a point-in-time restore (continuous WAL streaming), use WAL-G's restore command — see WAL-G docs for the full syntax.
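As a rough sketch of what the WAL-G path looks like (Postgres 12+; assumes wal-g is already configured for the postgres-wal bucket via its WALG_S3_PREFIX / AWS_* environment variables; the data-directory path and target time are illustrative — defer to the WAL-G docs for authoritative syntax):

```shell
# Stop Postgres, then fetch the latest base backup.
PGDATA=/var/lib/postgresql/data
wal-g backup-fetch "$PGDATA" LATEST

# Configure recovery to replay WAL up to the chosen point in time.
cat >> "$PGDATA/postgresql.conf" <<'EOF'
restore_command = 'wal-g wal-fetch %f %p'
recovery_target_time = '2026-05-09 01:55:00 UTC'
EOF
touch "$PGDATA/recovery.signal"
# Start Postgres; it replays WAL to the target, then pauses (the
# default recovery_target_action) so you can inspect before promoting.
```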
Disaster Recovery drill (quarterly)¶
The full drill (Layer 4 with off-site replication) lands in Backup PR-4. For now, the manual drill checklist:
- Spin up a parallel cluster on different hardware
- Apply the backup namespace + MinIO + bucket-init manifests
- Sync the MinIO bucket from production (one-time mc mirror)
- Install Velero pointing at the restored bucket
- Run velero restore create dr-drill-$(date +%s) --from-backup latest-weekly --wait
- Validate: dashboard /api/health returns 200, sample report renders
- Document RTO + RPO observed; file in compliance artifact log
- Tear down the parallel cluster
Cadence: every 90 days. Failure pages the on-call operator.
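The one-time bucket sync in the checklist above can be done with mc mirror; the alias names, endpoints, and credential variables below are illustrative.

```shell
# Register aliases for both MinIO instances (endpoints illustrative).
mc alias set prod  http://minio.prod.internal:9000  "$PROD_ACCESS_KEY"  "$PROD_SECRET_KEY"
mc alias set drill http://minio.drill.internal:9000 "$DRILL_ACCESS_KEY" "$DRILL_SECRET_KEY"
# One-time mirror of all three buckets into the drill cluster.
for b in velero restic postgres-wal; do
  mc mirror --overwrite "prod/$b" "drill/$b"
done
```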
What's NOT covered today¶
- Restore wizard UI — Backup PR-2 (this guide documents the CLI fallback)
- Pre-restore impact preview — Backup PR-3
- Off-site DR replication — Backup PR-4
- Self-Upgrade rollback integration — Backup PR-5 (cross-ref ADR 0013)
- Encryption at rest with a customer-managed KMS — out of scope for v4.7; in-cluster MinIO uses bucket-level encryption with a key Velero generates