
Backup / Restore / DR — Operator Guide

Implements ADR 0017. This guide is the operator-facing how-to for the foundation manifests shipped in k8s/backup/. The full restore-wizard UI lands in Backup PR-2; this guide documents the CLI fallback path that works today.

Scope status (post-Scope-Freeze 2026-05-10) — operational runbooks for this guide live at:

  • runbooks/dr-drill.md — quarterly DR drill (RTO < 30 min, RPO < 5 min)
  • runbooks/restore-from-backup.md — partial-namespace or full bench restore
  • runbooks/soc2-evidence-collection.md — compliance evidence collection

Cross-refs: ADR 0018 (compliance), VALIDATOR.Art rebuild runbook.

4-layer architecture (recap)

Layer 1 — Continuous backup
   WAL-G (Postgres)         every 5 min
   Restic (catalog YAML)    every 5 min
   →  in-cluster MinIO
   Retention: 24h continuous + 30 daily snapshots

Layer 2 — Scheduled snapshots
   Velero (full namespace)  nightly 02:00 UTC + weekly Sunday 03:00 UTC
   →  in-cluster MinIO
   Retention: 30 days nightly, 90 days weekly

Layer 3 — Restore wizard UI    (lands in Backup PR-2)
   Browse + select restore point + impact preview + one-click restore

Layer 4 — Off-site DR replication   (lands in Backup PR-4)
   Continuous WAL replication to a second MinIO instance (separate
   failure domain)
   Quarterly DR drill — automated

Bootstrap (one-time, ~10 minutes)

1. Create the backup namespace + MinIO target

kubectl apply -f k8s/backup/00-namespace.yaml

2. Set MinIO root credentials

The shipped manifest has placeholder credentials (REPLACE_BEFORE_APPLY_*). Replace them with real values:

ACCESS_KEY=$(openssl rand -hex 16)
SECRET_KEY=$(openssl rand -hex 32)
kubectl -n web-agents-backup create secret generic minio-root-credentials \
  --from-literal=accesskey="$ACCESS_KEY" \
  --from-literal=secretkey="$SECRET_KEY" \
  --dry-run=client -o yaml | kubectl apply -f -
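To sanity-check that the secret landed, you can decode it back out of the cluster (names exactly as created above):

```shell
# Decode the stored access key; the output should match $ACCESS_KEY.
kubectl -n web-agents-backup get secret minio-root-credentials \
  -o jsonpath='{.data.accesskey}' | base64 -d; echo
```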

Save the credentials in a secrets manager (1Password, Bitwarden, HashiCorp Vault) — they're needed again for the Velero credentials file in step 4.

3. Apply the MinIO deployment + bucket init

kubectl apply -f k8s/backup/10-minio-target.yaml
kubectl -n web-agents-backup wait --for=condition=Ready pod -l app=minio --timeout=300s
kubectl apply -f k8s/backup/20-bucket-init.yaml

The bucket-init Job creates the three buckets (velero, restic, postgres-wal), with versioning enabled on the two that benefit from it. The Job is idempotent; re-running it is safe.
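You can confirm the buckets exist by listing them from inside the MinIO pod, the same way the verification step later in this guide does (this assumes the `minio` alias is pre-configured in the image):

```shell
# Should list the velero, restic and postgres-wal buckets.
kubectl -n web-agents-backup exec deploy/minio -- mc ls minio/
```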

4. Install Velero

Velero itself is a separate install (its CRDs aren't shipped in this repo to avoid coupling our release cadence to theirs). Use the upstream CLI:

# Assumes $ACCESS_KEY and $SECRET_KEY are still set from step 2;
# otherwise substitute the values you saved.
cat > /tmp/credentials-velero <<EOF
[default]
aws_access_key_id = $ACCESS_KEY
aws_secret_access_key = $SECRET_KEY
EOF

velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.10.0 \
  --bucket velero \
  --secret-file /tmp/credentials-velero \
  --use-volume-snapshots=false \
  --use-node-agent \
  --backup-location-config region=us-east-1,s3ForcePathStyle=true,s3Url=http://minio.web-agents-backup.svc.cluster.local:9000

# Wait for Velero pods to be ready.
kubectl -n velero wait --for=condition=Ready pod -l deploy=velero --timeout=300s

shred -u /tmp/credentials-velero
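Before moving on, it's worth confirming Velero can actually reach the bucket. This is the standard Velero CLI; the `default` location name comes from the install above:

```shell
# The default backup location's PHASE column should read "Available".
velero backup-location get
```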

5. Apply the project-specific Velero resources

# Edit 30-velero-config.yaml first to replace REPLACE_BEFORE_APPLY_*
# placeholders with the same ACCESS_KEY + SECRET_KEY from step 2.
kubectl apply -f k8s/backup/30-velero-config.yaml

This installs:

  • BackupStorageLocation/default — points Velero at the MinIO bucket
  • VolumeSnapshotLocation/default — uses the same target for snapshots
  • Schedule/web-agents-nightly — nightly backup at 02:00 UTC, 30-day retention
  • Schedule/web-agents-weekly-full — weekly full backup at Sunday 03:00 UTC, 90-day retention
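A quick way to confirm the two schedules registered (standard Velero CLI):

```shell
# Should list web-agents-nightly and web-agents-weekly-full
# with their cron expressions and TTLs.
velero schedule get
```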

6. Verify

# Trigger a manual backup
velero backup create test-bootstrap-$(date +%s) \
  --include-namespaces web-agents \
  --wait

# List backups
velero backup get

# Check the bucket directly
kubectl -n web-agents-backup exec -it deploy/minio -- \
  mc ls minio/velero/

You should see at least one .tar.gz tarball and a .json metadata file under velero/backups/.

Restore (CLI fallback — wizard UI ships in Backup PR-2)

# List restore points
velero backup get

# Restore the most recent backup into a fresh namespace
velero restore create restore-$(date +%s) \
  --from-backup web-agents-nightly-20260509-020000 \
  --wait
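To inspect the restore's outcome, Velero's standard subcommands apply — substitute the name the create command printed:

```shell
# List all restores and their status.
velero restore get

# Per-resource detail for one restore (<restore-name> is a placeholder).
velero restore describe <restore-name> --details
```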

For a point-in-time restore (continuous WAL streaming), use WAL-G's restore command — see WAL-G docs for the full syntax.
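As a rough sketch of what a WAL-G point-in-time restore looks like — the data-directory path, S3 prefix, and target timestamp below are illustrative assumptions, not values from this repo; consult the WAL-G docs before running anything like this:

```shell
# Fetch the latest base backup into an empty data directory (illustrative path).
export WALG_S3_PREFIX=s3://postgres-wal     # assumed bucket name from bucket-init
wal-g backup-fetch /var/lib/postgresql/data LATEST

# Tell Postgres to replay WAL up to the target time, then signal recovery mode.
cat >> /var/lib/postgresql/data/postgresql.conf <<'EOF'
restore_command = 'wal-g wal-fetch %f %p'
recovery_target_time = '2026-05-09 01:55:00 UTC'
EOF
touch /var/lib/postgresql/data/recovery.signal
```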

Disaster Recovery drill (quarterly)

The full drill (Layer 4 with off-site replication) lands in Backup PR-4. For now, the manual drill checklist:

  1. Spin up a parallel cluster on different hardware
  2. Apply the backup namespace + MinIO + bucket-init manifests
  3. Sync the MinIO bucket from production (one-time mc mirror)
  4. Install Velero pointing at the restored bucket
  5. Run velero restore create dr-drill-$(date +%s) --from-backup latest-weekly --wait
  6. Validate: dashboard /api/health returns 200, sample report renders
  7. Document RTO + RPO observed; file in compliance artifact log
  8. Tear down the parallel cluster
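Step 3 of the checklist (the one-time bucket sync) can be sketched with standard mc commands — both endpoints and the credential variables are illustrative placeholders:

```shell
# Register both MinIO instances (placeholder endpoints and credentials).
mc alias set prod  http://minio.prod.example:9000  "$PROD_ACCESS_KEY"  "$PROD_SECRET_KEY"
mc alias set drill http://minio.drill.example:9000 "$DRILL_ACCESS_KEY" "$DRILL_SECRET_KEY"

# One-time mirror of each backup bucket into the drill cluster.
for bucket in velero restic postgres-wal; do
  mc mirror --overwrite "prod/$bucket" "drill/$bucket"
done
```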

Cadence: every 90 days. Failure pages the on-call operator.

What's NOT covered today

  • Restore wizard UI — Backup PR-2 (this guide documents the CLI fallback)
  • Pre-restore impact preview — Backup PR-3
  • Off-site DR replication — Backup PR-4
  • Self-Upgrade rollback integration — Backup PR-5 (cross-ref ADR 0013)
  • Encryption at rest with customer-managed KMS — out of scope for v4.7; in-cluster MinIO uses bucket-level encryption with a key Velero generates

See also:

  • ADR 0017 — full design rationale
  • ADR 0013 — Self-Upgrade uses Layer 2 snapshots for the rollback path
  • ADR 0018 — enterprise compliance bar requires documented + tested DR