ADR 0017 — Backup / Restore / Snapshot / DR strategy

  • Status: Accepted (locked 2026-05-08, formalized 2026-05-09)
  • Date: 2026-05-08
  • Deciders: TLSStress.Art project
  • Targets: v4.7+
  • Source memo: project_backup_dr_strategy_2026_05_08.md

Context

The bench is the operator's source of truth for compliance-relevant test reports. Three incidents are each high-pain and currently have no recovery path: a pod restart that loses a 4-hour run, an SSD failure that wipes the catalog, and a misconfigured upgrade that corrupts the database. Self-Upgrade (ADR 0013) already lists a "pre-upgrade snapshot" as a prerequisite, but today no first-class system produces that snapshot.

Requirement locked by the user on 2026-05-08:

"production-grade RTO < 30 min + RPO < 5 min. Without this, we do not meet the bar for professionalism."

Decision

Ship a 4-layer backup/DR architecture with operator-friendly defaults (default-on, sane retention, zero-config restore wizard).

Layer 1 — Continuous backup

Always-on, low-overhead, captures every data change:

  • Postgres WAL streaming via WAL-G to an off-pod target (default: in-cluster MinIO bucket; configurable to S3 / GCS / Azure Blob / external MinIO)
  • Volume snapshots via Restic on a 5-min interval for the catalog YAML directory + report artefact directory
  • Retention: 24h continuous + daily snapshots for 30 days
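The retention rule above can be sketched as a pruning function. This is a minimal illustration, not the WAL-G or Restic implementation: the function name `select_retained` is hypothetical, and it assumes that "daily" means keeping the newest restore point of each UTC calendar day once a point is older than 24 hours.

```python
from datetime import datetime, timedelta, timezone

CONTINUOUS_WINDOW = timedelta(hours=24)  # keep every restore point in this window
DAILY_WINDOW = timedelta(days=30)        # beyond that, keep one point per day

def select_retained(points, now):
    """Return the restore-point timestamps to keep, oldest first.

    Assumption: the daily tier keeps the newest point of each calendar day.
    """
    keep = set()
    newest_per_day = {}
    for ts in points:
        age = now - ts
        if age <= CONTINUOUS_WINDOW:
            keep.add(ts)  # continuous tier: keep everything
        elif age <= DAILY_WINDOW:
            day = ts.date()
            if day not in newest_per_day or ts > newest_per_day[day]:
                newest_per_day[day] = ts
        # anything older than DAILY_WINDOW is dropped
    keep.update(newest_per_day.values())
    return sorted(keep)

# Usage: 50 hours of 5-minute restore points collapse to the last 24h
# plus one point for each older day.
now = datetime(2026, 5, 9, 12, 0, tzinfo=timezone.utc)
points = [now - timedelta(minutes=5 * i) for i in range(600)]
kept = select_retained(points, now)
```

In a real deployment the equivalent policy would more likely be expressed through the backup tools' own retention options than through custom code.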

Layer 2 — Scheduled snapshots

Operator-triggered or cron-scheduled application-consistent snapshots:

  • CSI volume snapshots via the cluster's CSI driver (on the k3s default local-path-provisioner, which has no CSI snapshot support, volumes are captured with Restic instead)
  • Velero for full namespace backup (PVCs + ConfigMaps + Secrets + custom resources)
  • Retention: 90 days operator-set; up to 1 year for compliance customers

Layer 3 — Restore wizard UI

Dashboard-first restore experience:

  • Browse available restore points (continuous + snapshots) with point-in-time selection (5-min granularity for last 24h, daily for last 30 days, weekly for last 90 days)
  • Pre-restore impact preview (which deployments restart, which data tables touched, expected duration)
  • One-click restore with confirmation modal showing the chosen restore point + estimated RTO
  • Post-restore verification (health checks against /api/health + sample report fetch)
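The point-in-time tiers offered by the wizard can be read as a lookup from restore-point age to selectable step. The sketch below is one possible encoding; `TIERS` and `granularity_for` are hypothetical names, and treating each boundary as inclusive is an assumption.

```python
from datetime import timedelta
from typing import Optional

# (maximum age of the restore point, selectable step), from the wizard's tiers
TIERS = [
    (timedelta(hours=24), timedelta(minutes=5)),  # last 24h: 5-min granularity
    (timedelta(days=30), timedelta(days=1)),      # last 30 days: daily
    (timedelta(days=90), timedelta(weeks=1)),     # last 90 days: weekly
]

def granularity_for(age: timedelta) -> Optional[timedelta]:
    """Return the selectable step for a restore point of this age.

    Returns None when the age is beyond the oldest tier (nothing to offer).
    """
    if age < timedelta(0):
        raise ValueError("age must be non-negative")
    for max_age, step in TIERS:
        if age <= max_age:
            return step
    return None
```

A UI could use the returned step to decide how finely the time picker snaps for the range the operator is browsing.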

Layer 4 — Off-site DR replication

For compliance-bound deployments:

  • Continuous WAL replication to a second MinIO instance in a separate failure domain (different host / different rack / different region)
  • Quarterly DR drill — automated cron that spins up a parallel cluster, restores the latest off-site snapshot, runs a smoke-test report build, and emails the result. Failure pages the operator.
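The drill sequence described above (provision, restore, smoke test, email, page on failure) might be orchestrated roughly as follows. This is a sketch under stated assumptions: `run_drill`, `notify`, and `page` are hypothetical names, and the real steps would call out to the cluster tooling rather than plain Python callables.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]  # (step name, action that raises on failure)

def run_drill(steps: List[Step],
              notify: Callable[[str], None],
              page: Callable[[str], None]) -> bool:
    """Run the drill steps in order and report the outcome.

    The result is always sent through notify (the email); page is called
    only when a step fails. Later steps are skipped after a failure.
    """
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            message = f"DR drill FAILED at step '{name}': {exc}"
            notify(message)
            page(message)
            return False
    notify("DR drill passed: " + ", ".join(name for name, _ in steps))
    return True
```

Passing the actions in as callables keeps the ordering and alerting logic testable without a second cluster.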

Tooling

Component              | Tool   | Why
Postgres WAL streaming | WAL-G  | Self-hosted, S3-compatible, mature
Volume snapshots       | Restic | Encryption-at-rest, dedup, multi-target
Namespace backup       | Velero | k8s-native, restores into different cluster
Off-site target        | MinIO  | S3-compatible, self-hostable for airgap

RTO / RPO commitments

Scenario               | RTO       | RPO
Single pod crash       | < 1 min   | 0 (no loss; PVC retained)
Single node failure    | < 5 min   | < 5 min (continuous WAL)
Cluster failure        | < 30 min  | < 5 min (off-site replica)
Catastrophic site loss | < 4 hours | < 1 hour (off-site DR drill)
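The commitments table can be encoded as data so that a drill or incident report is checked against it mechanically. The sketch below is illustrative only; the scenario keys and the `meets_commitment` helper are hypothetical names, not part of the project.

```python
from datetime import timedelta

# scenario -> (RTO bound, RPO bound), copied from the commitments table
COMMITMENTS = {
    "single_pod_crash": (timedelta(minutes=1), timedelta(0)),
    "single_node_failure": (timedelta(minutes=5), timedelta(minutes=5)),
    "cluster_failure": (timedelta(minutes=30), timedelta(minutes=5)),
    "catastrophic_site_loss": (timedelta(hours=4), timedelta(hours=1)),
}

def meets_commitment(scenario: str,
                     recovery_time: timedelta,
                     data_loss: timedelta) -> bool:
    """Check a measured recovery against the committed RTO and RPO.

    Bounds are strict ("<"), except an RPO of 0, which requires zero loss.
    """
    rto, rpo = COMMITMENTS[scenario]
    if rpo == timedelta(0):
        rpo_ok = data_loss == timedelta(0)
    else:
        rpo_ok = data_loss < rpo
    return recovery_time < rto and rpo_ok
```

For example, a cluster failure recovered in 22 minutes with 3 minutes of lost writes satisfies the commitment, while a pod crash that loses any data does not.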

Roadmap

PR          | Scope                                                                                 | Lands
Backup PR-1 | Velero install manifests + Restic setup + WAL-G config (no UI yet)                    | next sprint
Backup PR-2 | Restore wizard UI skeleton (browse + select + confirm)                                | +1 sprint
Backup PR-3 | Pre-restore impact preview                                                            | +2 sprints
Backup PR-4 | Off-site DR replication setup + quarterly drill cron                                  | +3 sprints
Backup PR-5 | Self-Upgrade integration (snapshot before apply, restore on health-check failure)     | +3 sprints
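The Backup PR-5 scope (snapshot before apply, restore on health-check failure) follows a simple control flow, sketched here. All four callables and the function name `upgrade_with_rollback` are hypothetical stand-ins; the actual integration is defined by ADR 0013 and may differ.

```python
from typing import Callable

def upgrade_with_rollback(take_snapshot: Callable[[], str],
                          apply_upgrade: Callable[[], None],
                          is_healthy: Callable[[], bool],
                          restore: Callable[[str], None]) -> str:
    """Snapshot, upgrade, then verify; restore the snapshot if anything fails.

    If take_snapshot itself raises, the error propagates and the upgrade
    is never attempted, so there is always a restore point to fall back to.
    """
    snapshot_id = take_snapshot()
    try:
        apply_upgrade()
        healthy = is_healthy()
    except Exception:
        healthy = False  # a crashed upgrade or health check counts as unhealthy
    if healthy:
        return "upgraded"
    restore(snapshot_id)
    return "rolled_back"
```

Taking the snapshot outside the `try` block is the key design choice: an upgrade is only ever applied when a rollback target already exists.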

Consequences

  • Positive: meets compliance bar (SOC 2 / ISO 27001 require documented backup + tested DR — see ADR 0018); makes Self-Upgrade PR-5 rollback path trustworthy.
  • Negative: operational complexity (4 tools); storage cost (continuous WAL + snapshots).
  • Reversible: each layer is independent; operators can ship Layer 1+2 only and skip 3+4 if scale/budget demands.

References

  • Source memo: project_backup_dr_strategy_2026_05_08.md
  • Cross-ref: ADR 0013 (Self-Upgrade — uses Layer 2 snapshots for rollback)
  • Cross-ref: ADR 0018 (Enterprise compliance — requires documented + tested DR)
  • Cross-ref: project_quality_excellence_policy_2026_05_08.md