ADR 0017 — Backup / Restore / Snapshot / DR strategy
- Status: Accepted (locked 2026-05-08, formalized 2026-05-09)
- Date: 2026-05-08
- Deciders: TLSStress.Art project
- Targets: v4.7+
- Source memo: project_backup_dr_strategy_2026_05_08.md
Context
The bench is the operator's source of truth for compliance-relevant test reports. A pod restart that loses a 4-hour run, an SSD failure that wipes the catalog, or a misconfigured upgrade that corrupts the database is each a high-pain incident with no current recovery path. Self-Upgrade (ADR 0013) already requires a "pre-upgrade snapshot" pre-req; today that snapshot doesn't exist as a first-class system.
User locked 2026-05-08:
"production-grade RTO < 30 min + RPO < 5 min. Without that, we do not meet the bar for professionalism."
Decision
Ship a 4-layer backup/DR architecture with operator-friendly defaults (default-on, sane retention, zero-config restore wizard).
Layer 1 — Continuous backup
Always-on, low-overhead, captures every data change:
- Postgres WAL streaming via WAL-G to an off-pod target (default: in-cluster MinIO bucket; configurable to S3 / GCS / Azure Blob / external MinIO)
- Volume snapshots via Restic on a 5-min interval for the catalog YAML directory + report artefact directory
- Retention: 24h continuous + daily snapshots for 30 days
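The retention rule above can be sketched as a small pruning function. This is a minimal illustration, not the project's implementation: it assumes "24h continuous" means every restore point from the last 24 hours is kept, and "daily snapshots for 30 days" means the newest point per calendar day is kept beyond that. The name `select_retained` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

CONTINUOUS_WINDOW = timedelta(hours=24)  # keep every restore point in this window
DAILY_WINDOW = timedelta(days=30)        # beyond 24h, keep one point per calendar day


def select_retained(points: list[datetime], now: datetime) -> list[datetime]:
    """Return the restore points to keep, oldest first (hypothetical helper)."""
    keep = set()
    newest_per_day = {}
    for p in points:
        age = now - p
        if age < timedelta(0) or age > DAILY_WINDOW:
            continue  # future timestamp or older than 30 days: prune
        if age <= CONTINUOUS_WINDOW:
            keep.add(p)  # continuous tier: keep everything
        else:
            day = p.date()
            # daily tier: assumption is "newest point of each calendar day"
            if day not in newest_per_day or p > newest_per_day[day]:
                newest_per_day[day] = p
    keep.update(newest_per_day.values())
    return sorted(keep)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [now - timedelta(minutes=5 * i) for i in range(1, 1000)]
    print(len(select_retained(sample, now)), "of", len(sample), "points retained")
```

In practice the tools would enforce this themselves (Restic and WAL-G both ship retention commands); the sketch only makes the intended policy explicit.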
Layer 2 — Scheduled snapshots
Operator-triggered or cron-scheduled application-consistent snapshots:
- CSI volume snapshots via the cluster's CSI driver (on k3s, whose default local-path-provisioner has no CSI snapshot support, the snapshot is taken with Restic instead)
- Velero for full namespace backup (PVCs + ConfigMaps + Secrets + custom resources)
- Retention: 90 days operator-set; up to 1 year for compliance customers
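As a rough illustration of the cron-scheduled namespace backup, the sketch below builds a Velero `Schedule` manifest as a Python dict, with the 90-day default expressed as a TTL and the 1-year ceiling enforced. The helper name `velero_schedule`, the `tlsstress` namespace, and the 02:00 cron expression are assumptions for the example, not values from this ADR.

```python
import json


def velero_schedule(name: str, namespace: str, cron: str, retention_days: int = 90) -> dict:
    """Build a Velero Schedule manifest (hypothetical helper, illustrative values)."""
    if not 1 <= retention_days <= 365:
        raise ValueError("retention must be between 1 day and 1 year")
    ttl = f"{retention_days * 24}h0m0s"  # Velero expects a Go-style duration string
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Schedule",
        # Assumption: Velero runs in its conventional "velero" namespace.
        "metadata": {"name": name, "namespace": "velero"},
        "spec": {
            "schedule": cron,
            "template": {
                "includedNamespaces": [namespace],
                "snapshotVolumes": True,
                "ttl": ttl,
            },
        },
    }


if __name__ == "__main__":
    # "tlsstress" is a placeholder namespace name.
    print(json.dumps(velero_schedule("nightly-full", "tlsstress", "0 2 * * *"), indent=2))
```

The JSON output can be applied as-is with `kubectl apply -f -`, since Kubernetes accepts JSON manifests.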
Layer 3 — Restore wizard UI
Dashboard-first restore experience:
- Browse available restore points (continuous + snapshots) with point-in-time selection (5-min granularity for last 24h, daily for last 30 days, weekly for last 90 days)
- Pre-restore impact preview (which deployments restart, which data tables touched, expected duration)
- One-click restore with confirmation modal showing the chosen restore point + estimated RTO
- Post-restore verification (health checks against /api/health + sample report fetch)
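The point-in-time tiers the wizard exposes can be expressed as a lookup from the age of the requested restore target to the available granularity. A minimal sketch, assuming the tiers are evaluated in order and that anything older than 90 days is not restorable through the wizard; `granularity_for` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# (maximum age of the restore target, spacing between selectable restore points)
TIERS = [
    (timedelta(hours=24), timedelta(minutes=5)),  # last 24h: 5-min granularity
    (timedelta(days=30), timedelta(days=1)),      # last 30 days: daily
    (timedelta(days=90), timedelta(weeks=1)),     # last 90 days: weekly
]


def granularity_for(target: datetime, now: datetime) -> Optional[timedelta]:
    """Return the restore-point spacing for a target time, or None if out of range."""
    age = now - target
    if age < timedelta(0):
        return None  # cannot restore to the future
    for max_age, step in TIERS:
        if age <= max_age:
            return step
    return None  # older than the longest retention tier


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for hours in (3, 24 * 10, 24 * 60, 24 * 120):
        print(hours, "h ago ->", granularity_for(now - timedelta(hours=hours), now))
```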
Layer 4 — Off-site DR replication
For compliance-bound deployments:
- Continuous WAL replication to a second MinIO instance in a separate failure domain (different host / different rack / different region)
- Quarterly DR drill — automated cron that spins up a parallel cluster, restores the latest off-site snapshot, runs a smoke-test report build, and emails the result. Failure pages the operator.
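The drill sequence described above (spin up, restore, smoke test, email, page on failure) could be orchestrated along these lines. This is a sketch under stated assumptions: each step is an injected callable that raises on failure, and `run_dr_drill`, `DrillResult`, `email_result`, and `page_operator` are hypothetical names, not existing project APIs.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class DrillResult:
    ok: bool
    failed_step: Optional[str]
    detail: str


def run_dr_drill(
    steps: Sequence[Tuple[str, Callable[[], None]]],
    email_result: Callable[[DrillResult], None],
    page_operator: Callable[[DrillResult], None],
) -> DrillResult:
    """Run drill steps in order; always email the result, page only on failure."""
    result = DrillResult(ok=True, failed_step=None, detail="all steps passed")
    for name, step in steps:
        try:
            step()  # a step signals failure by raising
        except Exception as exc:
            result = DrillResult(ok=False, failed_step=name, detail=str(exc))
            break  # later steps depend on earlier ones, so stop here
    email_result(result)
    if not result.ok:
        page_operator(result)
    return result


if __name__ == "__main__":
    # Placeholder steps; real ones would provision a cluster, restore, and build a report.
    demo_steps = [
        ("spin_up_cluster", lambda: None),
        ("restore_latest_offsite_snapshot", lambda: None),
        ("smoke_test_report_build", lambda: None),
    ]
    print(run_dr_drill(demo_steps, email_result=print, page_operator=print))
```

Injecting the steps keeps the orchestration testable without a real cluster.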
Tooling
| Component | Tool | Why |
|---|---|---|
| Postgres WAL streaming | WAL-G | Self-hosted, S3-compatible, mature |
| Volume snapshots | Restic | Encryption-at-rest, dedup, multi-target |
| Namespace backup | Velero | k8s-native, restores into different cluster |
| Off-site target | MinIO | S3-compatible, self-hostable for airgap |
RTO / RPO commitments
| Scenario | RTO | RPO |
|---|---|---|
| Single pod crash | < 1 min | 0 (no loss; PVC retained) |
| Single node failure | < 5 min | < 5 min (continuous WAL) |
| Cluster failure | < 30 min | < 5 min (off-site replica) |
| Catastrophic site loss | < 4 hours | < 1 hour (off-site DR drill) |
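The commitments in the table can be encoded as data so that a drill or incident review can check measured values against them. A minimal sketch; the scenario keys and the `meets_commitment` helper are hypothetical, and an RPO of 0 is interpreted here as "exactly no data loss".

```python
from datetime import timedelta

# scenario -> (RTO limit, RPO limit), copied from the table above
COMMITMENTS = {
    "single_pod_crash": (timedelta(minutes=1), timedelta(0)),
    "single_node_failure": (timedelta(minutes=5), timedelta(minutes=5)),
    "cluster_failure": (timedelta(minutes=30), timedelta(minutes=5)),
    "catastrophic_site_loss": (timedelta(hours=4), timedelta(hours=1)),
}


def meets_commitment(scenario: str, measured_rto: timedelta, measured_rpo: timedelta) -> bool:
    """Check measured recovery time and data loss against the committed limits."""
    rto_limit, rpo_limit = COMMITMENTS[scenario]
    rto_ok = measured_rto < rto_limit
    if rpo_limit == timedelta(0):
        rpo_ok = measured_rpo == timedelta(0)  # "0" means no loss at all
    else:
        rpo_ok = measured_rpo < rpo_limit
    return rto_ok and rpo_ok


if __name__ == "__main__":
    print(meets_commitment("cluster_failure", timedelta(minutes=22), timedelta(minutes=3)))
```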
Roadmap
| PR | Scope | Lands |
|---|---|---|
| Backup PR-1 | Velero install manifests + Restic setup + WAL-G config (no UI yet) | next sprint |
| Backup PR-2 | Restore wizard UI skeleton (browse + select + confirm) | +1 sprint |
| Backup PR-3 | Pre-restore impact preview | +2 sprints |
| Backup PR-4 | Off-site DR replication setup + quarterly drill cron | +3 sprints |
| Backup PR-5 | Self-Upgrade integration (snapshot before apply, restore on health-check failure) | +3 sprints |
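The Self-Upgrade integration in Backup PR-5 (snapshot before apply, restore on health-check failure) might follow a control flow like the sketch below. The four callables are injected placeholders and `guarded_upgrade` is a hypothetical name; the real integration is defined by ADR 0013 and may differ.

```python
from typing import Callable


def guarded_upgrade(
    take_snapshot: Callable[[], str],
    apply_upgrade: Callable[[], None],
    health_ok: Callable[[], bool],
    restore: Callable[[str], None],
) -> str:
    """Snapshot, upgrade, verify; roll back to the snapshot if anything fails."""
    snapshot_id = take_snapshot()  # if this raises, the upgrade never starts
    try:
        apply_upgrade()
        healthy = health_ok()
    except Exception:
        healthy = False  # a crashed upgrade or health probe counts as unhealthy
    if healthy:
        return "upgraded"
    restore(snapshot_id)
    return "rolled_back"


if __name__ == "__main__":
    outcome = guarded_upgrade(
        take_snapshot=lambda: "snap-demo",  # placeholder snapshot ID
        apply_upgrade=lambda: None,
        health_ok=lambda: True,
        restore=lambda snapshot_id: None,
    )
    print(outcome)
```

Taking the snapshot outside the `try` block is deliberate: a failed snapshot should abort the upgrade rather than trigger a restore from a restore point that does not exist.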
Consequences
- Positive: meets compliance bar (SOC 2 / ISO 27001 require documented backup + tested DR — see ADR 0018); makes Self-Upgrade PR-5 rollback path trustworthy.
- Negative: operational complexity (4 tools); storage cost (continuous WAL + snapshots).
- Reversible: each layer is independent; operators can ship Layer 1+2 only and skip 3+4 if scale/budget demands.
References
- Source memo: project_backup_dr_strategy_2026_05_08.md
- Cross-ref: ADR 0013 (Self-Upgrade; uses Layer 2 snapshots for rollback)
- Cross-ref: ADR 0018 (Enterprise compliance; requires documented + tested DR)
- Cross-ref: project_quality_excellence_policy_2026_05_08.md