Disaster Recovery: Restore Runbook (Hetzner · Cloudflare · AWS from R2)
Audience: the operator (or a successor) rebuilding tlsstress.art after the loss of a host, an account, a region, or the whole estate.
Scope: restore every environment from the off-site backups in Cloudflare R2 (r2://tlsstress-obs-backups/) plus the two always-available sources of truth: the git repository (all config/IaC) and AWS Secrets Manager (secret values).
Companion docs: observability README · RUNBOOK (day-2 ops) · AWS architecture suite under docs/aws/. This document covers recovery, not day-2 operations.
0. Recovery objectives
| Tier | Asset | RPO (max data loss) | RTO (time to restore) |
|---|---|---|---|
| 0 | Business DB (RDS Postgres) | ≤ 6h (R2 dump every 6h) · ≤ 5min (RDS PITR, if account intact) | 1–2h |
| 1 | Observability stack (Hetzner) | ≤ 24h (daily R2 volume snapshot) | 30–60min |
| 1 | Cloudflare config (DNS/WAF/Workers) | ≤ 24h (daily R2 export) | 30min |
| 2 | App images (customer-app/admin) | 0 (rebuilt from git) | 20–40min (CI build+deploy) |
| 2 | Infra (VPC/RDS/AppRunner/IAM) | 0 (Terraform in git) | 1–3h (terraform apply) |
Backup cadence: daily for the obs box / Cloudflare / Stripe / Auth0 / GitHub; the RDS dump runs every 6h (off-AWS RPO ≈ 6h). Retention: age cap (14 days) + a hard 9 GB size cap (size-aware prune) so the estate can never exceed the R2 free tier (10 GB). A healthchecks.io dead-man's-switch pages if a backup is missed or fails.
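The two retention rules above (age cap, then size cap) can be sketched as a pure function. This is an illustration only, not the logic in backup_r2.py: the `BackupObject` type, the `plan_prune` name, and reading "9 GB" as 9 GiB are assumptions, and the sketch ignores the Bucket Lock retention described in §9, which refuses deletes of objects still inside the lock window.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class BackupObject:      # hypothetical record: one R2 object
    key: str
    day: date            # backup date taken from the key prefix
    size: int            # bytes


def plan_prune(objects, today, keep_days=14, cap_bytes=9 * 1024**3):
    """Return the keys to delete: age cap first, then oldest-first until under the size cap."""
    cutoff = today - timedelta(days=keep_days)
    # Rule 1: everything older than the age cap goes.
    doomed = [o for o in objects if o.day < cutoff]
    # Rule 2: among the survivors, drop the oldest until the total fits the cap.
    kept = sorted((o for o in objects if o.day >= cutoff), key=lambda o: (o.day, o.key))
    total = sum(o.size for o in kept)
    while kept and total > cap_bytes:
        oldest = kept.pop(0)
        doomed.append(oldest)
        total -= oldest.size
    return [o.key for o in doomed]
```

Because the size cap is applied after the age cap, a burst of large recent backups shortens the effective retention window rather than pushing the bucket past the free tier.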
1. What is backed up, and where
Three independent durability layers. R2 is the off-site copy; git and Secrets Manager are the authoritative sources for config and secrets respectively.
| Source of truth | Holds | Lives in |
|---|---|---|
| git (AI_forSE, tlsstress-art.com) | ALL config/IaC: Terraform, k8s, observability/cloud/, app code, CI | GitHub (+ local clones) |
| AWS Secrets Manager | every secret VALUE (tlsstress-obs/*, tlsstress-phase0/*) | AWS (region us-east-1) |
| Cloudflare R2 | the data that is NOT in git/SM (see below) | r2://tlsstress-obs-backups/ |
R2 object layout (r2://tlsstress-obs-backups/)
obs-box/<YYYY-MM-DD>/<volume>.tgz        # Hetzner obs stack STATE volumes
  grafana_data.tgz                       Grafana SQLite (annotations, prefs, API keys, alert state)
  prometheus_data.tgz                    metrics TSDB (≤30d history)
  loki_data.tgz                          logs (≤30d)
  tempo_data.tgz                         traces
  alertmanager_data.tgz                  silences / notification log
  caddy_data.tgz                         Let's Encrypt account + issued certs (avoids re-issue rate limits)
cloudflare/<YYYY-MM-DD>/
  dns_records.json                       every DNS record in the zone
  rulesets.json                          WAF custom rulesets (incl. the synthetic allow-rule)
  workers_scripts.json                   Workers inventory (edge synthetic, etc.)
  zone_settings.json                     zone settings snapshot
  pagerules.json                         page rules
stripe/<YYYY-MM-DD>/
  balance.json + customers/subscriptions/products/prices/invoices/charges/
  payment_intents/coupons/refunds/disputes.json    full RO export (billing crown jewel)
aws/<YYYY-MM-DD>/
  tfstate/<key>                          copy of every live Terraform state (infra blueprint)
  secrets-manifest.json                  Secrets Manager NAMES + ARNs (NOT values): restore checklist
  rds-postgres-<HHMMSSZ>.sql.gz          pg_dump of `postgres` (customer/token-economy data), every 6h
  rds-octopus_admin-<HHMMSSZ>.sql.gz     pg_dump of `octopus_admin` (admin-console data), every 6h (§5.2)
auth0/<YYYY-MM-DD>/                      Auth0 tenant CONFIG (the IdP for admin/obs SSO):
  clients/connections/resource-servers/actions/rules/roles/organizations/
  tenant-settings/branding/prompts.json  (config only, no user PII)
tbi/<...>                                MIRROR of the client TBI images S3 bucket (immutable, latest)
github/<repo>.bundle                     full-history `git bundle` of each repo (latest, restorable)
Producers
- obs-box/* + cloudflare/* + stripe/* + aws/{tfstate,secrets-manifest} +
tbi/* + github/* — the backup-r2 service on the Hetzner box
(observability/cloud/exporters/backup_r2.py), daily, boto3 → R2. It reaches
S3/Secrets Manager/Stripe/GitHub over the internet with dedicated read-only creds.
- aws/rds-*.sql.gz — LIVE: a scheduled in-VPC container Lambda
(obs-rds-backup-r2, EventBridge every 6h cron(30 0/6 * * ? *); source in
observability/cloud/backup-rds-lambda/ + exporters/backup_aws_r2.py), because
RDS is private and unreachable from the box. It dumps every DB in DB_BACKUP_SECRETS
— postgres (via the auto-rotated managed master) and octopus_admin (via
admin-database-url, dumped as its owner role octopus_admin_app since the master
can't read those tables after the role migration) — each to a timestamped key (so
the 6h cadence never overwrites a Bucket Lock-immutable object); tfstate/manifests are
written write-once per day. Native RDS automated snapshots (35-day) + RDS PITR are the
in-account second line.
Per-surface coverage (every public domain is recoverable)
| Domain | Code | Data | Config |
|---|---|---|---|
| tlsstress.art (marketing) | github/tlsstress-art.com.bundle | none | cloudflare/* (DNS/WAF/Workers) |
| app.tlsstress.art (customer-app) | github/AI_forSE.bundle | aws/rds-*.sql.gz + stripe/* | cloudflare/* + aws/secrets-manifest |
| admin.tlsstress.art (admin-console) | github/AI_forSE.bundle | aws/rds-*.sql.gz | cloudflare/* + aws/secrets-manifest |
| status.tlsstress.art | github/AI_forSE.bundle | obs-box/grafana_data etc. | obs-box/caddy_data + cloudflare/* |
| obs.tlsstress.art | github/AI_forSE.bundle | obs-box/* (all volumes) | obs-box/caddy_data + cloudflare/* |
| F1 installer (TBI) | github/AI_forSE.bundle | tbi/* (client images) | aws/tfstate |
Critical AWS data: explicit inventory
The "critical AWS data" copied off-site to R2 is, in priority order:
1. RDS Postgres logical dump: customer-app data, token-economy UTXO ledger, KYC/onboarding, billing state. Irreplaceable. (Native RDS snapshots also cover this in-account; the R2 dump is the off-account copy.)
2. Terraform state: the exact deployed infra graph (also in the TF-state S3 bucket, and re-derivable from git, but the state itself speeds recovery).
3. Secrets manifest (names/ARNs, not values): the checklist of what must be re-provisioned. Secret values stay in Secrets Manager; for full account-loss DR see §5.4 (optional encrypted secrets bundle).
4. Resource manifest: IDs/ARNs needed to wire a rebuild.
Deliberately NOT in R2 (rebuildable, would waste the free tier): ECR images (rebuilt from git by CI), TBI build artifacts (S3, rebuildable), CloudWatch logs/metrics (the obs stack already ingests the important ones).
2. Prerequisites to perform a restore
Collect these before starting (store them OUTSIDE the estate — a password manager):
- R2 S3 credentials (Access Key ID + Secret) + endpoint
https://<ACCOUNT_ID>.r2.cloudflarestorage.com. (Recoverable from the CF dashboard
if you still control the CF account.)
- Cloudflare API token (DNS:Edit + WAF:Edit + Workers:Edit) to re-apply config.
- AWS access (admin or the break-glass role) in account 701047442172 / us-east-1.
- git clones of AI_forSE and tlsstress-art.com.
- SSH keypair for new Hetzner hosts; a Hetzner account + API token.
- rclone or aws CLI (R2 is S3-compatible: aws --endpoint-url <R2> …).
Configure an R2 alias for the commands below (path-style; region auto):
export AWS_ACCESS_KEY_ID=<r2-access-key-id>
export AWS_SECRET_ACCESS_KEY=<r2-secret>
export R2=https://<ACCOUNT_ID>.r2.cloudflarestorage.com
alias r2='aws --endpoint-url "$R2" --region auto'
r2 s3 ls s3://tlsstress-obs-backups/ # list backup dates
DATE=$(r2 s3 ls s3://tlsstress-obs-backups/obs-box/ | awk '{print $2}' | sort | tail -1 | tr -d /)
echo "restoring from $DATE"
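The shell one-liner above takes whatever sorts last under obs-box/. A slightly stricter variant, sketched here in Python under the assumption that keys follow the obs-box/<YYYY-MM-DD>/ layout from §1, only accepts prefixes that parse as real calendar dates. The function name `latest_backup_date` is hypothetical.

```python
import re
from datetime import date

# Matches the dated prefix from the R2 layout in section 1.
_DATE_PREFIX = re.compile(r"^obs-box/(\d{4})-(\d{2})-(\d{2})/")


def latest_backup_date(keys):
    """Return the newest YYYY-MM-DD found under obs-box/, or None if there is none."""
    days = set()
    for key in keys:
        m = _DATE_PREFIX.match(key)
        if not m:
            continue                      # not a dated backup prefix
        try:
            days.add(date(*map(int, m.groups())))
        except ValueError:
            continue                      # looks like a date but is not one (e.g. month 13)
    return max(days).isoformat() if days else None
```

The key list could come from a boto3 `list_objects_v2` call against the R2 endpoint with `Prefix="obs-box/"`; that wiring is left out so the selection logic stays testable offline.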
3. Restore A: Hetzner observability box
Goal: rebuild obs.tlsstress.art (+ status.tlsstress.art) with its history.
1. Provision a host: Hetzner Cloud cx23 (or larger), Ubuntu 22.04+, in hel1 (or any region; the IP changes → step 6). Note its public IPv4.
2. Base setup: install Docker + compose plugin; create /opt/obs.
3. Restore config from git (the source of truth; do NOT rely on R2 for config):
    git clone <AI_forSE> && cp -r AI_forSE/observability/cloud/* /opt/obs/
4. Restore .env from Secrets Manager (values are NOT in R2/git): rebuild /opt/obs/.env from tlsstress-obs/* (Grafana admin pw, OTLP/synthetic tokens, AWS RO key, Auth0, R2 creds, CF tokens). The keys are enumerated in observability/cloud/.env.example.
5. Restore the state volumes from R2 (before first up, so containers see the data):
    for v in grafana_data prometheus_data loki_data tempo_data alertmanager_data caddy_data; do
      r2 s3 cp "s3://tlsstress-obs-backups/obs-box/$DATE/$v.tgz" "/tmp/$v.tgz"
      docker volume create "tlsstress-obs_$v"
      docker run --rm -v "tlsstress-obs_$v:/data" -v /tmp:/b alpine \
        sh -c "cd /data && tar xzf /b/$v.tgz"
    done
   Restoring caddy_data carries the existing Let's Encrypt certs/account, which avoids ACME rate limits on a fast rebuild.
6. Re-point DNS (Cloudflare, DNS-only / grey-cloud so Caddy can ACME): update the obs and status A records to the new IP (see Restore B for the API call).
7. Bring up the stack (split, per project convention):
    cd /opt/obs && docker compose up -d
8. Verify: docker compose ps (all Up) · https://obs.tlsstress.art/grafana/api/health = 200 · dashboards show historical data (proves the TSDB restore) · annotations present in Grafana · https://status.tlsstress.art 200.
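Before starting the volume restore, it can help to confirm that the chosen date actually holds all six archives, since a partial day would otherwise only surface mid-loop. The following is a minimal sketch; `missing_volume_archives` is a hypothetical helper, and the key format is assumed to match the obs-box/<YYYY-MM-DD>/<volume>.tgz layout from §1.

```python
# The six state volumes restored by the loop in the volume-restore step.
VOLUMES = (
    "grafana_data", "prometheus_data", "loki_data",
    "tempo_data", "alertmanager_data", "caddy_data",
)


def missing_volume_archives(keys, day, volumes=VOLUMES):
    """Return the volumes that have no obs-box/<day>/<volume>.tgz object in `keys`."""
    present = set(keys)
    return [v for v in volumes if f"obs-box/{day}/{v}.tgz" not in present]
```

An empty result means the date is complete; otherwise, falling back to the previous date for the missing volumes is one option.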
4. Restore B: Cloudflare configuration
Goal: re-create DNS, WAF rules, Workers, and zone settings in a zone (the same zone after a config wipe, or a new zone after account loss).
1. Pull the latest export from R2:
    r2 s3 cp "s3://tlsstress-obs-backups/cloudflare/$DATE/dns_records.json" .
    r2 s3 cp "s3://tlsstress-obs-backups/cloudflare/$DATE/rulesets.json" .
    r2 s3 cp "s3://tlsstress-obs-backups/cloudflare/$DATE/workers_scripts.json" .
    r2 s3 cp "s3://tlsstress-obs-backups/cloudflare/$DATE/zone_settings.json" .   # needed for step 5
2. DNS records: re-create each record via the API (idempotent: skip existing):
    ZID=<zone-id>; CF=<cf-api-token>
    # keep only the writable fields and drop nulls (e.g. priority on non-MX records)
    jq -c '.result[] | {type,name,content,proxied,ttl,priority,data} | with_entries(select(.value != null))' dns_records.json |
    while read -r rec; do
      curl -s -X POST "https://api.cloudflare.com/client/v4/zones/$ZID/dns_records" \
        -H "Authorization: Bearer $CF" -H "Content-Type: application/json" --data "$rec" >/dev/null
    done
   Critical records to verify first: app, admin, obs, status, f2, apex, and any synthetic/edge records. obs/status MUST be grey-cloud (proxied=false).
3. WAF rulesets: re-create the http_request_firewall_custom entrypoint rules from rulesets.json (notably the synthetic allow-rule: skip Managed Challenge when http.request.headers["x-tls-synthetic"] matches the bypass token). Use PUT /zones/$ZID/rulesets/phases/http_request_firewall_custom/entrypoint.
4. Workers: re-deploy from git (the worker source lives in observability/cloud/synthetics/edge-worker/): wrangler deploy + re-set the worker secrets. workers_scripts.json is the inventory to reconcile against.
5. Zone settings: reconcile zone_settings.json (SSL mode, security level, Bot Fight Mode, etc.) via PATCH /zones/$ZID/settings.
6. Verify: dig the critical names → correct IPs; curl each surface → expected status; the synthetic vantages report synthetic_probe_success again.
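The DNS step can also be prepared offline before any API call is made. The sketch below assumes, as the jq filter in the DNS step does, that dns_records.json is the raw API list response with a top-level "result" array. The function names are hypothetical, and the grey-cloud host list is passed in rather than hard-coded.

```python
# Fields the DNS step sends back to the API; everything else in the export is dropped.
WRITABLE_FIELDS = ("type", "name", "content", "proxied", "ttl", "priority", "data")


def dns_payloads(export):
    """Build one POST body per exported record, keeping only populated writable fields."""
    return [
        {k: rec[k] for k in WRITABLE_FIELDS if rec.get(k) is not None}
        for rec in export.get("result", [])
    ]


def proxied_violations(payloads, grey_hosts):
    """Return names that must be DNS-only (grey-cloud) but are marked proxied."""
    return sorted(
        p["name"] for p in payloads
        if p.get("name") in grey_hosts and p.get("proxied")
    )
```

Running `proxied_violations(payloads, {"obs.tlsstress.art", "status.tlsstress.art"})` before posting gives an early warning if the export would re-create obs or status behind the proxy, which would block Caddy's ACME flow.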
5. Restore C: AWS
AWS recovery has three independent parts: data (RDS), infra (Terraform), secrets (Secrets Manager). Pick the path by failure mode.
5.1 Infra (Terraform): rebuild the account/region
git clone <AI_forSE> && cd AI_forSE/infra # or the terraform root
terraform init # backend = TF-state S3 bucket
terraform plan                                 # if state lost: `terraform import` or
                                               # restore aws/<date>/tfstate/<key> first
terraform apply                                # recreates VPC, RDS, App Runner, IAM, ECR…
If the state is lost, restore it from R2 before running plan:
r2 s3 cp "s3://tlsstress-obs-backups/aws/$DATE/tfstate/<key>" ./terraform.tfstate
5.2 RDS database: restore the business data
Path 1 — account intact (preferred): native RDS restore. Use automated snapshots (35-day retention) or point-in-time-restore (RPO ≈ 5 min):
aws rds restore-db-instance-to-point-in-time --source-db-instance-identifier tlsstress-phase0-priv \
--target-db-instance-identifier tlsstress-phase0-restore --use-latest-restorable-time \
--profile tlsstress-prod --region us-east-1
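Path 2 (account lost, or no usable snapshot): restore from the R2 logical dumps. There is one dump per database (postgres AND octopus_admin), with timestamped keys (rds-<db>-<HHMMSSZ>.sql.gz). List the day's folder, pick the latest (or the one nearest the target time) per DB, and restore each into its database: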
r2 s3 ls "s3://tlsstress-obs-backups/aws/$DATE/" | grep -E 'rds-.*\.sql\.gz'
# postgres (customer/token-economy):
r2 s3 cp "s3://tlsstress-obs-backups/aws/$DATE/rds-postgres-<HHMMSSZ>.sql.gz" .
gunzip -c rds-postgres-*.sql.gz | psql "postgresql://tlsstress_admin:…@<host>/postgres?sslmode=require"
# octopus_admin (admin-console): create the DB + owner role first if rebuilding from scratch
r2 s3 cp "s3://tlsstress-obs-backups/aws/$DATE/rds-octopus_admin-<HHMMSSZ>.sql.gz" .
gunzip -c rds-octopus_admin-*.sql.gz | psql "postgresql://tlsstress_admin:…@<host>/octopus_admin?sslmode=require"
To reach the private RDS instance for the load, connect from an in-VPC bastion / the NAT instance, or use an SSM port-forward (see observability RUNBOOK §DB). Use the tlsstress_admin credentials from Secrets Manager (database-url-rollback).
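Choosing a dump per database from the listing can be sketched as follows. The key pattern is the one from §1 (aws/<YYYY-MM-DD>/rds-<db>-<HHMMSSZ>.sql.gz). The `pick_dump` name is hypothetical, and reading "nearest target time" as "latest dump taken at or before the target" is an interpretation chosen here, so that a restore never includes data written after the point being recovered to.

```python
import re
from datetime import datetime, timezone

_DUMP_KEY = re.compile(
    r"^aws/(?P<day>\d{4}-\d{2}-\d{2})/rds-(?P<db>.+)-(?P<t>\d{6})Z\.sql\.gz$"
)


def pick_dump(keys, db, target=None):
    """Return the newest dump key for `db`, optionally no later than `target` (aware UTC)."""
    candidates = []
    for key in keys:
        m = _DUMP_KEY.match(key)
        if not m or m["db"] != db:
            continue
        taken = datetime.strptime(m["day"] + m["t"], "%Y-%m-%d%H%M%S")
        taken = taken.replace(tzinfo=timezone.utc)   # the Z suffix marks UTC
        if target is None or taken <= target:
            candidates.append((taken, key))
    return max(candidates)[1] if candidates else None
```

Passing keys from more than one day's folder lets the same function reach back across midnight when the target time falls before the first dump of the day.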
5.3 App images: rebuild from git (RPO 0)
No image backup needed. Trigger the deploy workflows (build natively on a GH x64
runner → ECR → App Runner): customer-app-deploy.yml, admin-console-deploy.yml.
5.4 Secrets: re-provision
Secret values live in Secrets Manager. aws/<date>/secrets-manifest.json (R2) is
the checklist of names/ARNs to recreate. Full account-loss (SM gone too):
restore from the optional encrypted secrets bundle if enabled — an
age/gpg-encrypted export to r2://…/aws/<date>/secrets.age, decryptable only with
the offline DR key (kept in a password manager, never in the estate). If the bundle is
not enabled, secrets must be regenerated/rotated (most are: API tokens, app pepper,
HMAC keys) — see the per-secret notes in .env.example and the service READMEs.
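Using the manifest as a checklist amounts to a set difference between the names it lists and the names that exist in the rebuilt account. The sketch below assumes the manifest is a JSON array of objects with "Name" and "ARN" keys; the real shape of secrets-manifest.json is not shown in this document, so that structure, the function name, and the secret names in the usage example are all hypothetical.

```python
import json


def missing_secrets(manifest_json, present_names):
    """Return manifest secret names that do not yet exist in the target account."""
    entries = json.loads(manifest_json)          # assumed: [{"Name": ..., "ARN": ...}, ...]
    wanted = {entry["Name"] for entry in entries}
    return sorted(wanted - set(present_names))
```

For example, with a manifest listing `tlsstress-obs/example-a` and `tlsstress-phase0/example-b` and only the first re-created, the function returns `["tlsstress-phase0/example-b"]`. The present names could come from a Secrets Manager list call in the new account.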
5.5 Verify AWS
App Runner services RUNNING + /api/health = 200 through Cloudflare; a smoke login;
RDS reachable with expected row counts; CloudWatch alarms green.
5c. Restore: Auth0 tenant (SSO identity provider)
auth0/<date>/*.json is the tenant config (clients, connections, resource
servers, actions/rules, roles, organizations, tenant settings, branding, prompts).
Restore into a (recreated) Auth0 tenant — easiest with the Auth0 Deploy CLI:
r2 s3 cp --recursive "s3://tlsstress-obs-backups/auth0/$DATE/" "./auth0/"
# Map the JSON into a Deploy-CLI directory, then:
a0deploy import --config_file config.json --input_file ./auth0/ # or via Management API PATCH/POST
Afterwards, update the Auth0-related secrets (*-auth0-*, grafana-auth0).
The export is config only (no user passwords; those are unexportable, so users re-enrol or are migrated via Auth0's user import). Client/connection secrets are included only if the M2M token has the read:client_keys / connection-read scopes.
5b. Restore: GitHub repos · Stripe · TBI images
GitHub repos — each github/<repo>.bundle is a full mirror (every branch, tag,
and commit). Restore:
r2 s3 cp "s3://tlsstress-obs-backups/github/AI_forSE.bundle" .
git clone AI_forSE.bundle AI_forSE # working clone from the bundle
# re-publish to a fresh remote after account loss:
git clone --mirror AI_forSE.bundle && cd AI_forSE.git && git push --mirror <new-remote>
Stripe: stripe/<date>/*.json is a read-only export (records, not a live restore; Stripe is the system of record). Use it for audit/reconciliation, dispute evidence, or to re-create the catalog via the Stripe API from products.json / prices.json if the account is rebuilt. Live data is also recoverable from Stripe's own dashboard/exports.
TBI client images — tbi/* mirrors the S3 artifact bucket (immutable). Restore =
copy back into a (recreated) S3 bucket, or serve directly:
r2 s3 cp --recursive "s3://tlsstress-obs-backups/tbi/" "./tbi/"
aws s3 cp --recursive ./tbi/ "s3://tlsstress-bootstrap-artifacts/" --profile tlsstress-prod
6. Full-estate recovery order (worst case: total loss)
1. AWS infra (terraform apply) → VPC/RDS/App Runner/IAM/ECR exist.
2. Secrets → re-provision Secrets Manager (§5.4).
3. RDS data → restore from R2 dump or snapshot (§5.2).
4. App images → CI build+deploy (§5.3).
5. Cloudflare → DNS/WAF/Workers (§4); this points the world at the new infra.
6. Hetzner obs box → rebuild + restore volumes (§3); observability returns.
7. Re-verify the backup pipeline itself: backup-r2 runs, R2 receives a fresh object, the healthchecks ping turns green.
7. DR drills (run these on schedule; a backup you haven't restored is a hope)
- Restore test (non-prod): monthly, restore the latest obs-box/* volumes into a throwaway compose project and confirm Grafana shows the history.
- RDS dump test: quarterly, gunzip | psql the latest rds-*.sql.gz into a scratch Postgres; check row counts of the ledger + customers tables.
- CF config diff: quarterly, diff the live zone against the latest cloudflare/* export; investigate drift.
- Tabletop: walk §6 end-to-end on paper; time each step; update RTO/RPO above.
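For the Cloudflare config diff drill, DNS drift can be computed from two record lists: the "result" array of the latest dns_records.json and the same listing fetched from the live zone. The sketch below is one possible approach, not an existing tool in the repo. `dns_drift` is a hypothetical name, and keying records by (type, name, content) is a design choice so that multi-value names such as round-robin A records or several TXT records compare correctly.

```python
def _index(records):
    """Map (type, name, content) -> (proxied, ttl) for a list of DNS record dicts."""
    return {
        (r["type"], r["name"], r["content"]): (r.get("proxied"), r.get("ttl"))
        for r in records
    }


def dns_drift(exported, live):
    """Compare the backed-up records with the live zone and report the differences."""
    old, new = _index(exported), _index(live)
    return {
        "missing": sorted(old.keys() - new.keys()),   # in the backup, gone from the zone
        "extra": sorted(new.keys() - old.keys()),     # in the zone, not in the backup
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

A changed content value shows up as one "missing" plus one "extra" entry, while a flipped proxy flag or TTL lands in "changed".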
8. Backup health & monitoring (in the observability stack)
The backups are first-class observability citizens. backup_r2.py emits a
Prometheus textfile (/textfile/backup_r2.prom, scraped by node-exporter):
| Metric | Meaning |
|---|---|
| backup_r2_last_success_timestamp_seconds | freshness (recovery point) |
| backup_r2_success | 1 if the last run fully succeeded (now reflects any source failure, not just volumes) |
| backup_r2_total_bytes / backup_r2_limit_bytes | R2 disk usage vs the 10 GB free tier |
| backup_r2_source_objects / _bytes{source} | per-source breakdown |
| backup_r2_source_success{source} | 1=last attempt OK, 0=failed, per source (github/stripe/auth0/aws/cloudflare/tbi/obs-box) |
| backup_r2_source_last_success_timestamp_seconds{source} | per-source recovery point; a failing source stops advancing → alertable (no silent failure) |
| backup_r2_source_failures{source} | count of failed items in the last attempt |
| backup_r2_bucket_public | SECURITY: 1 if the bucket is publicly exposed |
| backup_r2_bucket_lock_enabled | SECURITY (anti-ransomware): 1 if R2 Bucket Lock immutability is active |
| backup_r2_lock_retention_days | how long every object stays immutable (no delete/overwrite) |
- Dashboard: Grafana “Backups & DR (R2)” (backups-dr): last success, run result, R2 usage gauge, bucket-exposure, immutability + lock retention, per-source bytes/objects, usage over time, and a “Backup success & freshness per source” table (OK/FAILED + age per source: GitHub, Stripe, Auth0, AWS, CF, TBI, obs-box).
- Alerts (prometheus/alerts.yml, group backup_dr → Alertmanager → Slack): R2BackupStale (no success >26h, critical), R2BackupDegraded (a source failed), R2UsageHigh (>85% of the free tier), R2BucketPublic (critical security), R2BucketLockDisabled (critical: immutability removed → backups deletable), R2SourceBackupFailed (a single source, e.g. GitHub, failed in the last run), R2SourceBackupStale (a source has not succeeded in >30h). These make a single-source failure non-silent (previously only volume failures flipped backup_r2_success). (The RDS dump is produced by the separate in-VPC Lambda, monitored via its own CloudWatch errors + healthchecks dead-man, so it is not a backup_r2_source_* series.)
- Dead-man's switch: the service also pings healthchecks.io (/start + /<rc>), so even a totally dead box (no metrics) pages you (email + Slack).
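The alert thresholds listed above can be restated as a small evaluation function, which is handy when sanity-checking a metrics snapshot by hand during a drill. This mirrors the thresholds in the prose only; the actual rules live in prometheus/alerts.yml as PromQL, which is not reproduced here. The `firing_alerts` name and the dict layout (including the `source_last_success` key) are assumptions.

```python
HOUR = 3600


def firing_alerts(now, m):
    """Evaluate a snapshot of backup metrics against the thresholds from the alert list."""
    alerts = []
    if now - m["backup_r2_last_success_timestamp_seconds"] > 26 * HOUR:
        alerts.append("R2BackupStale")
    if m["backup_r2_total_bytes"] > 0.85 * m["backup_r2_limit_bytes"]:
        alerts.append("R2UsageHigh")
    if m["backup_r2_bucket_public"] == 1:
        alerts.append("R2BucketPublic")
    if m["backup_r2_bucket_lock_enabled"] == 0:
        alerts.append("R2BucketLockDisabled")
    # Per-source freshness: a source that stopped advancing for more than 30h is stale.
    for source, ts in sorted(m["source_last_success"].items()):
        if now - ts > 30 * HOUR:
            alerts.append(f"R2SourceBackupStale[{source}]")
    return alerts
```

The 26h and 30h windows leave slack over the daily cadence, so a single slow run does not page while a genuinely skipped day does.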
9. Security & access controls (backups must never be public or tampered with)
- Private by default + verified: the R2 bucket has no public r2.dev managed domain and no custom domain; S3 access requires SigV4-signed requests (anonymous GET → HTTP 400). The backup_r2_bucket_public metric + R2BucketPublic alert continuously detect any accidental/malicious public exposure.
- Encryption: R2 encrypts all objects at rest (AES-256); all transfers are TLS.
- Least-privilege creds: the box uses a scoped Cloudflare R2 credential (kept in the box .env, chmod 600) and a dedicated read-only AWS IAM user (obs-backup-ro: S3 read on the artifact/state buckets + secretsmanager:ListSecrets only, so it can read NO secret values). GitHub uses a read-only PAT; Stripe a restricted read-only key; Auth0 a read-only M2M app.
- No secret values in R2: Secrets Manager values are NOT copied, only names/ARNs (the manifest). Secret values stay in AWS SM (the system of record).
- Anti-ransomware (immutability), LIVE: every object is protected by an R2 Bucket Lock retention rule (ransomware-retention, Age=LOCK_RETENTION_DAYS, default 7 days), so an attacker who steals the box's upload credential cannot delete or encrypt/overwrite any backup until it is 7 days old (verified: DELETE and PUT-overwrite of a fresh object both return ObjectLockedByBucketPolicy). R2's S3 API implements neither versioning nor Object Lock, so the rule is applied via the Cloudflare API (PUT /accounts/{acct}/r2/buckets/{bucket}/lock) by backup_r2.py on every run (idempotent). Because LOCK_RETENTION_DAYS < KEEP_DAILY (7 < 14), the age-based prune only ever removes objects whose lock has already expired. All writes are write-once (github bundles are dated; same-day re-runs HEAD-skip) so the lock never blocks a legitimate backup. The backup_r2_bucket_lock_enabled gauge + R2BucketLockDisabled alert detect tampering (the rule being removed).
- Override / cleanup (operator): to remove or shorten retention (e.g. to delete a locked object early), an operator with an R2-write Cloudflare token can PUT an empty/looser ruleset to the /lock endpoint. The box's credential path re-applies the strict rule on its next run, so do this deliberately.
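The LOCK_RETENTION_DAYS < KEEP_DAILY invariant can be checked with a few lines. The sketch below lists the object ages at which the age-based prune would attempt a delete that the lock would still refuse. The function name is hypothetical, and the exact boundary conditions (strictly older than KEEP_DAILY, strictly younger than the lock window) are assumptions about how the real prune compares ages.

```python
def blocked_prunes(ages_days, keep_daily=14, lock_days=7):
    """Return the ages the age-based prune would delete while the lock still protects them."""
    return [a for a in ages_days if a > keep_daily and a < lock_days]
```

With the defaults the result is always empty, which is the property the section relies on. Raising the lock window above the retention window (for example `lock_days=30`) makes ages 15 through 29 appear, meaning the prune would fail on those objects until their lock expired.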
Appendix: restore one-liners
# latest available backup date
r2 s3 ls s3://tlsstress-obs-backups/obs-box/ | awk '{print $2}' | sort | tail -1
# total R2 usage (free-tier check)
r2 s3 ls --summarize --human-readable --recursive s3://tlsstress-obs-backups/ | tail -2
# pull an entire day's backup locally
r2 s3 cp --recursive "s3://tlsstress-obs-backups/obs-box/$DATE/" "./restore-$DATE/"