Disaster Recovery — Restore Runbook (Hetzner · Cloudflare · AWS from R2)

Audience: the operator (or a successor) rebuilding tlsstress.art after the loss of a host, an account, a region, or the whole estate. Scope: restore every environment from the off-site backups in Cloudflare R2 (r2://tlsstress-obs-backups/) plus the two always-available sources of truth: the git repository (all config/IaC) and AWS Secrets Manager (secret values).

Companion docs: observability README · RUNBOOK (day-2 ops) · AWS architecture suite under docs/aws/. This document is recovery, not day-2.


0. Recovery objectives

Tier | Asset | RPO (max data loss) | RTO (time to restore)
0 | Business DB (RDS Postgres) | ≤ 6h (R2 dump every 6h) · ≤ 5min (RDS PITR, if account intact) | 1–2h
1 | Observability stack (Hetzner) | ≤ 24h (daily R2 volume snapshot) | 30–60min
1 | Cloudflare config (DNS/WAF/Workers) | ≤ 24h (daily R2 export) | 30min
2 | App images (customer-app/admin) | 0 (rebuilt from git) | 20–40min (CI build+deploy)
2 | Infra (VPC/RDS/AppRunner/IAM) | 0 (Terraform in git) | 1–3h (terraform apply)

Backup cadence: daily for the obs box / Cloudflare / Stripe / Auth0 / GitHub; the RDS dump runs every 6h (off-AWS RPO ≈ 6h). Retention: age cap (14 days) + a hard 9 GB size cap (size-aware prune) so the estate can never exceed the R2 free tier (10 GB). A healthchecks.io dead-man's-switch pages if a backup is missed or fails.
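The two retention caps can be read as a two-pass prune: drop every dated folder older than the age cap, then, if what remains still exceeds the size cap, drop the oldest survivors until it fits. The sketch below illustrates that reading only; it is not the logic in backup_r2.py. The function name prune_plan and its input format (one `<YYYY-MM-DD> <bytes>` line per dated folder) are assumptions made for illustration.

```shell
# prune_plan CUTOFF_DATE CAP_BYTES   (hypothetical helper)
# stdin : one "<YYYY-MM-DD> <bytes>" line per dated backup folder
# stdout: "<date> age"  for folders older than CUTOFF_DATE,
#         "<date> size" for the oldest survivors dropped to get under CAP_BYTES
prune_plan() {
  local cutoff="$1" cap="$2"
  sort | awk -v cutoff="$cutoff" -v cap="$cap" '
    $1 < cutoff { print $1, "age"; next }        # pass 1: age cap (ISO dates sort as text)
    { n++; d[n] = $1; b[n] = $2; total += $2 }   # survivors, oldest first
    END {
      for (i = 1; i <= n && total > cap; i++) {  # pass 2: size cap, oldest first
        print d[i], "size"
        total -= b[i]
      }
    }'
}

# Usage: 14-day age cap expressed as a cutoff date, 9 GB size cap
#   prune_plan 2024-05-05 9000000000 < sizes-per-day.txt
```

An object still inside its Bucket Lock window cannot be deleted, so a real size pass would have to skip those folders rather than fail on them.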


1. What is backed up, and where

Three independent durability layers. R2 is the off-site copy; git and Secrets Manager are the authoritative sources for config and secrets respectively.

Source of truth | Holds | Lives in
git (AI_forSE, tlsstress-art.com) | ALL config/IaC: Terraform, k8s, observability/cloud/, app code, CI | GitHub (+ local clones)
AWS Secrets Manager | every secret VALUE (tlsstress-obs/*, tlsstress-phase0/*) | AWS (region us-east-1)
Cloudflare R2 | the data that is NOT in git/SM (see below) | r2://tlsstress-obs-backups/

R2 object layout (r2://tlsstress-obs-backups/)

obs-box/<YYYY-MM-DD>/<volume>.tgz      # Hetzner obs stack STATE volumes
   grafana_data.tgz       Grafana SQLite (annotations, prefs, API keys, alert state)
   prometheus_data.tgz    metrics TSDB (≤30d history)
   loki_data.tgz          logs (≤30d)
   tempo_data.tgz         traces
   alertmanager_data.tgz  silences / notification log
   caddy_data.tgz         Let's Encrypt account + issued certs (avoids re-issue rate limits)
cloudflare/<YYYY-MM-DD>/
   dns_records.json       every DNS record in the zone
   rulesets.json          WAF custom rulesets (incl. the synthetic allow-rule)
   workers_scripts.json   Workers inventory (edge synthetic, etc.)
   zone_settings.json     zone settings snapshot
   pagerules.json         page rules
stripe/<YYYY-MM-DD>/
   balance.json + customers/subscriptions/products/prices/invoices/charges/
   payment_intents/coupons/refunds/disputes.json   full RO export (billing crown jewel)
aws/<YYYY-MM-DD>/
   tfstate/<key>          copy of every live Terraform state (infra blueprint)
   secrets-manifest.json  Secrets Manager NAMES + ARNs (NOT values) — restore checklist
   rds-postgres-<HHMMSSZ>.sql.gz        pg_dump of `postgres` (customer/token-economy data), every 6h
   rds-octopus_admin-<HHMMSSZ>.sql.gz   pg_dump of `octopus_admin` (admin-console data), every 6h (§5.2)
auth0/<YYYY-MM-DD>/        Auth0 tenant CONFIG (the IdP for admin/obs SSO):
   clients/connections/resource-servers/actions/rules/roles/organizations/
   tenant-settings/branding/prompts.json   (config only — no user PII)
tbi/<...>                 MIRROR of the client TBI images S3 bucket (immutable, latest)
github/<repo>.bundle      full-history `git bundle` of each repo (latest, restorable)

Producers

  • obs-box/* + cloudflare/* + stripe/* + auth0/* + aws/{tfstate,secrets-manifest} + tbi/* + github/*: the backup-r2 service on the Hetzner box (observability/cloud/exporters/backup_r2.py), daily, boto3 → R2. It reaches S3/Secrets Manager/Stripe/GitHub over the internet with dedicated read-only creds.
  • aws/rds-*.sql.gz (LIVE): a scheduled in-VPC container Lambda (obs-rds-backup-r2, EventBridge every 6h, cron(30 0/6 * * ? *); source in observability/cloud/backup-rds-lambda/ + exporters/backup_aws_r2.py), because RDS is private and unreachable from the box. It dumps every DB in DB_BACKUP_SECRETS: postgres (via the auto-rotated managed master) and octopus_admin (via admin-database-url, dumped as its owner role octopus_admin_app since the master can't read those tables after the role migration). Each dump goes to a timestamped key, so the 6h cadence never overwrites a Bucket Lock-immutable object; tfstate/manifests are written write-once per day.

Native RDS automated snapshots (35-day) + RDS PITR are the in-account second line.

Per-surface coverage (every public domain is recoverable)

Domain | Code | Data | Config
tlsstress.art (marketing) | github/tlsstress-art.com.bundle | (none) | cloudflare/* (DNS/WAF/Workers)
app.tlsstress.art (customer-app) | github/AI_forSE.bundle | aws/rds-*.sql.gz + stripe/* | cloudflare/* + aws/secrets-manifest
admin.tlsstress.art (admin-console) | github/AI_forSE.bundle | aws/rds-*.sql.gz | cloudflare/* + aws/secrets-manifest
status.tlsstress.art | github/AI_forSE.bundle | obs-box/grafana_data etc. | obs-box/caddy_data + cloudflare/*
obs.tlsstress.art | github/AI_forSE.bundle | obs-box/* (all volumes) | obs-box/caddy_data + cloudflare/*
F1 installer (TBI) | github/AI_forSE.bundle | tbi/* (client images) | aws/tfstate

Critical AWS data — explicit inventory

The "critical AWS data" copied off-site to R2 is, in priority order:

  1. RDS Postgres logical dump: customer-app data, token-economy UTXO ledger, KYC/onboarding, billing state. Irreplaceable. (Native RDS snapshots also cover this in-account; the R2 dump is the off-account copy.)
  2. Terraform state: the exact deployed infra graph (also in the TF-state S3 bucket + can be re-derived from git, but the state itself speeds recovery).
  3. Secrets manifest (names/ARNs, not values): the checklist of what must be re-provisioned. Secret values stay in Secrets Manager; for full account-loss DR see §5.4 (optional encrypted secrets bundle).
  4. Resource manifest: IDs/ARNs needed to wire a rebuild.

Deliberately NOT in R2 (rebuildable, would waste the free tier): ECR images (rebuilt from git by CI), TBI build artifacts (S3, rebuildable), CloudWatch logs/metrics (the obs stack already ingests the important ones).


2. Prerequisites to perform a restore

Collect these before starting (store them OUTSIDE the estate, in a password manager):

  • R2 S3 credentials (Access Key ID + Secret) + endpoint https://<ACCOUNT_ID>.r2.cloudflarestorage.com. (Recoverable from the CF dashboard if you still control the CF account.)
  • Cloudflare API token (DNS:Edit + WAF:Edit + Workers:Edit) to re-apply config.
  • AWS access (admin or the break-glass role) in account 701047442172 / us-east-1.
  • git clones of AI_forSE and tlsstress-art.com.
  • SSH keypair for new Hetzner hosts; a Hetzner account + API token.
  • rclone or aws CLI (R2 is S3-compatible: aws --endpoint-url <R2> …).

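Configure an r2 wrapper for the commands below (path-style; region auto). It is defined as a shell function rather than an alias so that it also works when these commands are pasted into a script, where bash does not expand aliases by default:

export AWS_ACCESS_KEY_ID=<r2-access-key-id>
export AWS_SECRET_ACCESS_KEY=<r2-secret>
export R2=https://<ACCOUNT_ID>.r2.cloudflarestorage.com
r2() { aws --endpoint-url "$R2" --region auto "$@"; }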
r2 s3 ls s3://tlsstress-obs-backups/            # list backup dates
DATE=$(r2 s3 ls s3://tlsstress-obs-backups/obs-box/ | awk '{print $2}' | sort | tail -1 | tr -d /)
echo "restoring from $DATE"
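
The newest dated folder is not necessarily a complete one, since a run may have failed partway. A small check that a day's listing contains all six volume archives before committing to that $DATE might look like the sketch below. missing_volumes is a hypothetical helper, and it assumes the usual `aws s3 ls` line shape with the object name in the last column.

```shell
# Volume names as listed in the R2 object layout above.
OBS_VOLUMES="grafana_data prometheus_data loki_data tempo_data alertmanager_data caddy_data"

# missing_volumes (hypothetical helper): reads `aws s3 ls` output for one
# obs-box/<date>/ prefix on stdin and prints each expected volume that has
# no <volume>.tgz object. Empty output means the day is complete.
missing_volumes() {
  local names v
  names=$(awk '{print $NF}')            # last column = object name
  for v in $OBS_VOLUMES; do
    printf '%s\n' "$names" | grep -qxF "$v.tgz" || echo "$v"
  done
}

# Usage (needs r2 and $DATE from above):
#   r2 s3 ls "s3://tlsstress-obs-backups/obs-box/$DATE/" | missing_volumes
```

If anything is printed, fall back to the previous date and re-check before restoring.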


3. Restore A — Hetzner observability box

Goal: rebuild obs.tlsstress.art (+ status.tlsstress.art) with its history.

  1. Provision a host — Hetzner Cloud cx23 (or larger), Ubuntu 22.04+, in hel1 (or any region; the IP changes → step 6). Note its public IPv4.
  2. Base setup — install Docker + compose plugin; create /opt/obs.
  3. Restore config from git (the source of truth — do NOT rely on R2 for config):
    git clone <AI_forSE> && cp -r AI_forSE/observability/cloud/. /opt/obs/   # "/." also copies dotfiles such as .env.example
    
  4. Restore .env from Secrets Manager (values are NOT in R2/git): rebuild /opt/obs/.env from tlsstress-obs/* (Grafana admin pw, OTLP/synthetic tokens, AWS RO key, Auth0, R2 creds, CF tokens). The keys are enumerated in observability/cloud/.env.example.
  5. Restore the state volumes from R2 (before first up, so containers see the data):
    for v in grafana_data prometheus_data loki_data tempo_data alertmanager_data caddy_data; do
      r2 s3 cp "s3://tlsstress-obs-backups/obs-box/$DATE/$v.tgz" "/tmp/$v.tgz"
      docker volume create "tlsstress-obs_$v"
      docker run --rm -v "tlsstress-obs_$v:/data" -v /tmp:/b alpine \
        sh -c "cd /data && tar xzf /b/$v.tgz"
    done
    

    Restoring caddy_data carries the existing Let's Encrypt certs/account → avoids ACME rate limits on a fast rebuild.

  6. Re-point DNS (Cloudflare, DNS-only / grey-cloud so Caddy can ACME): update obs and status A records to the new IP (see Restore B for the API call).
  7. Bring up the stack (split, per project convention):
    cd /opt/obs && docker compose up -d
    
  8. Verify: docker compose ps (all Up) · https://obs.tlsstress.art/grafana/api/health = 200 · dashboards show historical data (proves the TSDB restore) · annotations present in Grafana · https://status.tlsstress.art 200.
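
Step 5 extracts each archive straight into a fresh volume, so a truncated or corrupt download would only show up later as a broken container. A pre-flight check along these lines could run on each /tmp/<volume>.tgz before the extraction loop. tgz_is_safe is a hypothetical helper, and the path rules it enforces (relative entries only, no `..` components) are an assumption about how the archives were written.

```shell
# tgz_is_safe FILE (hypothetical helper): succeeds only if FILE is a readable
# gzip'd tar whose entries are all relative paths without ".." components.
tgz_is_safe() {
  local f="$1" names
  names=$(tar tzf "$f" 2>/dev/null) || { echo "unreadable archive: $f" >&2; return 1; }
  if printf '%s\n' "$names" | grep -Eq '^/|(^|/)\.\.(/|$)'; then
    echo "suspicious path in $f" >&2
    return 1
  fi
}

# Usage, inside the restore loop before `docker run ... tar xzf`:
#   tgz_is_safe "/tmp/$v.tgz" || { echo "skipping $v" >&2; continue; }
```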

4. Restore B — Cloudflare configuration

Goal: re-create DNS, WAF rules, Workers, and zone settings in a zone (the same zone after a config wipe, or a new zone after account loss).

  1. Pull the latest export from R2:
    # zone_settings.json (step 5) and pagerules.json are part of the same export
    for f in dns_records rulesets workers_scripts zone_settings pagerules; do
      r2 s3 cp "s3://tlsstress-obs-backups/cloudflare/$DATE/$f.json" .
    done
    
  2. DNS records — re-create each record via the API (idempotent: skip existing):
    ZID=<zone-id>; CF=<cf-api-token>
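    # drop null fields: priority and data are absent on most record types
    jq -c '.result[] | {type,name,content,proxied,ttl,priority,data} | with_entries(select(.value != null))' dns_records.json |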
    while read -r rec; do
      curl -s -X POST "https://api.cloudflare.com/client/v4/zones/$ZID/dns_records" \
        -H "Authorization: Bearer $CF" -H "Content-Type: application/json" --data "$rec" >/dev/null
    done
    

    Critical records to verify first: app, admin, obs, status, f2, apex, and any synthetic/edge records. obs/status MUST be grey-cloud (proxied=false).

  3. WAF rulesets — re-create the http_request_firewall_custom entrypoint rules from rulesets.json (notably the synthetic allow-rule: skip Managed Challenge when http.request.headers["x-tls-synthetic"] matches the bypass token). Use PUT /zones/$ZID/rulesets/phases/http_request_firewall_custom/entrypoint.
  4. Workers — re-deploy from git (the worker source lives in observability/cloud/synthetics/edge-worker/): wrangler deploy + re-set the worker secrets. workers_scripts.json is the inventory to reconcile against.
  5. Zone settings — reconcile zone_settings.json (SSL mode, security level, Bot Fight Mode, etc.) via PATCH /zones/$ZID/settings.
  6. Verify: dig the critical names → correct IPs; curl each surface → expected status; the synthetic vantages report synthetic_probe_success again.
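
Step 6 of Restore A defers the re-point call to this section. The sketch below shows one way to update an existing A record. It assumes the record ID has been looked up first in the zone's DNS record list, and that a partial update via PATCH on /zones/$ZID/dns_records/<record-id> fits the case. a_record_patch is a hypothetical helper that only builds the request body, and RID is an illustrative variable name.

```shell
# a_record_patch IP (hypothetical helper): prints the JSON body that points an
# A record at a new IPv4 address and keeps it DNS-only (grey-cloud).
a_record_patch() {
  local ip="$1"
  case "$ip" in
    ''|*[!0-9.]*) echo "not an IPv4 address: '$ip'" >&2; return 1 ;;
  esac
  printf '{"type":"A","content":"%s","proxied":false}\n' "$ip"
}

# Usage (RID = the record's id; ZID and CF as defined in step 2):
#   curl -s -X PATCH "https://api.cloudflare.com/client/v4/zones/$ZID/dns_records/$RID" \
#     -H "Authorization: Bearer $CF" -H "Content-Type: application/json" \
#     --data "$(a_record_patch 203.0.113.10)"
```

Setting proxied to false explicitly matches the requirement that obs and status stay grey-cloud so Caddy can complete ACME.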

5. Restore C — AWS

AWS recovery has three independent parts: data (RDS), infra (Terraform), secrets (Secrets Manager). Pick the path by failure mode.

5.1 Infra (Terraform) — rebuild the account/region

git clone <AI_forSE> && cd AI_forSE/infra   # or the terraform root
terraform init                              # backend = TF-state S3 bucket
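terraform plan                              # if state lost: `terraform import` or
                                            # restore aws/<date>/tfstate/<key> first
terraform apply                             # recreates VPC, RDS, App Runner, IAM, ECR…
If the state is lost (account loss), seed it from R2 before plan. The copies live under aws/<date>/tfstate/<key> (see the R2 object layout); list them and fetch the one for this root. If the root uses the S3 backend, push the file into the re-created backend, because a local terraform.tfstate is not read while a remote backend is configured:
r2 s3 ls --recursive "s3://tlsstress-obs-backups/aws/$DATE/tfstate/"
r2 s3 cp "s3://tlsstress-obs-backups/aws/$DATE/tfstate/<key>" ./terraform.tfstate
terraform state push ./terraform.tfstate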

5.2 RDS database — restore the business data

Path 1 — account intact (preferred): native RDS restore. Use automated snapshots (35-day retention) or point-in-time-restore (RPO ≈ 5 min):

aws rds restore-db-instance-to-point-in-time --source-db-instance-identifier tlsstress-phase0-priv \
  --target-db-instance-identifier tlsstress-phase0-restore --use-latest-restorable-time \
  --profile tlsstress-prod --region us-east-1
Path 2 — off-account / logical restore from R2 (the 6-hourly dumps). Both databases are dumped (postgres AND octopus_admin), keys timestamped (rds-<db>-<HHMMSSZ>.sql.gz) — list the day's folder, pick the latest (or nearest target time) per DB, and restore each into its database:
r2 s3 ls "s3://tlsstress-obs-backups/aws/$DATE/" | grep -E 'rds-.*\.sql\.gz'
# postgres (customer/token-economy):
r2 s3 cp "s3://tlsstress-obs-backups/aws/$DATE/rds-postgres-<HHMMSSZ>.sql.gz" .
gunzip -c rds-postgres-*.sql.gz | psql "postgresql://tlsstress_admin:…@<host>/postgres?sslmode=require"
# octopus_admin (admin-console): create the DB + owner role first if rebuilding from scratch
r2 s3 cp "s3://tlsstress-obs-backups/aws/$DATE/rds-octopus_admin-<HHMMSSZ>.sql.gz" .
gunzip -c rds-octopus_admin-*.sql.gz | psql "postgresql://tlsstress_admin:…@<host>/octopus_admin?sslmode=require"

Reach private RDS to load: from an in-VPC bastion / the NAT instance, or an SSM port-forward (see observability RUNBOOK §DB). Use the tlsstress_admin credentials from Secrets Manager (database-url-rollback).
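
Picking "the latest (or nearest target time)" dump per database from the day's listing can be scripted. The sketch below assumes the `<HHMMSSZ>` part of the key is six digits followed by a literal Z, and that the listing is ordinary `aws s3 ls` output. latest_dump is a hypothetical helper name.

```shell
# latest_dump DB [HHMMSS] (hypothetical helper)
# stdin : `aws s3 ls` output for aws/<date>/
# stdout: the newest rds-<DB>-<HHMMSSZ>.sql.gz key taken at or before HHMMSS
#         (default: end of day); nothing if no dump matches.
latest_dump() {
  local db="$1" before="${2:-235959}"
  awk '{print $NF}' |
    grep -E "^rds-${db}-[0-9]{6}Z\.sql\.gz$" |
    sort |
    awk -v db="$db" -v before="$before" '
      { t = substr($0, length("rds-" db "-") + 1, 6) }   # the HHMMSS digits
      t <= before { pick = $0 }                          # keep the last match
      END { if (pick != "") print pick }'
}

# Usage:
#   key=$(r2 s3 ls "s3://tlsstress-obs-backups/aws/$DATE/" | latest_dump postgres)
#   r2 s3 cp "s3://tlsstress-obs-backups/aws/$DATE/$key" .
```

Passing an explicit time, for example `latest_dump postgres 100000`, selects the last dump taken before a known bad write.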

5.3 App images — rebuild from git (RPO 0)

No image backup needed. Trigger the deploy workflows (build natively on a GH x64 runner → ECR → App Runner): customer-app-deploy.yml, admin-console-deploy.yml.

5.4 Secrets — re-provision

Secret values live in Secrets Manager. aws/<date>/secrets-manifest.json (R2) is the checklist of names/ARNs to recreate. Full account-loss (SM gone too): restore from the optional encrypted secrets bundle if enabled — an age/gpg-encrypted export to r2://…/aws/<date>/secrets.age, decryptable only with the offline DR key (kept in a password manager, never in the estate). If the bundle is not enabled, secrets must be regenerated/rotated (most are: API tokens, app pepper, HMAC keys) — see the per-secret notes in .env.example and the service READMEs.

5.5 Verify AWS

App Runner services RUNNING + /api/health = 200 through Cloudflare; a smoke login; RDS reachable with expected row counts; CloudWatch alarms green.


5c. Restore — Auth0 tenant (SSO identity provider)

auth0/<date>/*.json is the tenant config (clients, connections, resource servers, actions/rules, roles, organizations, tenant settings, branding, prompts). Restore into a (recreated) Auth0 tenant — easiest with the Auth0 Deploy CLI:

r2 s3 cp --recursive "s3://tlsstress-obs-backups/auth0/$DATE/" "./auth0/"
# Map the JSON into a Deploy-CLI directory, then:
a0deploy import --config_file config.json --input_file ./auth0/   # or via Management API PATCH/POST
Then re-point the apps (admin/obs/customer-app) at the restored tenant's domain + client IDs/secrets (these live in Secrets Manager — *-auth0-*, grafana-auth0).

The export is config only (no user passwords — those are unexportable; users re-enrol or are migrated via Auth0's user-import). Client/connection secrets are included only if the M2M token has read:client_keys / connection-read scopes.

5b. Restore — GitHub repos · Stripe · TBI images

GitHub repos — each github/<repo>.bundle is a full mirror (every branch, tag, and commit). Restore:

r2 s3 cp "s3://tlsstress-obs-backups/github/AI_forSE.bundle" .
git clone AI_forSE.bundle AI_forSE         # working clone from the bundle
# re-publish to a fresh remote after account loss:
git clone --mirror AI_forSE.bundle && cd AI_forSE.git && git push --mirror <new-remote>
Stripe: stripe/<date>/*.json is a read-only export (records, not a live restore; Stripe is the system of record). Use it for audit/reconciliation, dispute evidence, or to re-create the catalog via the Stripe API from products.json / prices.json if the account is rebuilt. Live data is also recoverable from Stripe's own dashboard/exports.

TBI client images: tbi/* mirrors the S3 artifact bucket (immutable). Restore = copy back into a (recreated) S3 bucket, or serve directly:

r2 s3 cp --recursive "s3://tlsstress-obs-backups/tbi/" "./tbi/"
aws s3 cp --recursive ./tbi/ "s3://tlsstress-bootstrap-artifacts/" --profile tlsstress-prod


6. Full-estate recovery order (worst case: total loss)

  1. AWS infra (terraform apply) → VPC/RDS/App Runner/IAM/ECR exist.
  2. Secrets → re-provision Secrets Manager (§5.4).
  3. RDS data → restore from R2 dump or snapshot (§5.2).
  4. App images → CI build+deploy (§5.3).
  5. Cloudflare → DNS/WAF/Workers (§4) — points the world at the new infra.
  6. Hetzner obs box → rebuild + restore volumes (§3) — observability returns.
  7. Re-verify the backup pipeline itself: backup-r2 runs, R2 receives a fresh object, the healthchecks ping turns green.

7. DR drills (monthly restore test, the rest quarterly; a backup you haven't restored is a hope)

  • Restore test (non-prod): monthly, restore the latest obs-box/* volumes into a throwaway compose project and confirm Grafana shows the history.
  • RDS dump test: quarterly, gunzip | psql the latest rds-*.sql.gz into a scratch Postgres; check row counts of the ledger + customers tables.
  • CF config diff: quarterly, diff the live zone against the latest cloudflare/* export; investigate drift.
  • Tabletop: walk §6 end-to-end on paper; time each step; update RTO/RPO above.
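
For the RDS dump test, a cheap first gate before loading anything into a scratch Postgres is to confirm that the file is intact gzip and ends with the trailer comment pg_dump writes at the end of a plain-format dump. The sketch below uses dump_looks_complete as a hypothetical helper name; it complements the row-count check and does not replace it.

```shell
# dump_looks_complete FILE (hypothetical helper): succeeds only if FILE passes
# the gzip integrity test and its last lines contain the pg_dump trailer.
dump_looks_complete() {
  local f="$1"
  gzip -t "$f" 2>/dev/null || { echo "corrupt gzip: $f" >&2; return 1; }
  gunzip -c "$f" | tail -n 20 | grep -q 'PostgreSQL database dump complete' ||
    { echo "no pg_dump trailer (truncated dump?): $f" >&2; return 1; }
}

# Usage:
#   dump_looks_complete rds-postgres-<HHMMSSZ>.sql.gz && echo "ok to load"
```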

8. Backup health & monitoring (in the observability stack)

The backups are first-class observability citizens. backup_r2.py emits a Prometheus textfile (/textfile/backup_r2.prom, scraped by node-exporter):

Metric | Meaning
backup_r2_last_success_timestamp_seconds | freshness (recovery point)
backup_r2_success | 1 if the last run fully succeeded (now reflects any source failure, not just volumes)
backup_r2_total_bytes / backup_r2_limit_bytes | R2 disk usage vs the 10 GB free tier
backup_r2_source_objects / _bytes{source} | per-source breakdown
backup_r2_source_success{source} | 1 = last attempt OK, 0 = failed, per source (github/stripe/auth0/aws/cloudflare/tbi/obs-box)
backup_r2_source_last_success_timestamp_seconds{source} | per-source recovery point; a failing source stops advancing → alertable (no silent failure)
backup_r2_source_failures{source} | count of failed items in the last attempt
backup_r2_bucket_public | SECURITY: 1 if the bucket is publicly exposed
backup_r2_bucket_lock_enabled | SECURITY (anti-ransomware): 1 if R2 Bucket Lock immutability is active
backup_r2_lock_retention_days | how long every object stays immutable (no delete/overwrite)
  • Dashboard: Grafana “Backups & DR (R2)” (backups-dr) — last success, run result, R2 usage gauge, bucket-exposure, immutability + lock retention, per-source bytes/objects, usage over time, and a “Backup success & freshness per source” table (OK/FAILED + age per source: GitHub, Stripe, Auth0, AWS, CF, TBI, obs-box).
  • Alerts (prometheus/alerts.yml, group backup_dr → Alertmanager → Slack): R2BackupStale (no success >26h, critical), R2BackupDegraded (a source failed), R2UsageHigh (>85% of the free tier), R2BucketPublic (critical security), R2BucketLockDisabled (critical: immutability removed → backups deletable), R2SourceBackupFailed (a single source — e.g. GitHub — failed in the last run), R2SourceBackupStale (a source has not succeeded in >30h). These make a single-source failure non-silent (previously only volume failures flipped backup_r2_success). (The RDS dump is produced by the separate in-VPC Lambda — monitored via its own CloudWatch errors + healthchecks dead-man — so it is not a backup_r2_source_* series.)
  • Dead-man's switch: the service also pings healthchecks.io (/start + /<rc>), so even a totally dead box (no metrics) pages you (email + Slack).
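
As a manual cross-check of the R2BackupStale condition, for example while re-verifying the backup pipeline after a rebuild, the freshness metric can be read straight from the textfile. The sketch below assumes the file contains an unlabelled `backup_r2_last_success_timestamp_seconds <unix-seconds>` sample. backup_is_stale is a hypothetical helper, and its 26h default mirrors the alert threshold quoted above.

```shell
# backup_is_stale PROM_FILE [NOW_EPOCH] [MAX_AGE_SECONDS] (hypothetical helper)
# Succeeds (exit 0) when the last successful backup is older than MAX_AGE,
# or when the file or the metric is missing.
backup_is_stale() {
  local prom="$1" now="${2:-$(date +%s)}" max_age="${3:-93600}"   # 93600 s = 26 h
  [ -r "$prom" ] || return 0                                      # no file: treat as stale
  awk -v now="$now" -v max="$max_age" '
    $1 == "backup_r2_last_success_timestamp_seconds" { ts = $2; found = 1 }
    END { exit !(!found || now - ts > max) }' "$prom"
}

# Usage:
#   backup_is_stale /textfile/backup_r2.prom && echo "STALE: last success >26h ago"
```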

9. Security & access controls (backups must never be public or tampered with)

  • Private by default + verified: the R2 bucket has no public r2.dev managed domain and no custom domain; S3 access requires SigV4-signed requests (anonymous GET → HTTP 400). The backup_r2_bucket_public metric + R2BucketPublic alert continuously detect any accidental/malicious public exposure.
  • Encryption: R2 encrypts all objects at rest (AES-256); all transfers are TLS.
  • Least-privilege creds: the box uses a scoped Cloudflare R2 credential (kept in the box .env, chmod 600) and a dedicated read-only AWS IAM user (obs-backup-ro: S3 read on the artifact/state buckets + secretsmanager:ListSecrets only — it can read NO secret values). GitHub uses a read-only PAT; Stripe a restricted read-only key; Auth0 a read-only M2M app.
  • No secret values in R2: Secrets Manager values are NOT copied — only names/ARNs (the manifest). Secret values stay in AWS SM (the system of record).
  • Anti-ransomware (immutability) — LIVE: every object is protected by an R2 Bucket Lock retention rule (ransomware-retention, Age = LOCK_RETENTION_DAYS, default 7 days), so an attacker who steals the box's upload credential cannot delete or encrypt/overwrite any backup until it is 7 days old (verified: DELETE and PUT-overwrite of a fresh object both return ObjectLockedByBucketPolicy). R2's S3 API implements neither versioning nor Object Lock, so the rule is applied via the Cloudflare API (PUT /accounts/{acct}/r2/buckets/{bucket}/lock) by backup_r2.py on every run (idempotent). Because LOCK_RETENTION_DAYS < KEEP_DAILY (7 < 14), the age-based prune only ever removes objects whose lock has already expired. All writes are write-once (github bundles are dated; same-day re-runs HEAD-skip) so the lock never blocks a legitimate backup. The backup_r2_bucket_lock_enabled gauge + R2BucketLockDisabled alert detect tampering (the rule being removed).
  • Override / cleanup (operator): to remove or shorten retention (e.g. to delete a locked object early), an operator with an R2-write Cloudflare token can PUT an empty/looser ruleset to the /lock endpoint — the box's credential path re-applies the strict rule next run, so do this deliberately.

Appendix — restore one-liners

# latest available backup date
r2 s3 ls s3://tlsstress-obs-backups/obs-box/ | awk '{print $2}' | sort | tail -1

# total R2 usage (free-tier check)
r2 s3 ls --summarize --human-readable --recursive s3://tlsstress-obs-backups/ | tail -2

# pull an entire day's obs-box backup locally (repeat per prefix for cloudflare/, aws/, …)
r2 s3 cp --recursive "s3://tlsstress-obs-backups/obs-box/$DATE/" "./restore-$DATE/"