
RUNBOOK — Cloud Observability Platform

Operational manual for obs.tlsstress.art. Architecture: HLD. Component details: LLD. How-to guides: GUIDES.

Access: ssh -i ~/.ssh/tlsstress_f2_hetzner root@89.167.3.1 · stack at /opt/obs · https://obs.tlsstress.art (landing) · …/grafana (UI).


0. Golden commands

cd /opt/obs
docker compose ps                         # status of all services
docker compose logs -f grafana            # tail one service
docker compose up -d                       # apply compose changes (idempotent)
docker compose up -d --force-recreate svc  # recreate one service (env/mount change)
docker compose kill -s SIGHUP prometheus   # hot-reload Prometheus rules/targets
docker compose restart vector              # restart (re-reads bind-mounted config)
grep GF_ADMIN_PASSWORD /opt/obs/.env       # local Grafana admin password

Config files are bind-mounted → a restart re-reads them. Env changes in docker-compose.yml/.env require up -d --force-recreate <svc>.


1. Deploy / update from the repo

Source of truth is observability/cloud/ in the AI_forSE repo. To push a change:

# from a workstation with the repo + SSH key
cd AI_forSE/observability/cloud
KEY=~/.ssh/tlsstress_f2_hetzner ; BOX=root@89.167.3.1
# clean macOS junk first (tar/scp create ._* AppleDouble files)
find . -name '._*' -delete
scp -i $KEY <changed-file> $BOX:/opt/obs/<path>
ssh -i $KEY $BOX 'cd /opt/obs && docker compose up -d'   # or --force-recreate <svc>

Dashboards (grafana/dashboards/*.json) auto-provision within ~10s of landing on the box — no restart needed. Prometheus rules/targets: kill -s SIGHUP.
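As a rough aid, the reload rules above can be folded into one helper that maps a changed path to the command it needs. This is a sketch, not part of the stack: reload_hint is a hypothetical name, the prometheus/ prefix is an assumed location for rules/targets, and the fallback assumes the top-level directory is named after the compose service.

```shell
# reload_hint PATH: print the command that applies a change to PATH
# (PATH is relative to /opt/obs). Hypothetical helper; adjust the
# patterns to the real repository layout.
reload_hint() {
  case "$1" in
    docker-compose.yml|.env)
      echo "docker compose up -d --force-recreate <svc>" ;;
    grafana/dashboards/*.json)
      echo "none (auto-provisioned)" ;;
    prometheus/*)                       # assumed location of rules/targets
      echo "docker compose kill -s SIGHUP prometheus" ;;
    exporters/cloud_poller.py)          # poller script needs a recreate
      echo "docker compose up -d --force-recreate cloud-poller" ;;
    */*)
      # assumption: the top-level directory is named after the service
      echo "docker compose restart ${1%%/*}" ;;
    *)
      echo "unknown: inspect manually" ;;
  esac
}
```

For example, reload_hint vector/vector.yaml prints docker compose restart vector.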


2. Health verification

# external
curl -sI https://obs.tlsstress.art/            # 200 (landing)
curl -sI https://obs.tlsstress.art/grafana/api/health   # 200

# internal (from the box, on the compose network)
NET=tlsstress-obs_default
for s in prometheus:9090/-/healthy loki:3100/ready tempo:3200/ready \
         vector:8686/health blackbox:9115/-/healthy grafana:3000/api/health; do
  docker run --rm --network $NET curlimages/curl:8.10.1 -s -o /dev/null -w "$s %{http_code}\n" http://$s
done

# data flowing
docker run --rm --network $NET curlimages/curl:8.10.1 -s \
  'http://prometheus:9090/api/v1/query?query=up' | python3 -m json.tool | grep -c '"value"'
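
The internal loop above prints one "<endpoint> <http_code>" line per service. A minimal sketch of turning that output into a pass/fail result follows; failed_endpoints is a hypothetical helper name, and it treats anything other than 200 (including curl's 000 for a connection failure) as unhealthy.

```shell
# failed_endpoints: read "<endpoint> <http_code>" lines on stdin,
# print those whose code is not 200, and return non-zero if any were found.
failed_endpoints() {
  awk '$2 != "200" { print; bad = 1 } END { exit bad }'
}

# usage: pipe the health loop into it, e.g.
#   for s in ...; do docker run ... ; done | failed_endpoints && echo "all healthy"
```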

3. Data-source activation (drop a key, recreate poller/grafana)

| Source | Status | .env key(s) | Apply |
|---|---|---|---|
| Cloudflare | ✅ live | CF_API_TOKEN | docker compose up -d --force-recreate cloud-poller |
| Stripe | ✅ live | STRIPE_API_KEY (rk_live_… read-only) | … cloud-poller |
| AWS CloudWatch | ✅ live | AWS_RO_ACCESS_KEY_ID/SECRET + AWS_REGION | … grafana |
| Auth0 SSO | ✅ live | GRAFANA_AUTH0_ENABLED/CLIENT_ID/SECRET | … grafana (see SSO) |
| Plausible | 🟡 deferred | PLAUSIBLE_API_KEY + PLAUSIBLE_SITE_IDS | … cloud-poller |

Verify a poller source: docker compose exec -T cloud-poller grep _scrape_up /textfile/cloud.prom. Verify a Grafana datasource: curl -u admin:$PW https://obs.tlsstress.art/grafana/api/datasources/uid/<uid>/health, where $PW is the GF_ADMIN_PASSWORD value from /opt/obs/.env.
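
To list only the sources that are currently failing, the same textfile can be filtered for *_scrape_up series with value 0. This is a sketch under the assumption that cloud.prom uses the standard Prometheus text format (name, optional labels, value); down_sources is a hypothetical helper name.

```shell
# down_sources: read Prometheus text-format lines on stdin and print the
# name of every *_scrape_up series whose value is 0.
down_sources() {
  awk '!/^#/ && $1 ~ /_scrape_up/ && $NF == 0 { print $1 }'
}

# usage:
#   docker compose exec -T cloud-poller cat /textfile/cloud.prom | down_sources
```

Empty output means every polled source reported success on its last scrape.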

3.1 Secret hand-off pattern (AWS Secrets Manager → box .env)

Secrets are passed without ever touching chat: the operator stores them in Secrets Manager, then they're pulled and piped to the box over SSH stdin.

# operator (once): aws secretsmanager create-secret --name tlsstress-obs/<x> --secret-string '...'
KEY=$(aws secretsmanager get-secret-value --profile tlsstress-prod --region us-east-1 \
  --secret-id tlsstress-obs/<x> --query SecretString --output text)
printf '%s' "$KEY" | ssh -i ~/.ssh/tlsstress_f2_hetzner root@89.167.3.1 \
  'k=$(cat); ...write to /opt/obs/.env...; docker compose up -d --force-recreate <svc>'; unset KEY

Canonical secrets today: tlsstress-obs/stripe-ro-key, tlsstress-obs/grafana-auth0 (JSON {client_id,client_secret}), tlsstress-obs/aws-ro (JSON {access_key_id,secret_access_key,region}). IAM user for AWS: obs-cloudwatch-ro.
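
The "write to /opt/obs/.env" step is left open above. One possible way to do it is an upsert that reads the value from stdin, so the secret never appears in a command line or in shell history. This is a sketch only: set_env_key is a hypothetical helper, not something the stack ships, and it assumes one KEY=value pair per line with no quoting.

```shell
# set_env_key FILE KEY: replace (or add) the KEY=... line in FILE with the
# value read from stdin. Writes a temp file next to FILE and renames it,
# keeping mode 600.
set_env_key() {
  file=$1 key=$2
  value=$(cat)
  tmp=$(mktemp "${file}.XXXXXX") || return 1
  chmod 600 "$tmp"
  # copy every line except an existing KEY= line (grep exits 1 if nothing is left)
  if [ -f "$file" ]; then
    grep -v "^${key}=" "$file" > "$tmp" || true
  fi
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  mv "$tmp" "$file"
}

# usage on the box, with the secret arriving on stdin:
#   set_env_key /opt/obs/.env STRIPE_API_KEY
```

For the SSH hand-off, the function would have to be defined on the box or inlined into the quoted remote command.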


4. Backup & restore (DR)

State lives in named volumes (grafana_data, prometheus_data, loki_data, tempo_data, caddy_data). The VM is a SPOF — snapshot off-box.

4.1 On-demand backup

cd /opt/obs
ts=$(date -u +%Y%m%dT%H%M%SZ)
# note: tempo_data is not in this list; add it if traces must survive a restore
for v in grafana_data prometheus_data loki_data caddy_data; do
  docker run --rm -v tlsstress-obs_$v:/data -v /opt/obs/backups:/b alpine \
    tar czf /b/${v}_${ts}.tgz -C /data .
done
# then push /opt/obs/backups/*.tgz to object storage (S3/B2/R2) via rclone/restic
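
Before pushing archives off-box it can be worth confirming that each one is readable. A minimal sketch, assuming the <volume>_<timestamp>.tgz naming used above; verify_backups is a hypothetical helper name.

```shell
# verify_backups DIR TS: check that every archive written with timestamp TS
# is a readable gzip'd tar. Prints one line per archive; returns non-zero on
# any unreadable archive or if none were found.
verify_backups() {
  dir=$1 ts=$2 rc=0 found=0
  for f in "$dir"/*_"$ts".tgz; do
    [ -e "$f" ] || continue          # glob matched nothing
    found=1
    if tar tzf "$f" >/dev/null 2>&1; then
      echo "ok   $f"
    else
      echo "BAD  $f"; rc=1
    fi
  done
  [ "$found" -eq 1 ] || rc=1
  return $rc
}

# usage, right after the backup loop:
#   verify_backups /opt/obs/backups "$ts"
```

This only proves the archive can be listed; it does not replace a periodic test restore.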

4.2 Restore a volume

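docker compose stop <svc>
# clear the volume (including dotfiles, which `rm -rf /data/*` would miss), then unpack
docker run --rm -v tlsstress-obs_<vol>:/data -v /opt/obs/backups:/b alpine \
  sh -c 'find /data -mindepth 1 -delete && tar xzf /b/<vol>_<ts>.tgz -C /data'
docker compose start <svc>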

Scheduled + monitored: scripts/backup.sh runs daily at 03:30 UTC from host cron and pings a healthchecks.io check (BACKUP_HEALTHCHECK_URL in the cron line): /start when the run begins, then /<exit-code> when it ends, so a missed or failed backup pages you via cron monitoring. The check is named "tlsstress-obs backups" (tlsstress-obs/healthcheck-backup-url).
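
The /start then /<exit-code> pattern can be sketched as a small wrapper around any command. This illustrates the pattern only and is not necessarily how scripts/backup.sh implements it; ping_hc and run_with_healthcheck are hypothetical names.

```shell
# ping_hc URL: best-effort ping; a monitoring outage must never fail the job.
ping_hc() {
  curl -fsS -m 10 --retry 3 -o /dev/null "$1" || true
}

# run_with_healthcheck URL CMD [ARGS...]: signal /start, run CMD, then
# report its exit code as /<exit-code> (0 = success, anything else = failure).
run_with_healthcheck() {
  hc_url=$1; shift
  ping_hc "$hc_url/start"
  "$@" && hc_rc=0 || hc_rc=$?
  ping_hc "$hc_url/$hc_rc"
  return "$hc_rc"
}

# usage:
#   run_with_healthcheck "$BACKUP_HEALTHCHECK_URL" /opt/obs/scripts/backup.sh
```

Because the start signal is sent before the job runs, the check can flag a run that started but never reported back, as well as one that never started.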

TODO (standing): off-box copy — set RCLONE_REMOTE to an R2/S3 target (free tier) so backups survive box loss. Local snapshots already run.


5. Secret rotation

.env holds: GF_ADMIN_PASSWORD, INGEST_TOKEN, CF_API_TOKEN, STRIPE_API_KEY, AWS_RO_*, GRAFANA_AUTH0_CLIENT_SECRET.

cd /opt/obs
# 1) rotate at the provider (CF/Stripe/AWS/Auth0 dashboard or `aws iam`)
# 2) update .env (keep chmod 600)
chmod 600 .env
# 3) recreate the consumer
docker compose up -d --force-recreate cloud-poller   # CF/Stripe
docker compose up -d --force-recreate grafana        # AWS/Auth0/admin pw
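
Before recreating a consumer, a quick check that the expected keys are present and non-empty in .env can catch a botched edit. A minimal sketch; missing_env_keys is a hypothetical helper, and the key names are passed by the caller rather than hard-coded.

```shell
# missing_env_keys FILE KEY...: print each KEY that is absent from FILE or
# set to an empty value; return non-zero if any were printed.
missing_env_keys() {
  env_file=$1; shift
  rc=0
  for k in "$@"; do
    if ! grep -Eq "^${k}=.+" "$env_file"; then
      echo "$k"; rc=1
    fi
  done
  return $rc
}

# usage:
#   missing_env_keys /opt/obs/.env GF_ADMIN_PASSWORD INGEST_TOKEN CF_API_TOKEN STRIPE_API_KEY
```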

Pending: the Cloudflare Analytics token was shared in chat — rotate it once dashboards are confirmed (create a fresh Zone-Analytics:Read token, swap, recreate poller).


6. Common incidents

| Symptom | Likely cause | Action |
|---|---|---|
| *_scrape_up=0 for CF/Stripe | token invalid/expired or API error | docker compose logs cloud-poller → fix key → recreate |
| Vector Restarting (78) | invalid vector.yaml (exit 78 = config) | docker compose logs vector → fix config → restart |
| Grafana 301 loop | enforce_domain w/o domain, or root_url/subpath mismatch | confirm GF_SERVER_DOMAIN + SERVE_FROM_SUB_PATH=true |
| Tempo /ready 503 at boot | startup settling (≤15s after ingester ready) | wait; persistent → logs tempo |
| ACME cert fails | DNS not grey-cloud, or 80 blocked | set Cloudflare DNS-only; open 80 in firewall |
| Disk filling | Loki/Prom retention or backups dir | check df -h; prune /opt/obs/backups; lower retention |
| Probe probe_success=0 on F2 | F2 serves only /trigger (404 on /) | expected: uses http_edge_up module |
| CloudWatch DS "missing default region" | ${VAR:-default} not interpolated, or no secureJsonData | use $AWS_REGION (no :-) + authType:keys with secureJsonData keys |
| CloudWatch DS auth error | IAM key not propagated / wrong perms | wait ~15s after key creation; user needs CloudWatchReadOnlyAccess |
| AWS panels all "No data" (DS health OK, query API returns data) | panel targets missing builder fields | add metricQueryType:0+metricEditorMode:0+queryMode:"Metrics"+explicit region to every target; hard-refresh browser |
| Stripe/poller source not updating after key set | poller image runs a stale bind-mounted script | re-scp exporters/cloud_poller.py then up -d --force-recreate cloud-poller |
| Infra → Containers (cAdvisor) panels empty | cAdvisor can't resolve name (Docker containerd-snapshotter + cgroup v2) | keep cAdvisor v0.49.1 (not v0.52+) + run the docker-names sidecar; dashboard joins on short_id (see LLD §7) |
| AppDBDown (/api/ready=503) | app can't reach a DB: stale credential after RDS master rotation, or RDS down | see §6.1; obs-db-selfheal auto-redeploys; if it persists, check RDS + the app's DB credential |
| "DB indisponível" in the app but blackbox green | shallow /api/health=200 while DB down | the pre-2026-06-16 blind spot, now caught by AppDBDown; probe /api/ready (deep), not /api/health |
| DashboardDBDown / DashboardDown | operator dashboard's Postgres unreachable (dashboard_db_up=0), or the process/scrape is down (up=0) | DUT-side alert (not this box). Cockpit shows frozen data + a red "sem dados (DB)" live badge. Check the on-prem PgBouncer/Postgres; the dashboard /api/ready is the deep probe |
| RDSConnectionsHigh / RDSStorageLow / RDSCPUHigh / RDSMemoryLow | RDS saturation before it 503s (pool exhaustion, low disk, CPU/mem pressure) | from aws-rds-metrics (CloudWatch RO key). Thresholds tuned for db.t4g.micro; see §6.2. RDSStorageCritical (<1 GB) is paging-critical: grow storage now |
| RDSMetricsBlind (rds_metrics_scrape_ok=0) | aws-rds-metrics can't read CloudWatch (expired RO key / IAM / throttling) | RDS saturation alerts are silent until fixed: docker compose logs aws-rds-metrics; check the RO key in .env |
| Critical alert fired but nobody woke up | critical route is Slack-@channel only (no phone pager) | drop a PagerDuty routing key (or SMTP) into the pager receiver in alertmanager.yml (slot pre-built); see §6.3 |

6.1 DB unreachable ("DB indisponível") — RCA playbook

The 2026-06-16 admin outage: RDS managed master-secret rotation (every 7d) changed tlsstress_admin's password. The admin app connected as that master user via a static secret, so auth failed → the app served a graceful 200 page → the outage stayed invisible for ~33h (shallow health check + no DB alert).

Durable fix (in place): each app connects as a dedicated non-master role (octopus_admin_app owns admin's octopus_admin; tlsstress_app for customer's postgres), so master rotation never breaks them. Never point an app's DATABASE_URL at the master tlsstress_admin; its password rotates again roughly every 7 days.

If AppDBDown fires:

  1. Run curl -s https://<app>.tlsstress.art/api/ready | jq and see which DB check reports ok:false.
  2. obs-db-selfheal already redeploys the App Runner service on a 503 (re-fetches the secret + rebuilds the pool). Watch the deployment; if it recovers, done.
  3. Persists? Verify RDS available (aws rds describe-db-instances) + that the app's DB secret authenticates as its role (not the rotated master).
  4. DDL on the private RDS → in-VPC via the NAT instance (i-0894…, SSM): temp SG ingress to the RDS SG sg-02bf33572b96f2855 + temp IAM read of the managed secret, remove both after. (Lambdas in sg-043effaa4b9c3bdac already reach RDS; that is the preferred home.)
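
While the self-heal redeploy runs, recovery can be watched by polling the deep probe until it returns 200. A minimal sketch: wait_ready and http_code are hypothetical helper names, and the default of 30 checks at 10 s intervals is an arbitrary assumption, not a documented deployment time.

```shell
# http_code URL: print the HTTP status code for URL (000 on connection failure).
http_code() {
  curl -s -o /dev/null -m 10 -w '%{http_code}' "$1"
}

# wait_ready URL [TRIES] [DELAY_SECONDS]: poll URL until it answers 200.
# Returns 0 on recovery, 1 if it is still failing after TRIES checks.
wait_ready() {
  url=$1 tries=${2:-30} delay=${3:-10} i=1
  while [ "$i" -le "$tries" ]; do
    code=$(http_code "$url") || true
    if [ "$code" = "200" ]; then
      echo "ready after $i check(s)"
      return 0
    fi
    echo "check $i/$tries: HTTP $code" >&2
    i=$((i + 1))
    if [ "$i" -le "$tries" ]; then sleep "$delay"; fi
  done
  return 1
}

# usage:
#   wait_ready "https://<app>.tlsstress.art/api/ready" || echo "still down: continue with step 3"
```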


6.2 RDS saturation (RDS* alerts)

The aws-rds-metrics exporter pulls AWS/RDS CloudWatch metrics (RO key, same as the cron monitor) → :9092/metrics → Prometheus. It pages on saturation before a 503 — the gap AppDBDown (which only fires once the DB is fully unreachable) couldn't see.

Thresholds are tuned for db.t4g.micro (~112 max connections, 1 GB RAM) — re-tune on resize (alerts.yml group rds): connections >90, free storage <2 GB (warn) / <1 GB (critical), CPU >85%, freeable memory <100 MB.

  • RDSConnectionsHigh → check PgBouncer / a connection leak; new connections are refused before /api/ready flips, so this fires first.
  • RDSStorageCritical (<1 GB) → RDS hard-stops writes when storage runs out; grow allocated storage now (aws rds modify-db-instance --allocated-storage).
  • RDSMetricsBlind → the exporter can't read CloudWatch; the other RDS* alerts are silent until fixed. docker compose logs aws-rds-metrics; verify the RO key in .env.
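
For a quick manual check of current readings against the thresholds listed above, the comparison can be sketched as a small function. It mirrors the documented numbers only and is not the rule expressions in alerts.yml; rds_alerts is a hypothetical name, and reporting only the most severe storage alert is a simplification that may differ from how the real rules fire.

```shell
# rds_alerts CONNECTIONS FREE_STORAGE_GB CPU_PCT FREEABLE_MEM_MB
# Print the alert names whose documented threshold is crossed (one per line).
rds_alerts() {
  awk -v conn="$1" -v free_gb="$2" -v cpu="$3" -v mem_mb="$4" 'BEGIN {
    if (conn > 90)         print "RDSConnectionsHigh"
    if (free_gb < 1)       print "RDSStorageCritical"
    else if (free_gb < 2)  print "RDSStorageLow"
    if (cpu > 85)          print "RDSCPUHigh"
    if (mem_mb < 100)      print "RDSMemoryLow"
  }'
}

# usage: rds_alerts 95 1.5 20 512
```

With ~112 max connections on db.t4g.micro, the >90 threshold corresponds to roughly 80% of the connection limit.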

6.3 Wire a phone pager (critical escalation)

critical alerts route to the pager receiver (hourly re-page + Slack @channel). For guaranteed overnight phone paging, drop a credential into the pre-built slot in alertmanager.yml. Recommended $0 / no-account path = ntfy.sh:

# 1) install the ntfy app, subscribe to a random topic (e.g. tlsstress-pg-$(openssl rand -hex 6))
# 2) wire it ($0, no account):
printf '%s' 'https://ntfy.sh/<your-random-topic>' > /opt/obs/alertmanager/secrets/ntfy_url
chmod 600 /opt/obs/alertmanager/secrets/ntfy_url
# 3) uncomment the webhook_configs (ntfy) block in the `pager` receiver, then:
docker compose restart alertmanager

Public ntfy topics are unauthenticated — use a hard-to-guess topic (or self-host ntfy / set a topic password) since alert text names hosts. Alternatives in the same slot: PagerDuty (free tier, needs an account → pagerduty_key) or e-mail (global.smtp_* + email_configs). The @channel Slack path works today; this only adds the out-of-band wake-up.
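
Once the topic is subscribed on the phone, it can be exercised directly by publishing a message to the stored URL (ntfy accepts the message as the HTTP POST body). This sketch tests only the phone side, not the Alertmanager route; send_test_page and post_body are hypothetical helper names.

```shell
# post_body URL TEXT: publish TEXT to an ntfy topic URL as the POST body.
post_body() {
  curl -fsS -m 10 -d "$2" "$1" >/dev/null
}

# send_test_page FILE: read the topic URL from FILE and publish a test message.
send_test_page() {
  url=$(cat "$1") || return 1
  case "$url" in
    https://*) ;;
    *) echo "refusing: $1 does not contain an https URL" >&2; return 1 ;;
  esac
  post_body "$url" "test page from $(uname -n) at $(date -u +%Y-%m-%dT%H:%M:%SZ)"
}

# usage:
#   send_test_page /opt/obs/alertmanager/secrets/ntfy_url
```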


7. Scaling levers (single VM)

  • Metric retention: PROM_RETENTION (env) — raise if disk allows.
  • Log retention: loki/loki-config.yml retention_period.
  • Trace retention: tempo/tempo.yml block_retention (default 7d).
  • Poll cadence: POLL_INTERVAL_SECONDS (default 300; CF/Stripe are daily-ish data).
  • Vertical: resize the Hetzner type (cx23 → cx33) in the Hetzner console; reboot keeps volumes.
  • HA: out of scope today; see ADR-0105 for the 3-node path.
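
Before raising any retention setting, it helps to check how full the disk already is. A minimal sketch using POSIX df -P; disk_used_pct and has_headroom are hypothetical helper names, and the 70% figure in the usage line is an arbitrary example, not a documented limit.

```shell
# disk_used_pct PATH: print the used percentage (integer, no % sign) of the
# filesystem that holds PATH.
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# has_headroom PATH MAX_PCT: succeed if usage is at or below MAX_PCT.
has_headroom() {
  [ "$(disk_used_pct "$1")" -le "$2" ]
}

# usage (Docker's default data root holds the named volumes):
#   has_headroom /var/lib/docker 70 || echo "disk above 70%: do not raise retention"
```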

8. Decommission

cd /opt/obs && docker compose down            # keep volumes
docker compose down -v                         # ALSO delete data (irreversible)
# then delete the Hetzner server + DNS record + firewall rule