RUNBOOK — Cloud Observability Platform
Operational manual for obs.tlsstress.art. Architecture: HLD. Component details: LLD. How-to guides: GUIDES.
Access: ssh -i ~/.ssh/tlsstress_f2_hetzner root@89.167.3.1 · stack at /opt/obs · https://obs.tlsstress.art (landing) · …/grafana (UI).
0. Golden commands
cd /opt/obs
docker compose ps # status of all services
docker compose logs -f grafana # tail one service
docker compose up -d # apply compose changes (idempotent)
docker compose up -d --force-recreate svc # recreate one service (env/mount change)
docker compose kill -s SIGHUP prometheus # hot-reload Prometheus rules/targets
docker compose restart vector # restart (re-reads bind-mounted config)
grep GF_ADMIN_PASSWORD /opt/obs/.env # local Grafana admin password
Config files are bind-mounted → a restart re-reads them. Env changes in docker-compose.yml/.env require up -d --force-recreate <svc>.
1. Deploy / update from the repo
Source of truth is observability/cloud/ in the AI_forSE repo. To push a change:
# from a workstation with the repo + SSH key
cd AI_forSE/observability/cloud
KEY=~/.ssh/tlsstress_f2_hetzner ; BOX=root@89.167.3.1
# clean macOS junk first (tar/scp create ._* AppleDouble files)
find . -name '._*' -delete
scp -i $KEY <changed-file> $BOX:/opt/obs/<path>
ssh -i $KEY $BOX 'cd /opt/obs && docker compose up -d' # or --force-recreate <svc>
Dashboards (grafana/dashboards/*.json) auto-provision within ~10s of landing
on the box — no restart needed. Prometheus rules/targets: kill -s SIGHUP.
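The reload rules from sections 0 and 1 can be condensed into a small lookup. This is a sketch only: reload_action is a hypothetical helper that does not exist in the repo, and both the prometheus/ directory name and the fallback rule "top-level directory name equals compose service name" are assumptions about the layout under /opt/obs.

```shell
# Hypothetical helper: print what to run after copying <path>
# (relative to /opt/obs) onto the box. Directory layout is assumed.
reload_action() {
  case "$1" in
    grafana/dashboards/*.json)
      echo "none (auto-provisioned)" ;;                         # picked up within ~10s
    prometheus/*)
      echo "docker compose kill -s SIGHUP prometheus" ;;        # hot reload
    exporters/cloud_poller.py)
      echo "docker compose up -d --force-recreate cloud-poller" ;;
    docker-compose.yml|.env)
      echo "docker compose up -d --force-recreate <svc>" ;;     # env/mount change
    */*)
      echo "docker compose restart ${1%%/*}" ;;                 # assumed: dir == service
    *)
      echo "unknown path: $1" >&2; return 1 ;;
  esac
}
```

For example, `reload_action loki/loki-config.yml` prints `docker compose restart loki`. Treat the output as a reminder of which command applies, not as something to pipe straight into a shell.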
2. Health verification
# external
curl -sI https://obs.tlsstress.art/ # 200 (landing)
curl -sI https://obs.tlsstress.art/grafana/api/health # 200
# internal (from the box, on the compose network)
NET=tlsstress-obs_default
for s in prometheus:9090/-/healthy loki:3100/ready tempo:3200/ready \
vector:8686/health blackbox:9115/-/healthy grafana:3000/api/health; do
docker run --rm --network $NET curlimages/curl:8.10.1 -s -o /dev/null -w "$s %{http_code}\n" http://$s
done
# data flowing
docker run --rm --network $NET curlimages/curl:8.10.1 -s \
'http://prometheus:9090/api/v1/query?query=up' | python3 -m json.tool | grep -c '"value"'
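The internal loop above prints one "<target> <http_code>" line per service, so a filter can reduce it to failures only. A minimal sketch, assuming that output format; unhealthy is a hypothetical name. curl reports 000 when it cannot connect at all, which this filter also treats as a failure.

```shell
# Read "<target> <http_code>" lines on stdin, print every line whose
# code is not 200, and exit non-zero if at least one was found.
unhealthy() {
  awk '$NF != "200" { print; bad = 1 } END { exit bad }'
}
```

Usage: append `| unhealthy && echo "all healthy"` after the `done` of the loop. An empty output with exit status 0 means every endpoint answered 200.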
3. Data-source activation (drop a key, recreate poller/grafana)
| Source | Status | .env key(s) | Apply |
|---|---|---|---|
| Cloudflare | ✅ live | CF_API_TOKEN | docker compose up -d --force-recreate cloud-poller |
| Stripe | ✅ live | STRIPE_API_KEY (rk_live_… read-only) | … cloud-poller |
| AWS CloudWatch | ✅ live | AWS_RO_ACCESS_KEY_ID/SECRET + AWS_REGION | … grafana |
| Auth0 SSO | ✅ live | GRAFANA_AUTH0_ENABLED/CLIENT_ID/SECRET | … grafana (see SSO) |
| Plausible | 🟡 deferred | PLAUSIBLE_API_KEY + PLAUSIBLE_SITE_IDS | … cloud-poller |
Verify a poller source: docker compose exec -T cloud-poller sh -c 'cat /textfile/cloud.prom | grep _scrape_up'.
Verify a Grafana datasource: curl -u admin:$PW https://obs.tlsstress.art/grafana/api/datasources/uid/<uid>/health.
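To see only the poller sources that are down, the textfile output can be filtered for `*_scrape_up` series with value 0. A sketch, assuming the file is in plain Prometheus text format with one "name value" pair per line; scrape_down is a hypothetical name and the metric names in the usage note are invented for illustration.

```shell
# Print the name of every *_scrape_up series whose value is 0.
# Comment lines (# HELP / # TYPE) are skipped.
scrape_down() {
  awk '!/^#/ && $1 ~ /_scrape_up/ && $NF + 0 == 0 { print $1 }'
}
```

Usage: `docker compose exec -T cloud-poller cat /textfile/cloud.prom | scrape_down`. With an input containing `cloudflare_scrape_up 1` and `stripe_scrape_up 0`, only `stripe_scrape_up` is printed.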
3.1 Secret hand-off pattern (AWS Secrets Manager → box .env)
Secrets are passed without ever touching chat: the operator stores them in Secrets Manager, then they're pulled and piped to the box over SSH stdin.
# operator (once): aws secretsmanager create-secret --name tlsstress-obs/<x> --secret-string '...'
KEY=$(aws secretsmanager get-secret-value --profile tlsstress-prod --region us-east-1 \
--secret-id tlsstress-obs/<x> --query SecretString --output text)
printf '%s' "$KEY" | ssh -i ~/.ssh/tlsstress_f2_hetzner root@89.167.3.1 \
'k=$(cat); ...write to /opt/obs/.env...; docker compose up -d --force-recreate <svc>'; unset KEY
Secrets in use: tlsstress-obs/stripe-ro-key, tlsstress-obs/grafana-auth0
(JSON {client_id,client_secret}), tlsstress-obs/aws-ro (JSON {access_key_id,
secret_access_key,region}). IAM user for AWS: obs-cloudwatch-ro.
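The "write to /opt/obs/.env" step is left open in the command above. One possible shape for it is an upsert that reads the value from stdin, so the secret never appears in argv or shell history. This is a sketch under assumptions, not the script actually used: set_env_key is a hypothetical helper, and it assumes simple KEY=value lines without quoting.

```shell
# Hypothetical helper: upsert NAME=<stdin> into an env file.
# usage: printf '%s' "$value" | set_env_key NAME /path/to/.env
set_env_key() {
  name=$1; file=$2
  value=$(cat)                                  # secret arrives on stdin only
  tmp=$(mktemp "${file}.XXXXXX") || return 1    # same directory, so mv is atomic
  chmod 600 "$tmp"
  if [ -f "$file" ]; then
    grep -v "^${name}=" "$file" > "$tmp" || true   # drop any old value, keep the rest
  fi
  printf '%s=%s\n' "$name" "$value" >> "$tmp"
  mv "$tmp" "$file"                             # file ends up with mode 600
}
```

On the box this would be followed by the `docker compose up -d --force-recreate <svc>` for the consumer of that key.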
4. Backup & restore (DR)
State lives in named volumes (grafana_data, prometheus_data, loki_data,
tempo_data, caddy_data). The VM is a SPOF — snapshot off-box.
4.1 On-demand backup
cd /opt/obs
ts=$(date -u +%Y%m%dT%H%M%SZ)
for v in grafana_data prometheus_data loki_data caddy_data; do
docker run --rm -v tlsstress-obs_$v:/data -v /opt/obs/backups:/b alpine \
tar czf /b/${v}_${ts}.tgz -C /data .
done
# then push /opt/obs/backups/*.tgz to object storage (S3/B2/R2) via rclone/restic
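Before an archive is pushed off-box or relied on for a restore, it is worth confirming that it is readable end to end. A minimal sketch; verify_archive is a hypothetical helper, not part of scripts/backup.sh as far as this runbook states.

```shell
# Print OK/EMPTY/CORRUPT for one .tgz; non-zero exit on anything but OK.
# tar tzf reads the full archive, so a truncated gzip stream is caught.
verify_archive() {
  f=$1
  [ -s "$f" ] || { echo "EMPTY $f"; return 1; }
  tar tzf "$f" > /dev/null 2>&1 || { echo "CORRUPT $f"; return 1; }
  echo "OK $f"
}
```

Usage: `for f in /opt/obs/backups/*_"$ts".tgz; do verify_archive "$f"; done` right after the backup loop.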
4.2 Restore a volume
docker compose stop <svc>
docker run --rm -v tlsstress-obs_<vol>:/data -v /opt/obs/backups:/b alpine \
sh -c 'find /data -mindepth 1 -delete && tar xzf /b/<vol>_<ts>.tgz -C /data' # find also clears dotfiles, which rm -rf /data/* would leave behind
docker compose start <svc>
Scheduled + monitored: scripts/backup.sh runs daily 03:30 UTC (host cron)
and pings a healthchecks.io check (BACKUP_HEALTHCHECK_URL in the cron line)
— /start then /<exit-code>, so a missed or failed backup pages you (cron
monitoring). Check: "tlsstress-obs backups" (tlsstress-obs/healthcheck-backup-url).
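The /start then /<exit-code> pattern can be sketched as a wrapper around any command. This is an illustration, not the contents of scripts/backup.sh: run_monitored, hc_ping and the HC_DRY_RUN switch are hypothetical names, and the curl flags are one reasonable choice among several.

```shell
# Ping a healthchecks.io-style URL; with HC_DRY_RUN set, only print it.
hc_ping() {
  if [ -n "${HC_DRY_RUN:-}" ]; then
    echo "ping $1"
  else
    curl -fsS -m 10 --retry 3 -o /dev/null "$1"
  fi
}

# usage: run_monitored <check-url> <command> [args...]
run_monitored() {
  url=$1; shift
  hc_ping "$url/start" || true   # a failed ping must not block the job
  rc=0
  "$@" || rc=$?                  # run the job and capture its exit code
  hc_ping "$url/$rc" || true     # 0 = success, anything else = failure
  return "$rc"
}
```

A cron line could then read `run_monitored "$BACKUP_HEALTHCHECK_URL" /opt/obs/scripts/backup.sh`, assuming that is where the script lives. If the job never starts, no ping arrives and the check goes late; if it fails, the non-zero code marks it down.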
TODO (standing): off-box copy: set RCLONE_REMOTE to an R2/S3 target (free tier) so backups survive box loss. Local snapshots already run.
5. Secret rotation
.env holds: GF_ADMIN_PASSWORD, INGEST_TOKEN, CF_API_TOKEN,
STRIPE_API_KEY, AWS_RO_*, GRAFANA_AUTH0_CLIENT_SECRET.
cd /opt/obs
# 1) rotate at the provider (CF/Stripe/AWS/Auth0 dashboard or `aws iam`)
# 2) update .env (keep chmod 600)
chmod 600 .env
# 3) recreate the consumer
docker compose up -d --force-recreate cloud-poller # CF/Stripe
docker compose up -d --force-recreate grafana # AWS/Auth0/admin pw
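After editing .env it helps to confirm the file mode and that no expected key was dropped or left empty before recreating anything. A sketch; check_env is a hypothetical helper and the key list is passed by the caller.

```shell
# usage: check_env <env-file> KEY...
# Reports a wrong file mode and every key that is absent or has an empty value.
check_env() {
  file=$1; shift
  rc=0
  if [ -z "$(find "$file" -maxdepth 0 -perm 600 2>/dev/null)" ]; then
    echo "MODE $file is not 600"; rc=1
  fi
  for k in "$@"; do
    if ! grep -q "^${k}=." "$file" 2>/dev/null; then
      echo "MISSING $k"; rc=1
    fi
  done
  return "$rc"
}
```

Usage: `check_env /opt/obs/.env GF_ADMIN_PASSWORD INGEST_TOKEN CF_API_TOKEN STRIPE_API_KEY GRAFANA_AUTH0_CLIENT_SECRET`. No output and exit status 0 means the file looks sane; it does not prove the values are valid at the provider.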
Pending: the Cloudflare Analytics token was shared in chat — rotate it once dashboards are confirmed (create a fresh Zone-Analytics:Read token, swap, recreate poller).
6. Common incidents
| Symptom | Likely cause | Action |
|---|---|---|
| `*_scrape_up=0` for CF/Stripe | token invalid/expired or API error | docker compose logs cloud-poller → fix key → recreate |
| Vector Restarting (78) | invalid vector.yaml (exit 78 = config) | docker compose logs vector → fix config → restart |
| Grafana 301 loop | enforce_domain w/o domain, or root_url/subpath mismatch | confirm GF_SERVER_DOMAIN + SERVE_FROM_SUB_PATH=true |
| Tempo /ready 503 at boot | startup settling (≤15s after ingester ready) | wait; persistent → logs tempo |
| ACME cert fails | DNS not grey-cloud, or 80 blocked | set Cloudflare DNS-only; open 80 in firewall |
| Disk filling | Loki/Prom retention or backups dir | check df -h; prune /opt/obs/backups; lower retention |
| Probe probe_success=0 on F2 | F2 serves only /trigger (404 on /) | expected: uses http_edge_up module |
| CloudWatch DS "missing default region" | ${VAR:-default} not interpolated, or no secureJsonData | use $AWS_REGION (no :-) + authType:keys with secureJsonData keys |
| CloudWatch DS auth error | IAM key not propagated / wrong perms | wait ~15s after key creation; user needs CloudWatchReadOnlyAccess |
| AWS panels all "No data" (DS health OK, query API returns data) | panel targets missing builder fields | add metricQueryType:0+metricEditorMode:0+queryMode:"Metrics"+explicit region to every target; hard-refresh browser |
| Stripe/poller source not updating after key set | poller image runs a stale bind-mounted script | re-scp exporters/cloud_poller.py then up -d --force-recreate cloud-poller |
| Infra → Containers (cAdvisor) panels empty | cAdvisor can't resolve name (Docker containerd-snapshotter + cgroup v2) | keep cAdvisor v0.49.1 (not v0.52+) + run the docker-names sidecar; dashboard joins on short_id (see LLD §7) |
| AppDBDown (/api/ready=503) | app can't reach a DB: stale credential after RDS master rotation, or RDS down | see §6.1; obs-db-selfheal auto-redeploys; if it persists, check RDS + the app's DB credential |
| "DB indisponível" (DB unavailable) in the app but blackbox green | shallow /api/health=200 while DB down | the pre-2026-06-16 blind spot, now caught by AppDBDown; probe /api/ready (deep), not /api/health |
| DashboardDBDown / DashboardDown | operator dashboard's Postgres unreachable (dashboard_db_up=0), or the process/scrape is down (up=0) | DUT-side alert (not this box). Cockpit shows frozen data + a red "sem dados (DB)" (no data) live badge. Check the on-prem PgBouncer/Postgres; the dashboard /api/ready is the deep probe |
| RDSConnectionsHigh / RDSStorageLow / RDSCPUHigh / RDSMemoryLow | RDS saturation before it 503s (pool exhaustion, low disk, CPU/mem pressure) | from aws-rds-metrics (CloudWatch RO key). Thresholds tuned for db.t4g.micro; see §6.2. RDSStorageCritical (<1 GB) is paging-critical: grow storage now |
| RDSMetricsBlind (rds_metrics_scrape_ok=0) | aws-rds-metrics can't read CloudWatch (expired RO key / IAM / throttling) | RDS saturation alerts are silent until fixed: docker compose logs aws-rds-metrics; check the RO key in .env |
| Critical alert fired but nobody woke up | critical route is Slack-@channel only (no phone pager) | drop a PagerDuty routing key (or SMTP) into the pager receiver in alertmanager.yml (slot pre-built); see §6.3 |
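For the "Disk filling" row, pruning /opt/obs/backups can be made predictable by keeping the newest N archives per volume. The sketch below relies on the ${v}_${ts}.tgz naming from §4.1, where the UTC timestamp sorts lexically in chronological order. prune_candidates is a hypothetical helper and only lists files; it deletes nothing.

```shell
# usage: prune_candidates <backup-dir> <keep>
# Lists, per volume, every archive older than the newest <keep>.
prune_candidates() {
  dir=$1; keep=$2
  for v in grafana_data prometheus_data loki_data caddy_data; do
    set -- "$dir/${v}"_*.tgz
    [ -e "$1" ] || continue        # glob matched nothing for this volume
    printf '%s\n' "$@" | sort -r | tail -n +"$((keep + 1))"
  done
}
```

Review the list first (`prune_candidates /opt/obs/backups 7`), then remove the listed files once it looks right.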
6.1 DB unreachable ("DB indisponível") — RCA playbook
The 2026-06-16 admin outage: RDS managed master-secret rotation (every 7d) changed
tlsstress_admin's password; the admin connected as that master via a static secret
→ auth failed → graceful 200 page → ~33h invisible (shallow health + no DB alert).
Durable fix (in place): each app connects as a dedicated non-master role
(octopus_admin_app owns admin's octopus_admin; tlsstress_app for customer's
postgres) → master rotation never breaks them. Never point an app's DATABASE_URL
at the master tlsstress_admin. The master secret keeps rotating roughly every 7 days.
If AppDBDown fires:
1. curl -s https://<app>.tlsstress.art/api/ready | jq — see which DB check is ok:false.
2. obs-db-selfheal already redeploys the App Runner service on a 503 (re-fetches the
secret + rebuilds the pool). Watch the deployment; if it recovers, done.
3. Persists? Verify RDS available (aws rds describe-db-instances) + that the app's DB
secret authenticates as its role (not the rotated master).
4. DDL on the private RDS → in-VPC via the NAT instance (i-0894…, SSM): temp SG ingress
to the RDS SG sg-02bf33572b96f2855 + temp IAM read of the managed secret, remove
both after. (Lambdas in sg-043effaa4b9c3bdac already reach RDS — preferred home.)
6.2 RDS saturation (RDS* alerts)
The aws-rds-metrics exporter pulls AWS/RDS CloudWatch metrics (RO key, same as the cron
monitor) → :9092/metrics → Prometheus. It pages on saturation before a 503 — the gap
AppDBDown (which only fires once the DB is fully unreachable) couldn't see.
Thresholds are tuned for db.t4g.micro (~112 max connections, 1 GB RAM) — re-tune on
resize (alerts.yml group rds): connections >90, free storage <2 GB (warn) / <1 GB
(critical), CPU >85%, freeable memory <100 MB.
- RDSConnectionsHigh → check PgBouncer / a connection leak; new connections are refused before /api/ready flips, so this fires first.
- RDSStorageCritical (<1 GB) → RDS hard-stops writes when storage runs out; grow allocated storage now (aws rds modify-db-instance --allocated-storage).
- RDSMetricsBlind → the exporter can't read CloudWatch; the other RDS* alerts are silent until fixed. docker compose logs aws-rds-metrics; verify the RO key in .env.
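For a quick manual spot check, the thresholds quoted above can be applied to current readings. This sketch only mirrors the numbers in this section; the real rules live in alerts.yml and may add durations or label filters. rds_alerts is a hypothetical helper, the mapping of alert names to thresholds is inferred from the names, inputs are whole numbers, and 1 GB is taken as 1024 MB.

```shell
# usage: rds_alerts <connections> <free_storage_mb> <cpu_percent> <freeable_mem_mb>
# Prints the names of the alerts these values would trip, space-separated.
rds_alerts() {
  conns=$1; free_mb=$2; cpu=$3; mem_mb=$4
  out=""
  if [ "$conns" -gt 90 ]; then out="$out RDSConnectionsHigh"; fi
  if [ "$free_mb" -lt 1024 ]; then
    out="$out RDSStorageCritical"          # <1 GB: paging-critical
  elif [ "$free_mb" -lt 2048 ]; then
    out="$out RDSStorageLow"               # <2 GB: warning
  fi
  if [ "$cpu" -gt 85 ]; then out="$out RDSCPUHigh"; fi
  if [ "$mem_mb" -lt 100 ]; then out="$out RDSMemoryLow"; fi
  echo "${out# }"
}
```

For example, `rds_alerts 95 1500 50 200` prints `RDSConnectionsHigh RDSStorageLow`. Re-tune the constants together with alerts.yml when the instance is resized.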
6.3 Wire a phone pager (critical escalation)
critical alerts route to the pager receiver (hourly re-page + Slack @channel). For
guaranteed overnight phone paging, drop a credential into the pre-built slot in
alertmanager.yml. Recommended $0 / no-account path = ntfy.sh:
# 1) install the ntfy app, subscribe to a random topic (e.g. tlsstress-pg-$(openssl rand -hex 6))
# 2) wire it ($0, no account):
printf '%s' 'https://ntfy.sh/<your-random-topic>' > /opt/obs/alertmanager/secrets/ntfy_url
chmod 600 /opt/obs/alertmanager/secrets/ntfy_url
# 3) uncomment the webhook_configs (ntfy) block in the `pager` receiver, then:
docker compose restart alertmanager
Public ntfy topics are unauthenticated — use a hard-to-guess topic (or self-host ntfy / set a
topic password) since alert text names hosts. Alternatives in the same slot: PagerDuty
(free tier, needs an account → pagerduty_key) or e-mail (global.smtp_* + email_configs).
The @channel Slack path works today; this only adds the out-of-band wake-up.
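Before restarting Alertmanager, the secret file written in step 2 can be sanity-checked. A sketch; check_ntfy_secret is a hypothetical helper, and rejecting a trailing newline is an assumption based on the file being written with printf '%s'.

```shell
# usage: check_ntfy_secret <path-to-ntfy_url>
# Prints "ok" or the first problem found; non-zero exit on any problem.
check_ntfy_secret() {
  f=$1
  [ -f "$f" ] || { echo "missing"; return 1; }
  [ -n "$(find "$f" -maxdepth 0 -perm 600)" ] || { echo "mode is not 600"; return 1; }
  case "$(cat "$f")" in
    https://*/?*) ;;                        # https://<host>/<topic>
    *) echo "not an https topic URL"; return 1 ;;
  esac
  [ "$(wc -l < "$f")" -eq 0 ] || { echo "contains a newline"; return 1; }
  echo "ok"
}
```

Once it prints ok, a test page can be sent by hand with `curl -d 'test page from obs' "$(cat /opt/obs/alertmanager/secrets/ntfy_url)"`; the phone subscribed to that topic should buzz.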
7. Scaling levers (single VM)
- Metric retention: PROM_RETENTION (env); raise if disk allows.
- Log retention: loki/loki-config.yml retention_period.
- Trace retention: tempo/tempo.yml block_retention (default 7d).
- Poll cadence: POLL_INTERVAL_SECONDS (default 300; CF/Stripe are daily-ish data).
- Vertical: resize the Hetzner type (cx23 → cx33) in the Hetzner console; reboot keeps volumes.
- HA: out of scope today; see ADR-0105 for the 3-node path.
8. Decommission
cd /opt/obs && docker compose down # keep volumes
docker compose down -v # ALSO delete data (irreversible)
# then delete the Hetzner server + DNS record + firewall rule