RUNBOOK — Cloud Observability Platform

Operational manual for obs.tlsstress.art. Architecture: HLD. Component details: LLD. How-to guides: GUIDES.

Access: ssh -i ~/.ssh/tlsstress_f2_hetzner root@89.167.3.1 · stack at /opt/obs · https://obs.tlsstress.art (landing) · …/grafana (UI).

0. Golden commands

cd /opt/obs
docker compose ps                         # status of all services
docker compose logs -f grafana            # tail one service
docker compose up -d                       # apply compose changes (idempotent)
docker compose up -d --force-recreate svc  # recreate one service (env/mount change)
docker compose kill -s SIGHUP prometheus   # hot-reload Prometheus rules/targets
docker compose restart vector              # restart (re-reads bind-mounted config)
grep GF_ADMIN_PASSWORD /opt/obs/.env       # local Grafana admin password

Config files are bind-mounted → a restart re-reads them. Env changes in docker-compose.yml/.env require up -d --force-recreate <svc>.
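
The restart-versus-recreate rule above can be sketched as a small helper. This is a minimal illustration, not part of the stack: the function name obs_apply_cmd is hypothetical, it only prints the command for review, and it does not cover the Prometheus SIGHUP hot-reload path.

```sh
# Hypothetical helper: print the compose command that applies a change.
# Rule from this runbook: bind-mounted config -> restart;
# docker-compose.yml or .env change -> up -d --force-recreate.
obs_apply_cmd() {
  changed_file=$1
  svc=$2
  case "$(basename "$changed_file")" in
    docker-compose.yml|.env)
      echo "docker compose up -d --force-recreate $svc" ;;
    *)
      echo "docker compose restart $svc" ;;
  esac
}

# Usage (prints only; run the printed command from /opt/obs after review):
#   obs_apply_cmd .env grafana
#   obs_apply_cmd vector.yaml vector
```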

1. Deploy / update from the repo

Source of truth is observability/cloud/ in the AI_forSE repo. To push a change:

# from a workstation with the repo + SSH key
cd AI_forSE/observability/cloud
KEY=~/.ssh/tlsstress_f2_hetzner ; BOX=root@89.167.3.1
# clean macOS junk first (tar/scp create ._* AppleDouble files)
find . -name '._*' -delete
scp -i $KEY <changed-file> $BOX:/opt/obs/<path>
ssh -i $KEY $BOX 'cd /opt/obs && docker compose up -d'   # or --force-recreate <svc>

Dashboards (grafana/dashboards/*.json) auto-provision within ~10s of landing on the box, so no restart is needed. Prometheus rules/targets: hot-reload with docker compose kill -s SIGHUP prometheus.

2. Health verification

# external
curl -sI https://obs.tlsstress.art/            # 200 (landing)
curl -sI https://obs.tlsstress.art/grafana/api/health   # 200

# internal (from the box, on the compose network)
NET=tlsstress-obs_default
for s in prometheus:9090/-/healthy loki:3100/ready tempo:3200/ready \
         vector:8686/health blackbox:9115/-/healthy grafana:3000/api/health; do
  docker run --rm --network $NET curlimages/curl:8.10.1 -s -o /dev/null -w "$s %{http_code}\n" http://$s
done

# data flowing
docker run --rm --network $NET curlimages/curl:8.10.1 -s \
  'http://prometheus:9090/api/v1/query?query=up' | python3 -m json.tool | grep -c '"value"'
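
The internal loop above prints one "target code" line per service. A minimal sketch for turning that output into a pass/fail result follows; the function name unhealthy_targets is hypothetical, and it assumes that any code other than 200 counts as unhealthy (curl reports 000 when it cannot connect).

```sh
# Hypothetical helper: read "target http_code" lines on stdin,
# print every line whose last field is not 200, and return 1 if any were found.
unhealthy_targets() {
  awk '$NF != 200 { print; bad = 1 } END { exit bad + 0 }'
}

# Usage: pipe the output of the internal health loop into it, e.g.
#   for s in ...; do docker run ... ; done | unhealthy_targets || echo "unhealthy targets found"
```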

3. Data-source activation (drop a key, recreate poller/grafana)

Source	Status	`.env` key(s)	Apply
Cloudflare	✅ live	`CF_API_TOKEN`	`docker compose up -d --force-recreate cloud-poller`
Stripe	✅ live	`STRIPE_API_KEY` (`rk_live_…` read-only)	`… cloud-poller`
AWS CloudWatch	✅ live	`AWS_RO_ACCESS_KEY_ID/SECRET` + `AWS_REGION`	`… grafana`
Auth0 SSO	✅ live	`GRAFANA_AUTH0_ENABLED/CLIENT_ID/SECRET`	`… grafana` (see SSO)
Plausible	🟡 deferred	`PLAUSIBLE_API_KEY` + `PLAUSIBLE_SITE_IDS`	`… cloud-poller`

Verify a poller source: docker compose exec -T cloud-poller sh -c 'grep _scrape_up /textfile/cloud.prom' (a value of 0 means the source is failing). Verify a Grafana datasource: curl -u admin:$PW https://obs.tlsstress.art/grafana/api/datasources/uid/<uid>/health, where $PW is the GF_ADMIN_PASSWORD value from /opt/obs/.env.
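
To list only the failing sources from the textfile, a filter along these lines may help. It is a sketch under assumptions: the function name failed_sources is hypothetical, the metric names in the usage comment are illustrative, and it assumes the Prometheus text format with the sample value in the last field (no timestamps).

```sh
# Hypothetical filter: read Prometheus text-format lines on stdin and print
# the name of every *_scrape_up metric whose value is 0.
failed_sources() {
  awk '!/^#/ && $1 ~ /_scrape_up/ && $NF == 0 { print $1 }'
}

# Usage on the box:
#   docker compose exec -T cloud-poller cat /textfile/cloud.prom | failed_sources
# Example input lines (names are illustrative only):
#   cloudflare_scrape_up 1
#   stripe_scrape_up 0
```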

3.1 Secret hand-off pattern (AWS Secrets Manager → box `.env`)

Secrets are passed without ever touching chat: the operator stores them in Secrets Manager, then they're pulled and piped to the box over SSH stdin.

# operator (once): aws secretsmanager create-secret --name tlsstress-obs/<x> --secret-string '...'
KEY=$(aws secretsmanager get-secret-value --profile tlsstress-prod --region us-east-1 \
  --secret-id tlsstress-obs/<x> --query SecretString --output text)
# the remote shell starts in root's home, so cd to the stack before calling compose
printf '%s' "$KEY" | ssh -i ~/.ssh/tlsstress_f2_hetzner root@89.167.3.1 \
  'k=$(cat); cd /opt/obs; ...write to /opt/obs/.env...; docker compose up -d --force-recreate <svc>'; unset KEY

Canonical secrets today: tlsstress-obs/stripe-ro-key, tlsstress-obs/grafana-auth0 (JSON {client_id,client_secret}), tlsstress-obs/aws-ro (JSON {access_key_id,secret_access_key,region}). IAM user for AWS: obs-cloudwatch-ro.
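
The "write to /opt/obs/.env" step in the hand-off is left open above. One possible shape is sketched here, assuming a hypothetical helper env_upsert that would have to exist on the box; it reads the value from stdin so the secret never appears in argv or shell history, and it assumes single-line values.

```sh
# Hypothetical helper: set KEY=value in an env file, value read from stdin.
# Replaces an existing assignment of the key, keeps all other lines,
# and leaves the file with mode 600.
env_upsert() {
  file=$1
  key=$2
  value=$(cat)
  tmp=$(mktemp "${file}.XXXXXX") || return 1
  chmod 600 "$tmp"
  if [ -f "$file" ]; then
    # grep -v exits 1 when nothing is left to print; that is not an error here.
    grep -v "^${key}=" "$file" > "$tmp" || true
  fi
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  # Rename over the original so readers never see a half-written file.
  mv "$tmp" "$file"
}

# Usage on the box, with the secret arriving on stdin:
#   env_upsert /opt/obs/.env STRIPE_API_KEY
```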

4. Backup & restore (DR)

State lives in named volumes (grafana_data, prometheus_data, loki_data, tempo_data, caddy_data). The VM is a SPOF — snapshot off-box.

4.1 On-demand backup

cd /opt/obs
ts=$(date -u +%Y%m%dT%H%M%SZ)
for v in grafana_data prometheus_data loki_data caddy_data; do
  docker run --rm -v tlsstress-obs_$v:/data -v /opt/obs/backups:/b alpine \
    tar czf /b/${v}_${ts}.tgz -C /data .
done
# then push /opt/obs/backups/*.tgz to object storage (S3/B2/R2) via rclone/restic

4.2 Restore a volume

docker compose stop <svc>
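# empty the volume first; find also removes dotfiles, which rm -rf /data/* would leave behind
docker run --rm -v tlsstress-obs_<vol>:/data -v /opt/obs/backups:/b alpine \
  sh -c 'find /data -mindepth 1 -delete && tar xzf /b/<vol>_<ts>.tgz -C /data'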
docker compose start <svc>
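
Choosing the <ts> for a restore usually means finding the newest archive for a volume. A minimal sketch, assuming the <vol>_<ts>.tgz naming from 4.1 and a hypothetical function name latest_backup:

```sh
# Hypothetical helper: print the newest backup archive for one volume.
# Timestamps in the form YYYYMMDDTHHMMSSZ sort lexicographically in time order,
# so a plain sort is enough.
latest_backup() {
  dir=$1
  vol=$2
  ls "$dir"/"${vol}"_*.tgz 2>/dev/null | sort | tail -n 1
}

# Usage:
#   latest_backup /opt/obs/backups grafana_data
```

An empty result means no archive exists for that volume in the directory.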

Scheduled and monitored: scripts/backup.sh runs daily at 03:30 UTC from host cron and pings a healthchecks.io check (BACKUP_HEALTHCHECK_URL in the cron line): /start when the run begins, then /<exit-code> when it ends, so a missed or failed backup pages you (cron monitoring). Check name: "tlsstress-obs backups" (tlsstress-obs/healthcheck-backup-url).
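
The /start then /<exit-code> ping pattern can be sketched as a generic wrapper. This is an illustration of the pattern, not the contents of scripts/backup.sh; the function name run_with_ping is hypothetical, and the curl timeout and retry values are assumptions.

```sh
# Hypothetical wrapper: ping <url>/start, run the command, then ping <url>/<exit-code>.
# Ping failures are ignored so that monitoring can never block the job itself.
run_with_ping() {
  url=$1
  shift
  curl -fsS -m 10 --retry 3 -o /dev/null "$url/start" || true
  # Capture the job's exit status without tripping `set -e`.
  "$@" && rc=0 || rc=$?
  # 0 marks the run as successful; any other status marks it as failed.
  curl -fsS -m 10 --retry 3 -o /dev/null "$url/$rc" || true
  return "$rc"
}

# Usage (cron line or shell):
#   run_with_ping "$BACKUP_HEALTHCHECK_URL" scripts/backup.sh
```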

TODO (standing): off-box copy — set RCLONE_REMOTE to an R2/S3 target (free tier) so backups survive box loss. Local snapshots already run.

5. Secret rotation

.env holds: GF_ADMIN_PASSWORD, INGEST_TOKEN, CF_API_TOKEN, STRIPE_API_KEY, AWS_RO_*, GRAFANA_AUTH0_CLIENT_SECRET.

cd /opt/obs
# 1) rotate at the provider (CF/Stripe/AWS/Auth0 dashboard or `aws iam`)
# 2) update .env (keep chmod 600)
chmod 600 .env
# 3) recreate the consumer
docker compose up -d --force-recreate cloud-poller   # CF/Stripe
docker compose up -d --force-recreate grafana        # AWS/Auth0/admin pw

Pending: the Cloudflare Analytics token was shared in chat — rotate it once dashboards are confirmed (create a fresh Zone-Analytics:Read token, swap, recreate poller).
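
After editing .env during a rotation, it can be worth confirming that no expected key was dropped or left empty before recreating the consumers. A minimal sketch, assuming a hypothetical function name missing_env_keys and plain KEY=value lines:

```sh
# Hypothetical check: print every listed key that is absent from the env file
# or present with an empty value.
missing_env_keys() {
  file=$1
  shift
  for key in "$@"; do
    # "^KEY=." requires at least one character after the equals sign.
    grep -q "^${key}=." "$file" || echo "$key"
  done
}

# Usage:
#   missing_env_keys /opt/obs/.env GF_ADMIN_PASSWORD INGEST_TOKEN CF_API_TOKEN \
#     STRIPE_API_KEY GRAFANA_AUTH0_CLIENT_SECRET
```

No output means every listed key has a value.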

6. Common incidents

Symptom	Likely cause	Action
`*_scrape_up=0` for CF/Stripe	token invalid/expired or API error	`docker compose logs cloud-poller` → fix key → recreate
Vector `Restarting (78)`	invalid `vector.yaml` (exit 78 = config)	`docker compose logs vector` → fix config → restart
Grafana 301 loop	`enforce_domain` w/o `domain`, or root_url/subpath mismatch	confirm `GF_SERVER_DOMAIN` + `SERVE_FROM_SUB_PATH=true`
Tempo `/ready` 503 at boot	startup settling (≤15s after ingester ready)	wait; persistent → `logs tempo`
ACME cert fails	DNS not grey-cloud, or 80 blocked	set Cloudflare DNS-only; open 80 in firewall
Disk filling	Loki/Prom retention or backups dir	check `df -h`; prune `/opt/obs/backups`; lower retention
Probe `probe_success=0` on F2	F2 serves only `/trigger` (404 on `/`)	expected — uses `http_edge_up` module
CloudWatch DS "missing default region"	`${VAR:-default}` not interpolated, or no `secureJsonData`	use `$AWS_REGION` (no `:-`) + `authType:keys` with `secureJsonData` keys
CloudWatch DS auth error	IAM key not propagated / wrong perms	wait ~15s after key creation; user needs `CloudWatchReadOnlyAccess`
AWS panels all "No data" (DS health OK, query API returns data)	panel targets missing builder fields	add `metricQueryType:0`+`metricEditorMode:0`+`queryMode:"Metrics"`+explicit `region` to every target; hard-refresh browser
Stripe/poller source not updating after key set	poller image runs a stale bind-mounted script	re-`scp exporters/cloud_poller.py` then `up -d --force-recreate cloud-poller`
Infra → Containers (cAdvisor) panels empty	cAdvisor can't resolve `name` (Docker containerd-snapshotter + cgroup v2)	keep cAdvisor v0.49.1 (not v0.52+) + run the `docker-names` sidecar; dashboard joins on `short_id` (see LLD §7)
`AppDBDown` (`/api/ready`=503)	app can't reach a DB — stale credential after RDS master rotation, or RDS down	see §6.1; `obs-db-selfheal` auto-redeploys — if it persists, check RDS + the app's DB credential
"DB indisponível" in the app but blackbox green	shallow `/api/health`=200 while DB down	the pre-2026-06-16 blind spot — now caught by `AppDBDown`; probe `/api/ready` (deep), not `/api/health`
`DashboardDBDown` / `DashboardDown`	operator dashboard's Postgres unreachable (`dashboard_db_up=0`), or the process/scrape is down (`up=0`)	DUT-side alert (not this box). Cockpit shows frozen data + a red "sem dados (DB)" live badge. Check the on-prem PgBouncer/Postgres; the dashboard `/api/ready` is the deep probe
`RDSConnectionsHigh` / `RDSStorageLow` / `RDSCPUHigh` / `RDSMemoryLow`	RDS saturation before it 503s (pool exhaustion, low disk, CPU/mem pressure)	from `aws-rds-metrics` (CloudWatch RO key). Thresholds tuned for `db.t4g.micro` — see §6.2. `RDSStorageCritical` (<1 GB) is paging-critical: grow storage now
`RDSMetricsBlind` (`rds_metrics_scrape_ok=0`)	`aws-rds-metrics` can't read CloudWatch (expired RO key / IAM / throttling)	RDS saturation alerts are silent until fixed — `docker compose logs aws-rds-metrics`; check the RO key in `.env`
Critical alert fired but nobody woke up	`critical` route is Slack-`@channel` only (no phone pager)	drop a PagerDuty routing key (or SMTP) into the `pager` receiver in `alertmanager.yml` (slot pre-built) — see §6.3
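
For the "Disk filling" row, pruning /opt/obs/backups can start from a dry-run listing of old archives. This is a sketch: the function name old_backups is hypothetical and the 14-day cutoff in the usage comment is an assumed value, not a policy stated in this runbook.

```sh
# Hypothetical helper: list .tgz archives in a directory that were last
# modified more than <days> days ago. It only prints; nothing is deleted.
old_backups() {
  dir=$1
  days=$2
  find "$dir" -maxdepth 1 -type f -name '*.tgz' -mtime +"$days" -print
}

# Usage: review the list first, then delete explicitly.
#   old_backups /opt/obs/backups 14
#   old_backups /opt/obs/backups 14 | xargs -r rm --
```

Make sure the off-box copy exists before deleting anything, since local archives are the only backup until RCLONE_REMOTE is set.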

6.1 DB unreachable ("DB indisponível") — RCA playbook

The 2026-06-16 admin outage: RDS managed master-secret rotation (every 7d) changed tlsstress_admin's password; the admin connected as that master via a static secret → auth failed → graceful 200 page → ~33h invisible (shallow health + no DB alert).

Durable fix (in place): each app connects as a dedicated non-master role (octopus_admin_app owns admin's octopus_admin; tlsstress_app for customer's postgres), so master rotation never breaks them. Never point an app's DATABASE_URL at the master tlsstress_admin. The master secret keeps rotating roughly every 7 days.

If AppDBDown fires:

1. curl -s https://<app>.tlsstress.art/api/ready | jq to see which DB check is ok:false.
2. obs-db-selfheal already redeploys the App Runner service on a 503 (re-fetches the secret + rebuilds the pool). Watch the deployment; if it recovers, done.
3. Persists? Verify RDS is available (aws rds describe-db-instances) and that the app's DB secret authenticates as its role (not the rotated master).
4. DDL on the private RDS → in-VPC via the NAT instance (i-0894…, SSM): temp SG ingress to the RDS SG sg-02bf33572b96f2855 + temp IAM read of the managed secret, remove both after. (Lambdas in sg-043effaa4b9c3bdac already reach RDS and are the preferred home.)
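
While watching a self-heal redeploy, polling /api/ready until it returns 200 saves repeated manual curls. A minimal sketch, assuming a hypothetical function name wait_ready and assumed defaults of 30 tries, 10 seconds apart:

```sh
# Hypothetical helper: poll a URL until it answers HTTP 200.
# Returns 0 on the first 200, or 1 after <tries> unsuccessful attempts.
wait_ready() {
  url=$1
  tries=${2:-30}
  delay=${3:-10}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # A connection failure yields a non-200 code and counts as "not ready".
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url" || true)
    if [ "$code" = "200" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage:
#   wait_ready "https://<app>.tlsstress.art/api/ready" && echo recovered
```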

6.2 RDS saturation (`RDS*` alerts)

The aws-rds-metrics exporter pulls AWS/RDS CloudWatch metrics (RO key, same as the cron monitor) → :9092/metrics → Prometheus. It pages on saturation before a 503 — the gap AppDBDown (which only fires once the DB is fully unreachable) couldn't see.

Thresholds are tuned for db.t4g.micro (~112 max connections, 1 GB RAM) — re-tune on resize (alerts.yml group rds): connections >90, free storage <2 GB (warn) / <1 GB (critical), CPU >85%, freeable memory <100 MB.

RDSConnectionsHigh → check PgBouncer / a connection leak; new connections refuse before /api/ready flips, so this fires first.
RDSStorageCritical (<1 GB) → RDS hard-stops writes when storage runs out; grow allocated storage now (aws rds modify-db-instance --allocated-storage).
RDSMetricsBlind → the exporter can't read CloudWatch; the other RDS* alerts are silent until fixed. docker compose logs aws-rds-metrics; verify the RO key in .env.
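
The db.t4g.micro thresholds above can be checked by hand against values read from the RDS console or CloudWatch. The sketch below only mirrors the numbers stated in this section; the function name rds_check is hypothetical, and the real rules in alerts.yml may add hold durations or fire the warning and critical storage alerts together.

```sh
# Hypothetical helper: print the alert names whose threshold is crossed.
# Arguments: connections, free storage (GB), CPU (%), freeable memory (MB).
rds_check() {
  conns=$1
  free_gb=$2
  cpu=$3
  mem_mb=$4
  awk -v c="$conns" -v s="$free_gb" -v p="$cpu" -v m="$mem_mb" 'BEGIN {
    if (c > 90)       print "RDSConnectionsHigh"
    if (s < 1)        print "RDSStorageCritical"
    else if (s < 2)   print "RDSStorageLow"
    if (p > 85)       print "RDSCPUHigh"
    if (m < 100)      print "RDSMemoryLow"
  }'
}

# Usage:
#   rds_check 95 0.5 50 500
```

Re-tune the constants together with alerts.yml when the instance class changes.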

6.3 Overnight paging for critical alerts

Alerts with severity critical route to the pager receiver (hourly re-page + Slack @channel). For guaranteed overnight phone paging, drop a credential into the pre-built slot in alertmanager.yml. The recommended $0, no-account path is ntfy.sh:

# 1) install the ntfy app, subscribe to a random topic (e.g. tlsstress-pg-$(openssl rand -hex 6))
# 2) wire it ($0, no account):
printf '%s' 'https://ntfy.sh/<your-random-topic>' > /opt/obs/alertmanager/secrets/ntfy_url
chmod 600 /opt/obs/alertmanager/secrets/ntfy_url
# 3) uncomment the webhook_configs (ntfy) block in the `pager` receiver, then:
docker compose restart alertmanager

Public ntfy topics are unauthenticated — use a hard-to-guess topic (or self-host ntfy / set a topic password) since alert text names hosts. Alternatives in the same slot: PagerDuty (free tier, needs an account → pagerduty_key) or e-mail (global.smtp_* + email_configs). The @channel Slack path works today; this only adds the out-of-band wake-up.
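
Before relying on the ntfy path, a direct test message confirms that the phone is subscribed to the right topic. A minimal sketch, assuming a hypothetical function name ntfy_test; it posts straight to the topic and therefore does not exercise the Alertmanager webhook itself.

```sh
# Hypothetical helper: publish a test message to the ntfy topic URL stored in a file.
# POSTing a plain body to the topic URL publishes it as a notification.
ntfy_test() {
  url_file=$1
  msg=${2:-"test page from tlsstress-obs"}
  curl -fsS -m 10 -d "$msg" "$(cat "$url_file")"
}

# Usage:
#   ntfy_test /opt/obs/alertmanager/secrets/ntfy_url "pager test"
```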

7. Scaling levers (single VM)

Metric retention: PROM_RETENTION (env) — raise if disk allows.
Log retention: loki/loki-config.yml retention_period.
Trace retention: tempo/tempo.yml block_retention (default 7d).
Poll cadence: POLL_INTERVAL_SECONDS (default 300; CF/Stripe are daily-ish data).
Vertical: resize the Hetzner type (cx23 → cx33) in the Hetzner console; reboot keeps volumes.
HA: out of scope today; see ADR-0105 for the 3-node path.
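
Before raising PROM_RETENTION, a rough disk estimate helps judge whether "disk allows". The sketch below uses the sizing rule from the Prometheus storage documentation (retention seconds × ingested samples per second × bytes per sample, with about 1 to 2 bytes per sample); the function name prom_disk_gb and the sample rate in the usage comment are assumptions, not measurements from this box.

```sh
# Hypothetical estimator: approximate Prometheus disk need in GB (decimal).
# Arguments: retention in days, ingested samples per second, bytes per sample (default 2).
prom_disk_gb() {
  days=$1
  sps=$2
  bps=${3:-2}
  awk -v d="$days" -v s="$sps" -v b="$bps" \
    'BEGIN { printf "%.1f\n", d * 86400 * s * b / 1e9 }'
}

# Usage: 30 days at an assumed 1000 samples/s
#   prom_disk_gb 30 1000
```

The current ingestion rate can be read from Prometheus itself with rate(prometheus_tsdb_head_samples_appended_total[1h]).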

8. Decommission

cd /opt/obs && docker compose down            # keep volumes
docker compose down -v                         # ALSO delete data (irreversible)
# then delete the Hetzner server + DNS record + firewall rule