ADR-0106: Self-hosted cloud observability stack (off-AWS, low-cost)

Status: Accepted — scaffold in observability/cloud/; deployed to a dedicated Hetzner box.
Date: 2026-06-14
Driver: Need full-stack visibility/health of the cloud control plane — signin/signup, app.tlsstress.art (customer-app), admin.tlsstress.art (admin-console), AWS infra, Cloudflare edge, Plausible — without paying for managed OpenSearch (~$54/mo, deleted in the cost-pause) or raising provider spend (Cloudflare stays Free; Plausible stays on its current plan).
Relates to: cost-pause runbook (OpenSearch deleted), ADR-0102 (telemetry was OTel→OpenSearch), ADR-0105 (Hetzner-cell ops pattern reused).

Context

The on-prem stress fleet already has Prometheus + Grafana. The cloud side lost its telemetry sink when OpenSearch was deleted (OTel now falls back to a console exporter). We need cloud-side observability that is modern, complete, and cheap, pulling from sources that each expose a free read API:

AWS → CloudWatch (metrics + logs) — read-only.
Cloudflare (Free) → GraphQL Analytics API (Logpush is Enterprise-only, so raw edge logs are out; aggregates are enough).
Plausible (Cloud) → Stats API (included in the existing plan).
The apps → already emit structured JSON logs to CloudWatch + OTLP traces.

Decision

Self-host the Grafana + Prometheus + Loki + Tempo stack (metrics/logs/traces) plus blackbox exporter (synthetics) and node-exporter/cAdvisor (host + containers), behind Caddy auto-TLS, on a dedicated Hetzner box (NOT the F2 cell: RAM + blast-radius isolation). Everything is provisioned-as-code in observability/cloud/ (docker-compose + configs + datasource/dashboard provisioning + a stdlib poller for CF/Plausible).

Data ingress, each independent + fail-soft:

Metrics: Grafana CloudWatch datasource (read-only key, pull); no metric streaming cost beyond GetMetricData. Host/container via node/cadvisor.
Edge + product: cloud_poller.py pulls CF GraphQL + Plausible Stats on an interval → Prometheus textfile (node-exporter). A source with no token is skipped (*_scrape_up=0), never breaks the stack.
Logs: CloudWatch Logs → subscription filter → Kinesis Firehose (HTTP) → Caddy /ingest → Vector → Loki. Cloudflare Logpush wired but inert on Free.
Traces: Tempo OTLP; point the apps' OTEL_EXPORTER_OTLP_ENDPOINT here to restore tracing (+ span-metrics → Prometheus for RED).
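The fail-soft behaviour of the edge + product path can be sketched roughly as follows. This is a minimal illustration, not the actual cloud_poller.py: the function names (collect_source, render_textfile, write_atomic), the metric names and the choice to also report *_scrape_up=0 when a fetch raises are assumptions made for the example. Only the "no token means skipped with *_scrape_up=0" rule comes from this ADR.

```python
import os
import tempfile
from typing import Callable, Dict, Optional

# Hypothetical collector type: takes an API token, returns metric name -> value.
Collector = Callable[[str], Dict[str, float]]


def collect_source(name: str, token: Optional[str], fetch: Collector) -> Dict[str, float]:
    """Collect one source without ever raising into the caller."""
    if not token:
        # No token configured: skip the source and report it as down.
        return {f"{name}_scrape_up": 0.0}
    try:
        metrics = dict(fetch(token))
    except Exception:
        # Assumption: a failing API is treated like a missing token.
        return {f"{name}_scrape_up": 0.0}
    metrics[f"{name}_scrape_up"] = 1.0
    return metrics


def render_textfile(metrics: Dict[str, float]) -> str:
    """Render metrics as 'name value' lines in Prometheus text format."""
    lines = [f"{key} {value}" for key, value in sorted(metrics.items())]
    return "\n".join(lines) + "\n"


def write_atomic(path: str, content: str) -> None:
    """Write via a temp file + rename so a scrape never reads a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    # mkstemp creates the file 0600; make it readable by the exporter's user.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)
```

A caller would merge the dictionaries from every source, render them once and write a single .prom file into the directory that node-exporter's textfile collector watches. The rename step matters because the collector only picks up files ending in .prom, so the temporary file is never scraped.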

Dashboards follow the golden signals / RED / USE convention + uptime/cert + the business funnel (Plausible). Alerting: SiteDown, TLSCertExpiringSoon, HighLatency, host CPU/mem/disk saturation.
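The arithmetic behind an alert like TLSCertExpiringSoon can be illustrated in a few lines. In the stack itself the check is presumably a Prometheus rule over the blackbox exporter's certificate-expiry metric rather than Python; the sketch below only shows the threshold logic. The 14-day threshold and both function names are assumptions, not values taken from this ADR.

```python
import ssl
import time
from typing import Optional

# Hypothetical warning window; the real alert threshold is not stated here.
EXPIRY_WARNING_DAYS = 14


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter time.

    not_after uses the OpenSSL text form, e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400.0


def cert_expiring_soon(
    not_after: str,
    now: Optional[float] = None,
    threshold_days: float = EXPIRY_WARNING_DAYS,
) -> bool:
    """True when the certificate expires within the warning window."""
    return days_until_expiry(not_after, now) < threshold_days
```

Passing now explicitly keeps the function deterministic in tests; in normal use it defaults to the current time.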

Consequences

Cost: providers $0 (CF Free, Plausible included); AWS ~$3–10/mo (GetMetricData, tunable via refresh interval); Hetzner ~€5/mo (cx22). ≈$8–15/mo total vs ~$54/mo managed OpenSearch.
Ops you own: upgrades, retention, snapshots of loki_data/prometheus_data to object storage (durability — single-node SPOF, same lesson as ADR-0105).
Security: only Grafana (TLS+auth) + /ingest (bearer) are public; Prometheus/Loki/Tempo are internal. Sensitive sink → lock down (VPN/IP-allowlist).
No app code change to start (synthetics + CloudWatch + CF/Plausible pull work immediately); logs + traces need the Firehose + OTEL-endpoint wiring.

Alternatives rejected

Re-enable managed OpenSearch — the ~$54/mo we're avoiding pre-revenue.
Cloudflare Logpush for raw edge logs — Enterprise-only (multi-$k/mo).
Co-locate on the F2 cell — RAM pressure + blast radius; a SIEM/observability box must be separate from what it observes.

Addendum: 2026-06-14 (deployed + extended)

Deployed to Hetzner cx23 89.167.3.1 (hel1). Beyond the original scope:

Stripe revenue monitoring — cloud_poller pulls balance, MRR (monthly-normalized), active subscriptions, daily charges (succeeded/failed), volume, new customers and open disputes (restricted read-only key). New Business — Revenue & Growth dashboard + business alert rules.
Two more dashboards — Infra — Host & Containers (USE) and SLO — Availability & Error Budget (error-budget burn, p50/p95/p99).
Auth0 SSO — Grafana generic-OAuth against the admin.tlsstress.art tenant (dev-jddxc2f5cqlktay1.us.auth0.com); shared tenant session → no re-auth. Gated until the Auth0 app + 2 env values are set.
Branded landing — tlsstress.art identity at /; Grafana moved to /grafana (serve_from_sub_path).
Plausible — kept wired but off (API needs a paid plan); activates when the product is profitable.
Full doc suite — docs/observability/: HLD, LLD, RUNBOOK, SSO-AUTH0, GUIDES.
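The "MRR (monthly-normalized)" figure in the Stripe item above means that yearly, quarterly and weekly plans are converted to a per-month amount before summing. A minimal sketch of that normalization follows. The flat dictionary shape, the field names and the decision to count only status == "active" are assumptions for illustration; the real poller may read Stripe objects differently.

```python
from typing import Iterable, Mapping

# Length of one billing interval unit, expressed in months (approximate for
# day and week, which do not divide evenly into calendar months).
MONTHS_PER_INTERVAL = {
    "day": 12 / 365,
    "week": 12 / 52,
    "month": 1.0,
    "year": 12.0,
}


def monthly_amount(unit_amount: int, interval: str,
                   interval_count: int = 1, quantity: int = 1) -> float:
    """Convert one recurring charge (in cents) to its per-month equivalent."""
    period_months = MONTHS_PER_INTERVAL[interval] * interval_count
    return unit_amount * quantity / period_months


def mrr_cents(subscriptions: Iterable[Mapping]) -> float:
    """Sum the monthly-normalized amounts of active subscriptions, in cents."""
    return sum(
        monthly_amount(
            sub["unit_amount"],
            sub["interval"],
            sub.get("interval_count", 1),
            sub.get("quantity", 1),
        )
        for sub in subscriptions
        if sub["status"] == "active"
    )
```

For example, a 12000-cent yearly plan contributes 1000 cents per month, and a 3000-cent plan billed every 3 months also contributes 1000. Working in integer cents until the final division avoids accumulating float error on the inputs.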

Activation status (2026-06-14, all LIVE in prod)

Cloudflare, Stripe, Auth0 SSO and AWS CloudWatch are live. Secrets are handed off via AWS Secrets Manager (tlsstress-obs/*) → pulled and piped to the box over SSH (never via chat). AWS uses a dedicated read-only IAM user obs-cloudwatch-ro (CloudWatchReadOnlyAccess); the AWS — Infra dashboard (aws-infra) is the 5th board. CloudWatch datasource: authType: keys + secureJsonData + defaultRegion: $AWS_REGION (provisioning does not interpolate ${VAR:-default}). Still pending: logs pipeline (Firehose→Loki), traces (App Runner OTLP→Tempo), Plausible (deferred until profitable).
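Because provisioning does not interpolate the shell-style ${VAR:-default} form, one possible workaround is to resolve such defaults before the file reaches Grafana. The helper below is a hypothetical sketch of that pre-render step and is not part of the deployed stack, which uses the plain $AWS_REGION form instead. The function name and the idea of pre-rendering are assumptions.

```python
import re
from typing import Mapping

# Matches ${NAME:-default}; plain $NAME and ${NAME} are left untouched.
_DEFAULT_EXPR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*):-([^}]*)\}")


def resolve_defaults(template: str, env: Mapping[str, str]) -> str:
    """Replace ${NAME:-default} with env[NAME], or default if unset or empty."""

    def substitute(match: "re.Match[str]") -> str:
        name, default = match.group(1), match.group(2)
        value = env.get(name)
        # Same rule as the shell: an empty value also falls back to the default.
        return value if value else default

    return _DEFAULT_EXPR.sub(substitute, template)
```

Called with os.environ as env, this turns a line such as defaultRegion: ${AWS_REGION:-us-east-1} into a literal region. The us-east-1 default here is only an example value.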

Cost unchanged (~€5/mo): Stripe/CF/CloudWatch APIs are free-tier; Plausible deferred.