ADR-0106: Self-hosted cloud observability stack (off-AWS, low-cost)
- Status: Accepted. Scaffold in observability/cloud/; deployed to a dedicated Hetzner box.
- Date: 2026-06-14
- Driver: Need full-stack visibility into the health of the cloud control plane (signin/signup, app.tlsstress.art (customer-app), admin.tlsstress.art (admin-console), AWS infra, Cloudflare edge, Plausible) without paying for managed OpenSearch (~$54/mo, deleted in the cost-pause) or raising provider spend (Cloudflare stays Free; Plausible stays on its current plan).
- Relates to: cost-pause runbook (OpenSearch deleted), ADR-0102 (telemetry was OTel→OpenSearch), ADR-0105 (Hetzner-cell ops pattern reused).
Context
The on-prem stress fleet already has Prometheus + Grafana. The cloud side lost its telemetry sink when OpenSearch was deleted (OTel now falls back to a console exporter). We need cloud-side observability that is modern, complete, and cheap, pulling from sources that each expose a free read API:
- AWS → CloudWatch (metrics + logs) — read-only.
- Cloudflare (Free) → GraphQL Analytics API (Logpush is Enterprise-only, so raw edge logs are out; aggregates are enough).
- Plausible (Cloud) → Stats API (included in the existing plan).
- The apps → already emit structured JSON logs to CloudWatch + OTLP traces.
Decision
Self-host the Grafana + Prometheus + Loki + Tempo stack (metrics/logs/traces)
plus blackbox (synthetics) and node/cadvisor (host), behind Caddy auto-TLS, on a
dedicated Hetzner box (NOT the F2 cell — RAM + blast-radius isolation).
Everything is provisioned-as-code in observability/cloud/ (docker-compose +
configs + datasource/dashboard provisioning + a stdlib poller for CF/Plausible).
Data ingress, each independent + fail-soft:
- Metrics: Grafana CloudWatch datasource (read-only key, pull) — no metric
streaming cost beyond GetMetricData. Host/container via node/cadvisor.
- Edge + product: cloud_poller.py pulls CF GraphQL + Plausible Stats on an
interval → Prometheus textfile (node-exporter). A source with no token is
skipped (*_scrape_up=0), never breaks the stack.
- Logs: CloudWatch Logs → subscription filter → Kinesis Firehose (HTTP) →
Caddy /ingest → Vector → Loki. Cloudflare Logpush wired but inert on Free.
- Traces: Tempo OTLP; point the apps' OTEL_EXPORTER_OTLP_ENDPOINT here to
restore tracing (+ span-metrics → Prometheus for RED).
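The fail-soft poller behaviour described above can be sketched as follows. This is a minimal illustration, not the actual cloud_poller.py: the function names, the `.prom` path and the choice to also report `*_scrape_up=0` on a fetch error are assumptions. The only behaviour taken from this ADR is that a source without a token is skipped and reported as `*_scrape_up=0`, and that output goes to the node-exporter textfile collector.

```python
import os
import tempfile
from typing import Callable, Dict, Optional

# Hypothetical type: a fetcher takes an API token and returns metric name -> value.
Fetcher = Callable[[str], Dict[str, float]]


def collect(sources: Dict[str, Fetcher],
            tokens: Dict[str, Optional[str]]) -> Dict[str, float]:
    """Poll each source independently so one failure never affects the others."""
    metrics: Dict[str, float] = {}
    for name, fetch in sources.items():
        token = tokens.get(name)
        if not token:
            # No token configured: skip the source and report it as down.
            metrics[f"{name}_scrape_up"] = 0.0
            continue
        try:
            values = fetch(token)
        except Exception:
            # Assumption: API errors are treated the same way as a missing token.
            metrics[f"{name}_scrape_up"] = 0.0
            continue
        metrics.update(values)
        metrics[f"{name}_scrape_up"] = 1.0
    return metrics


def render(metrics: Dict[str, float]) -> str:
    """Render metrics in the Prometheus text exposition format, one sample per line."""
    return "".join(f"{name} {value}\n" for name, value in sorted(metrics.items()))


def write_textfile(path: str, body: str) -> None:
    """Write atomically so node-exporter never reads a half-written file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(body)
    os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; the exporter may run as another user
    os.replace(tmp_path, path)
```

Writing to a temporary file and renaming it keeps each scrape consistent, since the textfile collector only picks up files ending in `.prom`.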
Dashboards follow the golden signals / RED / USE convention + uptime/cert + the business funnel (Plausible). Alerting: SiteDown, TLSCertExpiringSoon, HighLatency, host CPU/mem/disk saturation.
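As a rough illustration of the TLSCertExpiringSoon logic, the sketch below compares a certificate's expiry timestamp with the current time. The 14-day threshold and the function names are assumptions, not values taken from the real alert rules. In the stack itself this check would normally be an alert expression over the blackbox exporter's `probe_ssl_earliest_cert_expiry` metric rather than Python.

```python
SECONDS_PER_DAY = 86400.0


def days_until_expiry(expiry_unix: float, now_unix: float) -> float:
    """Days left before the certificate expires (negative once expired)."""
    return (expiry_unix - now_unix) / SECONDS_PER_DAY


def cert_expiring_soon(expiry_unix: float, now_unix: float,
                       threshold_days: float = 14.0) -> bool:
    """True when the certificate expires within threshold_days (assumed default: 14)."""
    return days_until_expiry(expiry_unix, now_unix) < threshold_days
```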
Consequences
- Cost: providers $0 (CF Free, Plausible included); AWS ~$3–10/mo (GetMetricData, tunable via refresh interval); Hetzner ~€5/mo (cx22). ≈$8–15/mo total vs ~$54/mo managed OpenSearch.
- Ops you own: upgrades, retention, snapshots of loki_data/prometheus_data to object storage (durability: single-node SPOF, same lesson as ADR-0105).
- Security: only Grafana (TLS+auth) + /ingest (bearer) are public; Prometheus/Loki/Tempo are internal. Sensitive sink → lock down (VPN/IP-allowlist).
- No app code change to start (synthetics + CloudWatch + CF/Plausible pull work immediately); logs + traces need the Firehose + OTEL-endpoint wiring.
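The bearer check on the public /ingest endpoint can be sketched as below. This only illustrates the comparison itself. In this design the check is enforced at the Caddy layer, and the header name, token format and function name used here are assumptions.

```python
import hmac
from typing import Optional


def is_authorized(authorization_header: Optional[str], expected_token: str) -> bool:
    """Validate an 'Authorization: Bearer <token>' header against the configured token."""
    if not authorization_header or not expected_token:
        # Reject when either side is missing, so an unset secret never grants access.
        return False
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest runs in constant time, which avoids leaking the token via timing.
    return hmac.compare_digest(presented.strip().encode(), expected_token.encode())
```

Failing closed when the expected token is empty matters here, because a misconfigured secret would otherwise expose the log sink.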
Alternatives rejected
- Re-enable managed OpenSearch — the ~$54/mo we're avoiding pre-revenue.
- Cloudflare Logpush for raw edge logs — Enterprise-only (multi-$k/mo).
- Co-locate on the F2 cell — RAM pressure + blast radius; a SIEM/observability box must be separate from what it observes.
Addendum: 2026-06-14 (deployed + extended)
Deployed to Hetzner cx23 89.167.3.1 (hel1). Beyond the original scope:
- Stripe revenue monitoring: cloud_poller pulls balance, MRR (monthly-normalized), active subscriptions, daily charges (succeeded/failed), volume, new customers and open disputes (restricted read-only key). New "Business — Revenue & Growth" dashboard + business alert rules.
- Two more dashboards: "Infra — Host & Containers" (USE) and "SLO — Availability & Error Budget" (error-budget burn, p50/p95/p99).
- Auth0 SSO: Grafana generic-OAuth against the admin.tlsstress.art tenant (dev-jddxc2f5cqlktay1.us.auth0.com); shared tenant session → no re-auth. Gated until the Auth0 app + 2 env values are set.
- Branded landing: tlsstress.art identity at /; Grafana moved to /grafana (serve_from_sub_path).
- Plausible: kept wired but off (API needs a paid plan); activates when the product is profitable.
- Full doc suite in docs/observability/: HLD, LLD, RUNBOOK, SSO-AUTH0, GUIDES.
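One way to read "MRR (monthly-normalized)" is sketched below: each active subscription's recurring amount is converted to a per-month figure before summing. The `Subscription` class is a hypothetical stand-in for the relevant Stripe subscription fields (`unit_amount` in the smallest currency unit, `quantity`, `interval`, `interval_count`), and the 365-day and 52-week conversion factors are assumptions. The ADR does not state how cloud_poller actually normalizes.

```python
from dataclasses import dataclass
from typing import Iterable

# Assumed conversion: how many months one billing interval represents.
MONTHS_PER_INTERVAL = {"day": 12 / 365, "week": 12 / 52, "month": 1.0, "year": 12.0}


@dataclass(frozen=True)
class Subscription:
    """Hypothetical, flattened view of one subscription item."""
    status: str
    unit_amount: int       # price per unit, in the smallest currency unit (e.g. cents)
    quantity: int
    interval: str          # "day", "week", "month" or "year"
    interval_count: int = 1


def monthly_amount(sub: Subscription) -> float:
    """Recurring amount expressed per month."""
    months = sub.interval_count * MONTHS_PER_INTERVAL[sub.interval]
    return sub.unit_amount * sub.quantity / months


def mrr(subscriptions: Iterable[Subscription]) -> float:
    """Sum the monthly-normalized amounts of active subscriptions only."""
    return sum(monthly_amount(s) for s in subscriptions if s.status == "active")
```

For example, a yearly plan at 12000 cents contributes 1000 cents per month, the same as a quarterly plan billed 3000 cents every three months.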
Activation status (2026-06-14, all LIVE in prod)
Cloudflare, Stripe, Auth0 SSO and AWS CloudWatch are live. Secrets are
handed off via AWS Secrets Manager (tlsstress-obs/*) → pulled and piped to the
box over SSH (never via chat). AWS uses a dedicated read-only IAM user
obs-cloudwatch-ro (CloudWatchReadOnlyAccess); the AWS — Infra dashboard
(aws-infra) is the 5th board. The CloudWatch datasource is provisioned with
authType: keys, the credentials in secureJsonData, and defaultRegion: $AWS_REGION
(the plain $VAR form, because provisioning does not interpolate
${VAR:-default}). Still pending: the logs pipeline (Firehose→Loki), traces
(App Runner OTLP→Tempo), and Plausible (deferred until profitable).
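For the pending logs pipeline, the sketch below shows roughly what the receiving side has to unpack when CloudWatch Logs is delivered through a Firehose HTTP endpoint. It assumes the commonly documented payload shape: a JSON envelope with `requestId` and `records`, each record base64-encoded and gzip-compressed, containing `logEvents`. In this design Vector does this work, so the function names and output fields are illustrative only and should be checked against the AWS delivery specification before use.

```python
import base64
import gzip
import json
from typing import Any, Dict, List, Tuple


def decode_firehose_body(body: bytes) -> Tuple[List[Dict[str, Any]], str]:
    """Return (flattened log lines, requestId) from one Firehose HTTP delivery."""
    envelope = json.loads(body)
    lines: List[Dict[str, Any]] = []
    for record in envelope.get("records", []):
        raw = base64.b64decode(record["data"])
        if raw[:2] == b"\x1f\x8b":  # gzip magic bytes
            raw = gzip.decompress(raw)
        payload = json.loads(raw)
        if payload.get("messageType") != "DATA_MESSAGE":
            continue  # skip control messages, which carry no application logs
        for event in payload.get("logEvents", []):
            lines.append({
                "log_group": payload.get("logGroup"),
                "log_stream": payload.get("logStream"),
                "timestamp_ms": event["timestamp"],
                "message": event["message"],
            })
    return lines, envelope["requestId"]


def build_ack(request_id: str, now_ms: int) -> str:
    """Response body echoing the request id, which Firehose expects on success."""
    return json.dumps({"requestId": request_id, "timestamp": now_ms})
```

The labels kept here (log group and stream) are the ones most likely to be useful as Loki stream labels, though that mapping is a design choice rather than something this ADR fixes.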
Cost unchanged (~€5/mo): Stripe/CF/CloudWatch APIs are free-tier; Plausible deferred.