LLD — Cloud Observability Platform

Low-Level Design. Per-component implementation: images, ports, volumes, config, env. Source of truth: observability/cloud/. Architecture & rationale: HLD. Operations: RUNBOOK.

1. Host

Property	Value
Provider	Hetzner Cloud
Type	cx23 (cx22 is deprecated)
Region	`hel1` (Helsinki)
Public IP	`89.167.3.1`
Hostname	`tlsstress-obs-hel1`
OS	Ubuntu (GNU userland)
Stack path	`/opt/obs`
Engine	Docker + Compose v2, project `tlsstress-obs`
SSH	`ssh -i ~/.ssh/tlsstress_f2_hetzner root@89.167.3.1` (always `-i`)
Firewall	Hetzner cloud firewall (Terraform) — inbound 22/80/443 only
DNS	`obs.tlsstress.art` → `89.167.3.1`, Cloudflare DNS-only (grey-cloud)

Grey-cloud is mandatory: Caddy's ACME HTTP-01 challenge must reach the origin directly. Orange-cloud (proxied) breaks issuance unless you switch to the Caddy Cloudflare-DNS plugin.

2. Service inventory

Service	Image	Host port	Internal	Volumes	Purpose
caddy	`caddy:2.8-alpine`	80, 443	—	Caddyfile, `landing/`, `caddy_data`, `caddy_config`	TLS + reverse proxy + static landing
grafana	`grafana/grafana:11.3.1`	—	3000	`provisioning/`, `dashboards/`, `grafana_data`	UI, alerting
prometheus	`prom/prometheus:v2.55.1`	—	9090	`prometheus/`, `prometheus_data`	metrics TSDB + rules
loki	`grafana/loki:3.2.1`	—	3100	`loki-config.yml`, `loki_data`	logs
tempo	`grafana/tempo:2.6.1`	—	3200, 4317, 4318	`tempo.yml`, `tempo_data`	traces
vector	`timberio/vector:0.42.0-alpine`	—	8686, 9009	`vector.yaml`	log intake → Loki
blackbox	`prom/blackbox-exporter:v0.25.0`	—	9115	`blackbox.yml`	synthetics
node-exporter	`prom/node-exporter:v1.8.2`	—	9100	`/:/host:ro`, `textfile`	host metrics + textfile
cadvisor	`gcr.io/cadvisor/cadvisor:v0.49.1`	—	8080	host ro mounts	container metrics (by cgroup `id`)
cloud-poller	`python:3.12-alpine`	—	—	`cloud_poller.py`, `textfile`	CF/Stripe/Plausible → textfile
alertmanager	`prom/alertmanager:v0.27.0`	—	9093	`alertmanager/`, `alertmanager_data`	alert routing → receivers (webhook/deadman)
pyroscope	`grafana/pyroscope:1.9.2`	—	4040	`pyroscope_data`	continuous profiling store (4th pillar)
alloy	`grafana/alloy:v1.5.1`	—	12345, 12347	`alloy/config.alloy`	pprof→Pyroscope + Faro RUM receiver
logs-shipper	`python:3.12-alpine`	—	—	`logs_shipper.py`, `logs_ckpt`, `pip_cache`	CloudWatch Logs + CloudTrail → Loki ($0 pull)
aws-cron-monitor	`python:3.12-alpine`	—	—	`aws_cron_monitor.py`, `aws-cron-checks.json`	CloudWatch → healthchecks.io per cron Lambda
docker-names	`docker:28-cli`	—	—	`docker_names.sh`, docker.sock (ro), `textfile`	container short-id → name textfile metric
pushgateway	`prom/pushgateway:v1.10.0`	—	9091	`pushgateway_data`	sink for external-vantage synthetic pushes (scraped `honor_labels`)
status-gen	`python:3.12-alpine`	—	—	`status_gen.py`, `status/`	writes `status.json` for the public `/status` page (+ SLA)
k6-journey	`grafana/k6:0.54.0`	—	—	`synthetics/journey.js`	box-vantage multi-step journey → pushgateway (`journey_*`)
backup-r2	`python:3.12-alpine`	—	—	`backup_r2.py`, state vols (ro), `textfile`, `pip_cache`	daily off-box backup → Cloudflare R2 (obs/CF/Stripe/AWS/TBI/GitHub/Auth0) + `backup_r2_*` metrics

Off-box backup (R2) — see DR-RESTORE-RUNBOOK: the backup-r2 service tars the state volumes + exports Cloudflare/Stripe/Auth0 config and mirrors the S3 TBI images + GitHub bundles to r2://tlsstress-obs-backups/ daily, with size-aware retention (≤9 GB, never exceeds the 10 GB free tier) and a healthchecks.io dead-man's switch. It emits backup_r2_* Prometheus metrics (freshness, per-source bytes/objects, per-source success + last-success freshness so a single source failing — e.g. GitHub — is non-silent, total usage, backup_r2_bucket_public + backup_r2_bucket_lock_enabled security gauges) → dashboard backups-dr + alert group backup_dr (incl. R2SourceBackupFailed/R2SourceBackupStale). The RDS pg_dump is the one producer that can't run here (RDS is private) → a scheduled in-VPC container Lambda obs-rds-backup-r2 (backup-rds-lambda/). Creds are dedicated read-only (R2 token, obs-backup-ro IAM, GitHub PAT, Stripe RO key, Auth0 M2M); no secret VALUES go to R2.

Anti-ransomware (immutability): ensure_protection() applies an R2 Bucket Lock retention rule via the Cloudflare API (R2's S3 API implements neither versioning nor Object Lock) so every object is immutable — no delete, no overwrite — for LOCK_RETENTION_DAYS (default 7, kept < KEEP_DAILY so prune only removes unlocked objects). A stolen upload credential therefore can't wipe or encrypt recent DR copies (verified: delete/overwrite → ObjectLockedByBucketPolicy). All writes are write-once (dated github bundles + HEAD-skip on same-day re-runs); R2BucketLockDisabled alerts on tampering.
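The interaction between retention and the lock window can be sketched as below. This is an illustration, not the logic in backup_r2.py: the function name, the tuple layout, and the exact pruning order (expired objects first, then oldest unlocked objects while over the size budget) are assumptions. It only shows why keeping LOCK_RETENTION_DAYS below KEEP_DAILY means prune never has to touch a locked object.

```python
from datetime import date


def select_prunable(objects, today, keep_daily, lock_days, max_bytes):
    """Return keys that a prune pass could delete, oldest first.

    objects: list of (key, day, size_bytes) tuples (hypothetical layout).
    """
    if lock_days >= keep_daily:
        # Otherwise an "expired" object could still be under Bucket Lock.
        raise ValueError("lock_days must stay below keep_daily")
    total = sum(size for _, _, size in objects)
    doomed = []
    for key, day, size in sorted(objects, key=lambda o: o[1]):
        age = (today - day).days
        expired = age >= keep_daily
        # Size pressure may only remove objects whose lock has lapsed.
        over_budget = total > max_bytes and age >= lock_days
        if expired or over_budget:
            doomed.append(key)
            total -= size
    return doomed
```

Every key returned has an age of at least lock_days, so a delete call for it would not hit ObjectLockedByBucketPolicy.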

Named volumes: caddy_data, caddy_config, grafana_data, prometheus_data, pushgateway_data, loki_data, tempo_data, textfile, alertmanager_data, pyroscope_data, logs_ckpt, pip_cache.

Datasources now also include Pyroscope (uid pyroscope) and Alertmanager (uid alertmanager).
Alerting: Prometheus rules (alerts.yml + slo-rules.yml) → Alertmanager → receivers; webhook/dead-man's URLs are gated url_files under alertmanager/secrets/ (gitignored).
Logs ($0, no Firehose): logs-shipper pulls CloudWatch FilterLogEvents (covered by CloudWatchReadOnlyAccess) + CloudTrail LookupEvents (AWSCloudTrail_ReadOnlyAccess) and pushes to Loki, labelling auth events security="auth".
Profiling: Alloy scrapes the stack's /debug/pprof → Pyroscope (dogfood; same pattern instruments the SaaS apps).
RUM: Alloy faro.receiver on /faro (CORS-allowed for the app domains) → loki.process → Loki. The loki.process stage promotes Faro app_name to a Loki label app (splits customer-app/admin-console/marketing) and emits Web Vitals as Prometheus histograms, exposed on Alloy /metrics with the loki_process_custom_ prefix: loki_process_custom_faro_web_vitals_{lcp_ms,ttfb_ms,cls}_bucket + loki_process_custom_faro_exceptions_total (labelled by app). Dashboard rum-frontend and the rum alert group use those names.
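For the log path, a push to Loki boils down to one JSON document per batch. The sketch below builds the body for Loki's `/loki/api/v1/push` endpoint; the helper name and the `source` label are illustrative, and only the `security="auth"` label comes from the description above. It is not the code in logs_shipper.py.

```python
import json
import time


def loki_push_body(lines, labels, now_ns=None):
    """Build a Loki push payload: one stream, one entry per log line."""
    # Loki expects the timestamp as a string of nanoseconds since the epoch.
    ts = str(now_ns if now_ns is not None else time.time_ns())
    stream = {"stream": dict(labels), "values": [[ts, line] for line in lines]}
    return json.dumps({"streams": [stream]}).encode()


body = loki_push_body(
    ["ConsoleLogin failure"],
    {"source": "cloudtrail", "security": "auth"},  # "source" is a made-up label
)
```

The body would be POSTed to `http://loki:3100/loki/api/v1/push` with `Content-Type: application/json`.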

3. Caddy (`caddy/Caddyfile`)

Routes on {$OBS_DOMAIN}:

Match	Upstream	Notes
`/ingest*`	`vector:9009`	Firehose / Logpush log delivery
`/faro*`	`alloy:12347`	RUM (Faro Web SDK), prefix stripped
`/otlp*`	`tempo:4318`	traces (OTLP-HTTP), token-gated (`$OTLP_TOKEN`), prefix stripped
`/grafana*`	`grafana:3000`	prefix preserved (Grafana `serve_from_sub_path`)
`/` (fallback)	`file_server /srv/landing`	branded landing, `try_files … /index.html`

Security headers: HSTS (preload), X-Content-Type-Options, Referrer-Policy, X-Frame-Options: SAMEORIGIN, -Server. Compression: zstd gzip.

Gotcha: use handle /grafana* (not handle_path) — Grafana with serve_from_sub_path=true and root_url=…/grafana/ expects the /grafana prefix on incoming requests. Stripping it → broken assets/redirects.

4. Grafana (env in `docker-compose.yml`)

Env	Value	Why
`GF_SERVER_ROOT_URL`	`https://obs.tlsstress.art/grafana/`	correct link/asset generation under subpath
`GF_SERVER_SERVE_FROM_SUB_PATH`	`true`	serve under `/grafana`
`GF_SERVER_DOMAIN`	`obs.tlsstress.art`	—
`GF_SERVER_ENFORCE_DOMAIN`	`false`	enforce+no-domain = 301 loop behind proxy
`GF_AUTH_GENERIC_OAUTH_*`	Auth0 (gated)	see SSO-AUTH0
`GF_AWS_default_*`	RO key (gated)	CloudWatch datasource auth
`GF_USERS_ALLOW_SIGN_UP`	`false`	no self-registration

Datasources (grafana/provisioning/datasources/datasources.yml): Prometheus (default, uid prometheus), Loki (loki), Tempo (tempo), CloudWatch (cloudwatch). Cross-links wired: Prometheus exemplars → Tempo, Loki derived fields → Tempo, Tempo → Loki/Prometheus (traces↔logs↔metrics).

CloudWatch auth: authType: keys with secureJsonData.accessKey/secretKey interpolated from env $AWS_RO_ACCESS_KEY_ID/$AWS_RO_SECRET_ACCESS_KEY and defaultRegion: $AWS_REGION. Those exact env names are exposed to the Grafana container in docker-compose.yml. The IAM user is obs-cloudwatch-ro (CloudWatchReadOnlyAccess).

Gotcha (datasource): Grafana provisioning interpolates $VAR/${VAR} but not the bash ${VAR:-default} form (→ "missing default region"); and authType: keys needs the keys in secureJsonData — the legacy GF_AWS_default_* env is not used by this path.
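One way to catch the interpolation gotcha before deploy is a small lint over the provisioning files. The sketch below is hypothetical (it is not part of the repo) and flags any bash-style `${VAR:-default}` expression, which Grafana would not expand; the region value in the sample is only an example.

```python
import re

# Matches ${NAME:-anything}, the bash default form Grafana does not interpolate.
BASH_DEFAULT = re.compile(r"\$\{[A-Za-z_][A-Za-z0-9_]*:-[^}]*\}")


def find_bash_defaults(text):
    """Return (line_number, expression) for every unsupported expression."""
    return [
        (number, match.group(0))
        for number, line in enumerate(text.splitlines(), 1)
        for match in BASH_DEFAULT.finditer(line)
    ]


sample = (
    "defaultRegion: ${AWS_REGION:-us-east-1}\n"
    "accessKey: $AWS_RO_ACCESS_KEY_ID\n"
    "region: ${AWS_REGION}\n"
)
print(find_bash_defaults(sample))
```

Plain `$VAR` and `${VAR}` pass through the check untouched, matching what provisioning supports.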

Gotcha (dashboard panels): CloudWatch panel targets must carry the full builder schema — metricQueryType: 0, metricEditorMode: 0, queryMode: "Metrics" and an explicit region (not "default"). Omit them and Grafana 11 falls into Metrics-Insights (SQL) mode with an empty expression → every panel renders "No data" even though the datasource health is OK. See grafana/dashboards/aws-infra.json for the canonical target.
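The panel requirement lends itself to a mechanical check. Below is a minimal sketch of a validator for a single CloudWatch target dict; the function name is an assumption, and it only checks the four fields named above, not the rest of the target schema.

```python
REQUIRED = {"metricQueryType": 0, "metricEditorMode": 0, "queryMode": "Metrics"}


def target_problems(target):
    """List what is missing from one CloudWatch panel target."""
    problems = [
        f"{key} must be {value!r}"
        for key, value in REQUIRED.items()
        if target.get(key) != value
    ]
    # "default" is rejected on purpose: the region has to be spelled out.
    if target.get("region") in (None, "", "default"):
        problems.append("region must be explicit")
    return problems
```

Running it over every target in grafana/dashboards/*.json (for example in CI) would catch a panel that would otherwise render "No data".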

Dashboards (6) auto-provisioned from grafana/dashboards/*.json via grafana/provisioning/dashboards/dashboards.yml: cloud-overview, infra-host-use, slo-synthetics, business-revenue, aws-infra, rum-frontend (Loki/LogQL — Core Web Vitals p75, JS errors, sessions).

5. Prometheus (`prometheus/prometheus.yml`)

scrape_interval: 30s, external label stack=tlsstress-cloud-obs.
Retention: --storage.tsdb.retention.time=${PROM_RETENTION:-30d}.
--web.enable-remote-write-receiver — receives Tempo span-metrics.
Jobs: prometheus, node, cadvisor, blackbox-http (app/admin/marketing), blackbox-edge (F2 cell, http_edge_up module accepting 4xx).
Rules: prometheus/alerts.yml — groups synthetics, host, data_sources, business.

6. cloud-poller (`exporters/cloud_poller.py`)

Stdlib-only Python loop (no third-party deps). Each source is fail-soft: errors are caught, logged, and flagged via a *_scrape_up gauge; the loop never dies. Writes /textfile/cloud.prom atomically; node-exporter exposes it.
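The two properties described here, fail-soft sources and an atomic textfile write, can be sketched with the standard library alone. The function names and the shape of `sources` (a dict of name to callable returning exposition lines) are assumptions for illustration; the real cloud_poller.py may be organised differently.

```python
import os
import tempfile


def collect(sources):
    """Run every source; a failure only flips that source's *_scrape_up to 0."""
    lines = []
    for name, fetch in sources.items():
        try:
            lines.extend(fetch())
            up = 1
        except Exception as exc:  # fail-soft: log and keep going
            print(f"{name} scrape failed: {exc}")
            up = 0
        lines.append(f"{name}_scrape_up {up}")
    return "\n".join(lines) + "\n"


def write_atomic(path, text):
    """Write to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        os.chmod(tmp, 0o644)  # mkstemp creates 0600; the exporter must read it
        os.replace(tmp, path)  # atomic on the same filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

Because the rename is atomic, node-exporter sees either the previous complete file or the new complete file, never a half-written one. The `.tmp` suffix keeps the temp file out of the textfile collector, which only reads `*.prom`.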

Env	Meaning
`CF_API_TOKEN`	Cloudflare Analytics:Read token (zones auto-discovered)
`CF_ZONE_TAGS`	optional explicit zone IDs (skip discovery)
`STRIPE_API_KEY`	Stripe restricted read-only key (`rk_live_…`)
`PLAUSIBLE_API_KEY` / `PLAUSIBLE_SITE_IDS`	Plausible (deferred)
`POLL_INTERVAL_SECONDS`	default 300

Metrics emitted

Metric	Labels	Source
`cf_requests_total`, `cf_bytes_total`, `cf_cached_requests_total`, `cf_threats_total`, `cf_encrypted_requests_total`, `cf_unique_visitors`	`zone`	Cloudflare
`cf_scrape_up`	—	Cloudflare
`stripe_balance_available_cents`, `stripe_balance_pending_cents`	`currency`	Stripe
`stripe_active_subscriptions`	—	Stripe
`stripe_mrr_cents`	`currency`	Stripe (monthly-normalized)
`stripe_charges_today`	`status`	Stripe
`stripe_charge_volume_today_cents`	`currency`	Stripe
`stripe_new_customers_today`, `stripe_open_disputes`	—	Stripe
`stripe_scrape_up`	—	Stripe
`plausible_*`, `plausible_scrape_up`	`site`	Plausible (when on)
`cloud_poller_last_run_timestamp_seconds`	—	poller heartbeat

Cloudflare GraphQL gotchas (already fixed in code): the date variable must be typed Date! (not String!), and orderBy:[date_DESC] is rejected — pin the day with filter:{date:$since} + limit:1. A GraphQL error returns HTTP 200 with {"errors":[…]}; the poller surfaces it as cf_scrape_up=0.
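Since a GraphQL failure arrives as HTTP 200, the status code alone cannot drive cf_scrape_up. A minimal sketch of the check, assuming a helper that receives the status and raw body (the name and the exception type are illustrative):

```python
import json


def graphql_data(status, body):
    """Return the "data" member, or raise if the call failed at either layer."""
    if status != 200:
        raise RuntimeError(f"HTTP {status}")
    payload = json.loads(body)
    errors = payload.get("errors")
    if errors:  # absent, null, or [] all mean success
        messages = "; ".join(
            str(e.get("message", e)) if isinstance(e, dict) else str(e)
            for e in errors
        )
        raise RuntimeError(f"GraphQL error: {messages}")
    return payload.get("data")
```

In a fail-soft loop, the caller would catch the exception, log it, and emit `cf_scrape_up 0`.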

Stripe specifics

Auth: Authorization: Bearer <rk_live_…>.
Cursor pagination capped at 20 pages (runaway guard); truncation flagged via stripe_*_truncated.
MRR normalization: day×365/12, week×52/12, month×1, year÷12, divided by interval_count.
Amounts in cents — divide by 100 in dashboards (unit currencyBRL).
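The normalization factors and the page cap above can be written out as follows. This is a sketch: `monthly_cents`, `paginate`, and the `fetch_page(cursor) -> (items, next_cursor)` callback are hypothetical names, not the poller's actual functions.

```python
from fractions import Fraction

# Billing periods per month, matching the factors listed above.
PER_MONTH = {
    "day": Fraction(365, 12),
    "week": Fraction(52, 12),
    "month": Fraction(1),
    "year": Fraction(1, 12),
}


def monthly_cents(amount_cents, interval, interval_count=1):
    """Normalize one subscription price to cents per month."""
    return float(amount_cents * PER_MONTH[interval] / interval_count)


def paginate(fetch_page, max_pages=20):
    """Follow cursors for at most max_pages; report whether we stopped early."""
    items, cursor = [], None
    for _ in range(max_pages):
        page, cursor = fetch_page(cursor)
        items.extend(page)
        if cursor is None:
            return items, False
    return items, True  # cap hit: the caller sets a *_truncated gauge


print(monthly_cents(12000, "year"))      # 1000.0
print(monthly_cents(3000, "month", 3))   # 1000.0
```

A yearly plan of 12000 cents and a quarterly plan of 3000 cents (month with interval_count 3) both contribute 1000 cents of MRR.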

7. Loki / Tempo / Vector / blackbox

Loki (loki/loki-config.yml): single-binary, filesystem, auth_enabled:false (private network), retention 720h (30d), compactor with retention enabled.
Tempo (tempo/tempo.yml): OTLP grpc/http receivers, local storage, 168h (7d) retention, metrics-generator (service-graphs, span-metrics) remote-writing to Prometheus.
Vector (vector/vector.yaml): http_server source on :9009 (no source-level auth — Firehose uses X-Amz-Firehose-Access-Key, enforced at Caddy), remap transform tags auth/security events, Loki sink with stack/source_app/security labels.
blackbox (blackbox/blackbox.yml): http_2xx (strict: 2xx/3xx/401/403) and http_edge_up (also 4xx, for non-public infra like the F2 cell).
cAdvisor + container names: this host runs Docker with the containerd snapshotter + cgroup v2, so cAdvisor cannot resolve the friendly name label — its Docker handler needs the legacy layerdb (absent), and on v0.52+ it drops container metrics entirely (failed to identify the read-write layer ID). So we pin cAdvisor v0.49.1 (Docker factory stays unregistered → the raw/systemd factory still exports the docker-<hash>.scope cgroups, keyed by id), and the docker-names sidecar (docker ps over a read-only socket) emits container_name_info{short_id,name} to the textfile. The Infra/USE dashboard joins them: … label_replace(…, "short_id", "$1", "id", ".*docker-(.{12}).*") * on(short_id) group_left(name) container_name_info.

Gotcha: don't "upgrade" cAdvisor to fix names here — v0.52+ makes it worse (drops the container series). The join is the fix.
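The join key can be checked offline. The sketch below applies the same regular expression that the dashboard's label_replace uses to a cgroup `id` and looks the result up in a name map; the sample id and the name map are made up for illustration.

```python
import re

# Same pattern as the dashboard query; label_replace anchors it at both ends.
SHORT_ID = re.compile(r".*docker-(.{12}).*")


def short_id(cgroup_id):
    """Extract the 12-character container id, or None if this is not a container."""
    match = SHORT_ID.fullmatch(cgroup_id)
    return match.group(1) if match else None


# What the docker-names sidecar publishes, modelled as a dict (illustrative).
names = {"0123456789ab": "grafana"}

cgroup = "/system.slice/docker-" + "0123456789ab" + "c" * 52 + ".scope"
print(names.get(short_id(cgroup)))
```

Series whose `id` does not contain a `docker-<hash>` segment get no usable key and drop out of the join, which is the intended behaviour for non-container cgroups.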

8. Network & security model

flowchart TB
  NET["Internet"] -->|"22 / 80 / 443 only<br/>(Hetzner firewall)"| HOST
  subgraph HOST["Hetzner VM"]
    CADDY["caddy :80/:443 (only published ports)"]
    subgraph DKR["docker net: tlsstress-obs_default"]
      ALL["grafana · prometheus · loki · tempo<br/>vector · blackbox · node-exporter · cadvisor · cloud-poller<br/>(expose only — never published)"]
    end
    CADDY --- ALL
  end

No service except Caddy binds a host port. Internal services use expose.
.env is chmod 600, gitignored; secrets are dedicated read-only keys.
Grafana local-admin login remains as break-glass even with SSO/auto_login on.

9. Status page · multi-vantage synthetics · security & predictive alerting

Public status page — status-gen (exporters/status_gen.py, stdlib) queries Prometheus every STATUS_INTERVAL and writes status/status.json (overall, per-surface uptime_{24h,7d,30d} + latency_ms, and an slo block: target 99.9%, 30d uptime, error-budget remaining). status/index.html renders it client-side; Caddy handle_path /status* serves /srv/status. Routes added to the Caddyfile: /status* (static) and /synthetic-push/* (token-gated → pushgateway).
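The slo block's arithmetic can be illustrated with the conventional error-budget formula. Whether status_gen.py computes it exactly this way is an assumption; the function names are illustrative.

```python
def error_budget_remaining(uptime, target=0.999):
    """Fraction of the error budget left; negative means the SLO is blown."""
    allowed = 1.0 - target   # downtime fraction the SLO tolerates
    spent = 1.0 - uptime     # downtime fraction actually observed
    return 1.0 - spent / allowed


def budget_minutes(days=30, target=0.999):
    """Total downtime the SLO allows over the window, in minutes."""
    return days * 24 * 60 * (1.0 - target)


print(round(budget_minutes(), 1))                    # 43.2
print(round(error_budget_remaining(0.9995), 3))      # 0.5
```

At a 99.9% target over 30 days the budget is about 43.2 minutes, so a measured uptime of 99.95% leaves half of it.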

Multi-vantage synthetics — three independent vantages, two metric families:
journey_* (box k6 journey): journey_step_up{step,vantage}, journey_step_challenged{step,vantage}, journey_step_duration_seconds, journey_up, journey_total_duration_seconds, journey_last_run_timestamp_seconds. The k6-journey service runs synthetics/journey.js every JOURNEY_INTERVAL and pushes to the internal pushgateway (group job=synthetic-journey/vantage=<v>).
synthetic_probe_* (external vantages): _success, _challenged, _duration_seconds, _status_code, _last_run_timestamp_seconds, labelled by region+target. Pushed via Caddy /synthetic-push (Bearer $SYNTHETIC_PUSH_TOKEN) → pushgateway; Prometheus scrapes it with honor_labels: true. Sources: synthetic-multivantage.yml (Brazil self-hosted runners) and the Cloudflare edge worker (synthetics/edge-worker/).
Challenge awareness: datacenter vantages get a CF managed challenge (403) on CF-fronted surfaces → recorded as *_challenged=1 (not down). X-Tls-Synthetic: $SYNTHETIC_BYPASS_TOKEN + a CF WAF allow-rule lets a vantage reach the origin.
Dashboard synthetics-vantages; alerts JourneyStepDown, JourneyVantageChallenged, JourneyStale, EdgeVantageDown.
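The pushgateway grouping key maps directly onto the URL path. The sketch below builds (but does not send) a push request for the `job=synthetic-journey/vantage=<v>` group; the helper name, the vantage value `box`, and the choice of PUT are assumptions, and the real pushes come from k6 and the external runners rather than Python.

```python
import urllib.request
from urllib.parse import quote


def push_request(base_url, job, vantage, samples, token=None):
    """Build a Pushgateway request; PUT replaces every metric in the group."""
    url = (
        f"{base_url.rstrip('/')}/metrics"
        f"/job/{quote(job, safe='')}/vantage/{quote(vantage, safe='')}"
    )
    # Text exposition format: one "name value" pair per line.
    body = "".join(f"{name} {value}\n" for name, value in samples.items()).encode()
    request = urllib.request.Request(url, data=body, method="PUT")
    request.add_header("Content-Type", "text/plain; version=0.0.4")
    if token:  # only needed on the token-gated external route
        request.add_header("Authorization", f"Bearer {token}")
    return request


req = push_request("http://pushgateway:9091", "synthetic-journey", "box", {"journey_up": 1})
print(req.get_method(), req.full_url)
```

Because the vantage lives in the grouping key, each vantage overwrites only its own group, and `honor_labels: true` on the scrape keeps those labels intact.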

Security/log alerting (Loki ruler) — loki-config.yml ruler: (storage local → /etc/loki/rules, alertmanager_url: http://alertmanager:9093, enable_alertmanager_v2). Rules in loki/rules/fake/security.yml (tenant fake): AuthFailureSpike/Flood, CloudTrailRootActivity, CloudTrailUnauthorizedBurst, CloudTrailSensitiveChange. They fire to the same Slack receiver as Prometheus.

Predictive alerts — prometheus/alerts.yml group predictive: HostDiskWillFillSoon / HostMemoryWillExhaust (predict_linear crossing 0 within the horizon) + TLSCertRenewalDue (<30d). TLS posture dashboard tls-posture (cert days-to-expiry, negotiated probe_tls_version_info, chain).
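For intuition about what the predictive rules evaluate, here is a small least-squares extrapolation in the spirit of predict_linear. It is a simplification: it treats the last sample's timestamp as "now", whereas Prometheus extrapolates from the evaluation time, and the sample values are invented.

```python
def predict_linear(samples, horizon_s):
    """Fit a line through (timestamp_s, value) pairs and extrapolate.

    Needs at least two samples with distinct timestamps.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    last_t = samples[-1][0]
    return mean_v + slope * (last_t + horizon_s - mean_t)


# Free bytes shrinking by 1 per second: 880 left at t=120.
free = [(0, 1000), (60, 940), (120, 880)]
will_fill = predict_linear(free, 1000) < 0
print(will_fill)
```

With 880 units left and a slope of -1 per second, the line crosses zero 880 seconds out, so any horizon beyond that makes the condition true.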

Deploy markers — .github/actions/grafana-annotate posts a Grafana annotation (tags deploy,<app>) at the end of customer-app-deploy.yml / admin-console-deploy.yml; token secrets.GRAFANA_ANNOTATION_TOKEN (Editor SA ci-deploy-annotator). No-ops without the token.
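A deploy marker is a single call to Grafana's annotations endpoint. The sketch below builds that request without sending it; the function name and the annotation text are assumptions, and the actual composite action may send additional fields.

```python
import json
import time
import urllib.request


def annotation_request(grafana_url, token, app, text, now_ms=None):
    """Build a POST to /api/annotations, or return None when no token is set."""
    if not token:
        return None  # mirrors the "no-ops without the token" behaviour
    payload = {
        "time": now_ms if now_ms is not None else int(time.time() * 1000),
        "tags": ["deploy", app],
        "text": text,
    }
    request = urllib.request.Request(
        f"{grafana_url.rstrip('/')}/api/annotations",
        data=json.dumps(payload).encode(),
        method="POST",
    )
    request.add_header("Authorization", f"Bearer {token}")
    request.add_header("Content-Type", "application/json")
    return request
```

Note that the base URL includes the `/grafana` sub-path, for example `https://obs.tlsstress.art/grafana/`, because Grafana serves its API under the same prefix.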

10. App DB readiness · `AppDBDown` · self-healing · multi-DB backup

Born from the 2026-06-16 admin "DB indisponível" incident: the RDS managed master-secret auto-rotation (every 7d) changed the tlsstress_admin password; the admin-console connected as that master via a static secret → auth failed. It was invisible for ~33h because /api/health returns 200 without touching the DB and the app degraded as a graceful 200. Fixes, all live/shipping:

Dedicated non-master role — admin now connects as octopus_admin_app (owns the octopus_admin tables), so master rotation can never break it again. (customer-app already uses the dedicated tlsstress_app role.)
Deep readiness /api/ready — admin-console (checks admin_db + token_economy_db) and customer-app run SELECT 1 per pool → HTTP 503 if any DB is unreachable (vs the shallow /api/health). On the admin it must be in proxy.ts PUBLIC_PATHS (the proxy gates all /api/*), else it 401s.
Detection — Prometheus job blackbox-ready probes both /api/ready URLs; alert AppDBDown = probe_success{job="blackbox-ready"} == 0 (SiteDown is scoped to blackbox-(http|edge) so they don't overlap). blackbox-http also now probes obs.tlsstress.art + status.tlsstress.art.
Self-healing — obs-db-selfheal Lambda (observability/cloud/selfheal-lambda/, zip, no VPC) on an EventBridge schedule probes each /api/ready; on 503 (app up, DB down) it triggers an App Runner deployment (re-fetches the DB secret + rebuilds the pool), cooldown-guarded against loops. Heals only on 503 (404/403/timeout → no-op). IAM: apprunner:StartDeployment|ListOperations|DescribeService.
Multi-DB RDS backup — backup_aws_r2.py dumps every DB in DB_BACKUP_SECRETS (the managed master → postgres, and admin-database-url → octopus_admin, dumped as its owner role) to aws/<date>/rds-<db>-<HHMMSSZ>.sql.gz. Previously only postgres was dumped, so octopus_admin (admin data) had no backup.
Edge alert calibration — EdgeVantageDown excludes managed-challenge results (challenged=1) and the WAF-locked admin surface → no more chronic false-positives.
Stripe zero-explicit — cloud_poller.py emits stripe_mrr_cents / stripe_charge_volume_today_cents = 0 pre-revenue (panels show 0, not "No data"); the gated-off Plausible panels were removed from the dashboards.
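The self-healing rule above (redeploy only on 503, and only outside the cooldown) reduces to a small pure function. This is a sketch of the decision only; the name, the use of None for a timeout, and the way the last-heal time is stored are assumptions, not the Lambda's code.

```python
def should_heal(status, now, last_heal, cooldown_s):
    """Decide whether to trigger a redeploy for one /api/ready probe.

    status: HTTP status code, or None when the probe timed out.
    now / last_heal: epoch seconds; last_heal is None if never healed.
    """
    if status != 503:
        # 200 is healthy; 404/403/timeout are ambiguous, so do nothing.
        return False
    return last_heal is None or now - last_heal >= cooldown_s


print(should_heal(503, now=1000, last_heal=None, cooldown_s=900))   # True
print(should_heal(503, now=1000, last_heal=500, cooldown_s=900))    # False
```

Keeping the decision separate from the App Runner call makes the loop guard easy to unit-test.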