
LLD — Cloud Observability Platform

Low-Level Design. Per-component implementation: images, ports, volumes, config, env. Source of truth: observability/cloud/. Architecture & rationale: HLD. Operations: RUNBOOK.


1. Host

| Property | Value |
| --- | --- |
| Provider | Hetzner Cloud |
| Type | cx23 (cx22 is deprecated) |
| Region | hel1 (Helsinki) |
| Public IP | 89.167.3.1 |
| Hostname | tlsstress-obs-hel1 |
| OS | Ubuntu (GNU userland) |
| Stack path | /opt/obs |
| Engine | Docker + Compose v2, project tlsstress-obs |
| SSH | ssh -i ~/.ssh/tlsstress_f2_hetzner root@89.167.3.1 (always -i) |
| Firewall | Hetzner cloud firewall (Terraform), inbound 22/80/443 only |
| DNS | obs.tlsstress.art → 89.167.3.1, Cloudflare DNS-only (grey-cloud) |

Grey-cloud is mandatory: Caddy's ACME HTTP-01 challenge must reach the origin directly. Orange-cloud (proxied) breaks issuance unless you switch to the Caddy Cloudflare-DNS plugin.


2. Service inventory

| Service | Image | Host port | Internal | Volumes | Purpose |
| --- | --- | --- | --- | --- | --- |
| caddy | caddy:2.8-alpine | 80, 443 | | Caddyfile, landing/, caddy_data, caddy_config | TLS + reverse proxy + static landing |
| grafana | grafana/grafana:11.3.1 | | 3000 | provisioning/, dashboards/, grafana_data | UI, alerting |
| prometheus | prom/prometheus:v2.55.1 | | 9090 | prometheus/, prometheus_data | metrics TSDB + rules |
| loki | grafana/loki:3.2.1 | | 3100 | loki-config.yml, loki_data | logs |
| tempo | grafana/tempo:2.6.1 | | 3200, 4317, 4318 | tempo.yml, tempo_data | traces |
| vector | timberio/vector:0.42.0-alpine | | 8686, 9009 | vector.yaml | log intake → Loki |
| blackbox | prom/blackbox-exporter:v0.25.0 | | 9115 | blackbox.yml | synthetics |
| node-exporter | prom/node-exporter:v1.8.2 | | 9100 | /:/host:ro, textfile | host metrics + textfile |
| cadvisor | gcr.io/cadvisor/cadvisor:v0.49.1 | | 8080 | host ro mounts | container metrics (by cgroup id) |
| cloud-poller | python:3.12-alpine | | | cloud_poller.py, textfile | CF/Stripe/Plausible → textfile |
| alertmanager | prom/alertmanager:v0.27.0 | | 9093 | alertmanager/, alertmanager_data | alert routing → receivers (webhook/deadman) |
| pyroscope | grafana/pyroscope:1.9.2 | | 4040 | pyroscope_data | continuous profiling store (4th pillar) |
| alloy | grafana/alloy:v1.5.1 | | 12345, 12347 | alloy/config.alloy | pprof→Pyroscope + Faro RUM receiver |
| logs-shipper | python:3.12-alpine | | | logs_shipper.py, logs_ckpt, pip_cache | CloudWatch Logs + CloudTrail → Loki ($0 pull) |
| aws-cron-monitor | python:3.12-alpine | | | aws_cron_monitor.py, aws-cron-checks.json | CloudWatch → healthchecks.io per cron Lambda |
| docker-names | docker:28-cli | | | docker_names.sh, docker.sock (ro), textfile | container short-id → name textfile metric |
| pushgateway | prom/pushgateway:v1.10.0 | | 9091 | pushgateway_data | sink for external-vantage synthetic pushes (scraped honor_labels) |
| status-gen | python:3.12-alpine | | | status_gen.py, status/ | writes status.json for the public /status page (+ SLA) |
| k6-journey | grafana/k6:0.54.0 | | | synthetics/journey.js | box-vantage multi-step journey → pushgateway (journey_*) |
| backup-r2 | python:3.12-alpine | | | backup_r2.py, state vols (ro), textfile, pip_cache | daily off-box backup → Cloudflare R2 (obs/CF/Stripe/AWS/TBI/GitHub/Auth0) + backup_r2_* metrics |

Off-box backup (R2) — see DR-RESTORE-RUNBOOK: the backup-r2 service tars the state volumes + exports Cloudflare/Stripe/Auth0 config and mirrors the S3 TBI images + GitHub bundles to r2://tlsstress-obs-backups/ daily, with size-aware retention (≤9 GB, never exceeds the 10 GB free tier) and a healthchecks.io dead-man's switch. It emits backup_r2_* Prometheus metrics (freshness, per-source bytes/objects, per-source success + last-success freshness so a single source failing — e.g. GitHub — is non-silent, total usage, backup_r2_bucket_public + backup_r2_bucket_lock_enabled security gauges) → dashboard backups-dr + alert group backup_dr (incl. R2SourceBackupFailed/R2SourceBackupStale). The RDS pg_dump is the one producer that can't run here (RDS is private) → a scheduled in-VPC container Lambda obs-rds-backup-r2 (backup-rds-lambda/). Creds are dedicated read-only (R2 token, obs-backup-ro IAM, GitHub PAT, Stripe RO key, Auth0 M2M); no secret VALUES go to R2.

Anti-ransomware (immutability): ensure_protection() applies an R2 Bucket Lock retention rule via the Cloudflare API (R2's S3 API implements neither versioning nor Object Lock) so every object is immutable — no delete, no overwrite — for LOCK_RETENTION_DAYS (default 7, kept < KEEP_DAILY so prune only removes unlocked objects). A stolen upload credential therefore can't wipe or encrypt recent DR copies (verified: delete/overwrite → ObjectLockedByBucketPolicy). All writes are write-once (dated github bundles + HEAD-skip on same-day re-runs); R2BucketLockDisabled alerts on tampering.
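The relationship between LOCK_RETENTION_DAYS, KEEP_DAILY and the 9 GB ceiling can be sketched as a prune planner. This is a minimal illustration, not the code in backup_r2.py: `BackupObject`, `plan_prune` and the `KEEP_DAILY = 14` value are assumptions, and the two-pass order (age first, then size) is one plausible reading of "size-aware retention".

```python
from dataclasses import dataclass
from datetime import date

LOCK_RETENTION_DAYS = 7    # default named in the text
KEEP_DAILY = 14            # assumed value; the text only requires it to exceed the lock
MAX_BYTES = 9 * 1024**3    # the 9 GB ceiling from the text


@dataclass(frozen=True)
class BackupObject:        # hypothetical record for one R2 object
    key: str
    day: date              # date the object was written
    size: int              # bytes


def plan_prune(objects, today, keep_daily=KEEP_DAILY,
               lock_days=LOCK_RETENTION_DAYS, max_bytes=MAX_BYTES):
    """Return the objects to delete, oldest first, never touching locked ones."""
    if lock_days >= keep_daily:
        raise ValueError("lock retention must stay below KEEP_DAILY")
    oldest_first = sorted(objects, key=lambda o: o.day)
    # Pass 1: age-based retention. Anything this old is already unlocked.
    doomed = [o for o in oldest_first if (today - o.day).days > keep_daily]
    remaining = sum(o.size for o in oldest_first) - sum(o.size for o in doomed)
    # Pass 2: size-aware retention. Drop the oldest unlocked survivors
    # until the bucket fits the budget again.
    for obj in oldest_first:
        if remaining <= max_bytes:
            break
        if obj in doomed or (today - obj.day).days <= lock_days:
            continue
        doomed.append(obj)
        remaining -= obj.size
    return doomed
```

Because the planner refuses to run when the lock is not shorter than KEEP_DAILY, a prune can never attempt a delete that Bucket Lock would reject.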

Named volumes: caddy_data, caddy_config, grafana_data, prometheus_data, pushgateway_data, loki_data, tempo_data, textfile, alertmanager_data, pyroscope_data, logs_ckpt, pip_cache.

Datasources now also include Pyroscope (uid pyroscope) and Alertmanager (uid alertmanager).

Alerting: Prometheus rules (alerts.yml + slo-rules.yml) → Alertmanager → receivers; webhook/dead-man's URLs are gated url_files under alertmanager/secrets/ (gitignored).

Logs ($0, no Firehose): logs-shipper pulls CloudWatch FilterLogEvents (covered by CloudWatchReadOnlyAccess) + CloudTrail LookupEvents (AWSCloudTrail_ReadOnlyAccess) and pushes to Loki, labelling auth events security="auth".

Profiling: Alloy scrapes the stack's /debug/pprof → Pyroscope (dogfood; same pattern instruments the SaaS apps).

RUM: Alloy faro.receiver on /faro (CORS-allowed for the app domains) → loki.process → Loki. The loki.process stage promotes Faro app_name to a Loki label app (splits customer-app/admin-console/marketing) and emits Web Vitals as Prometheus histograms, exposed on Alloy /metrics with the loki_process_custom_ prefix: loki_process_custom_faro_web_vitals_{lcp_ms,ttfb_ms,cls}_bucket + loki_process_custom_faro_exceptions_total (labelled by app). Dashboard rum-frontend and the rum alert group use those names.


3. Caddy (caddy/Caddyfile)

Routes on {$OBS_DOMAIN}:

| Match | Upstream | Notes |
| --- | --- | --- |
| /ingest* | vector:9009 | Firehose / Logpush log delivery |
| /faro* | alloy:12347 | RUM (Faro Web SDK), prefix stripped |
| /otlp* | tempo:4318 | traces (OTLP-HTTP), token-gated ($OTLP_TOKEN), prefix stripped |
| /grafana* | grafana:3000 | prefix preserved (Grafana serve_from_sub_path) |
| / (fallback) | file_server /srv/landing | branded landing, try_files … /index.html |

Security headers: HSTS (preload), X-Content-Type-Options, Referrer-Policy, X-Frame-Options: SAMEORIGIN, -Server. Compression: zstd gzip.

Gotcha: use handle /grafana* (not handle_path) — Grafana with serve_from_sub_path=true and root_url=…/grafana/ expects the /grafana prefix on incoming requests. Stripping it → broken assets/redirects.


4. Grafana (env in docker-compose.yml)

| Env | Value | Why |
| --- | --- | --- |
| GF_SERVER_ROOT_URL | https://obs.tlsstress.art/grafana/ | correct link/asset generation under subpath |
| GF_SERVER_SERVE_FROM_SUB_PATH | true | serve under /grafana |
| GF_SERVER_DOMAIN | obs.tlsstress.art | |
| GF_SERVER_ENFORCE_DOMAIN | false | enforce+no-domain = 301 loop behind proxy |
| GF_AUTH_GENERIC_OAUTH_* | Auth0 (gated) | see SSO-AUTH0 |
| GF_AWS_default_* | RO key (gated) | CloudWatch datasource auth |
| GF_USERS_ALLOW_SIGN_UP | false | no self-registration |

Datasources (grafana/provisioning/datasources/datasources.yml): Prometheus (default, uid prometheus), Loki (loki), Tempo (tempo), CloudWatch (cloudwatch). Cross-links wired: Prometheus exemplars → Tempo, Loki derived fields → Tempo, Tempo → Loki/Prometheus (traces↔logs↔metrics).

CloudWatch auth: authType: keys with secureJsonData.accessKey/secretKey interpolated from env $AWS_RO_ACCESS_KEY_ID/$AWS_RO_SECRET_ACCESS_KEY and defaultRegion: $AWS_REGION. Those exact env names are exposed to the Grafana container in docker-compose.yml. The IAM user is obs-cloudwatch-ro (CloudWatchReadOnlyAccess).

Gotcha (datasource): Grafana provisioning interpolates $VAR/${VAR} but not the bash ${VAR:-default} form (→ "missing default region"); and authType: keys needs the keys in secureJsonData — the legacy GF_AWS_default_* env is not used by this path.
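A small pre-deploy lint can catch the unsupported `${VAR:-default}` form before Grafana does. The sketch below is a hypothetical helper (`find_bash_defaults` is not part of the repo), and the sample YAML with its `us-east-1` fallback is made up for illustration.

```python
import re

# Matches the bash-style default form ${NAME:-fallback}, which the text says
# Grafana provisioning does not expand. Plain $NAME and ${NAME} are not matched.
BASH_DEFAULT = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*):-[^}]*\}")


def find_bash_defaults(text):
    """Return (line_number, variable_name) for every ${VAR:-default} occurrence."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in BASH_DEFAULT.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits


# Illustrative input only; not the real datasources.yml.
sample = """\
jsonData:
  authType: keys
  defaultRegion: ${AWS_REGION:-us-east-1}
secureJsonData:
  accessKey: $AWS_RO_ACCESS_KEY_ID
  secretKey: ${AWS_RO_SECRET_ACCESS_KEY}
"""
print(find_bash_defaults(sample))  # [(3, 'AWS_REGION')]
```

Run against grafana/provisioning/datasources/datasources.yml, an empty list means every variable uses a form the provisioner interpolates.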

Gotcha (dashboard panels): CloudWatch panel targets must carry the full builder schema — metricQueryType: 0, metricEditorMode: 0, queryMode: "Metrics" and an explicit region (not "default"). Omit them and Grafana 11 falls into Metrics-Insights (SQL) mode with an empty expression → every panel renders "No data" even though the datasource health is OK. See grafana/dashboards/aws-infra.json for the canonical target.
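The required target fields lend themselves to a mechanical check. The following is a rough sketch of such a lint, assuming the usual dashboard JSON layout (`panels[].targets[]`, with a `datasource` object on the target or the panel); `lint_dashboard` and `is_cloudwatch` are illustrative names, and panels nested inside collapsed rows are not visited.

```python
REQUIRED = {"metricQueryType": 0, "metricEditorMode": 0, "queryMode": "Metrics"}


def is_cloudwatch(panel, target):
    """A target is CloudWatch if it, or its panel, names a cloudwatch datasource."""
    ds = target.get("datasource") or panel.get("datasource") or {}
    return isinstance(ds, dict) and ds.get("type") == "cloudwatch"


def lint_dashboard(dashboard):
    """Return (panel title, refId, problem) tuples for top-level panels."""
    problems = []
    for panel in dashboard.get("panels", []):
        for target in panel.get("targets", []):
            if not is_cloudwatch(panel, target):
                continue
            where = (panel.get("title", "?"), target.get("refId", "?"))
            for field, expected in REQUIRED.items():
                if target.get(field) != expected:
                    problems.append(where + (f"{field} must be {expected!r}",))
            if target.get("region") in (None, "", "default"):
                problems.append(where + ("region must be explicit",))
    return problems
```

Feeding it `json.load(open("grafana/dashboards/aws-infra.json"))` should return an empty list if that file really is the canonical target.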

Dashboards (6) auto-provisioned from grafana/dashboards/*.json via grafana/provisioning/dashboards/dashboards.yml: cloud-overview, infra-host-use, slo-synthetics, business-revenue, aws-infra, rum-frontend (Loki/LogQL — Core Web Vitals p75, JS errors, sessions).


5. Prometheus (prometheus/prometheus.yml)

  • scrape_interval: 30s, external label stack=tlsstress-cloud-obs.
  • Retention: --storage.tsdb.retention.time=${PROM_RETENTION:-30d}.
  • --web.enable-remote-write-receiver — receives Tempo span-metrics.
  • Jobs: prometheus, node, cadvisor, blackbox-http (app/admin/marketing), blackbox-edge (F2 cell, http_edge_up module accepting 4xx).
  • Rules: prometheus/alerts.yml — groups synthetics, host, data_sources, business.

6. cloud-poller (exporters/cloud_poller.py)

Stdlib-only Python loop (no third-party deps). Each source is fail-soft: errors are caught, logged, and flagged via a *_scrape_up gauge; the loop never dies. Writes /textfile/cloud.prom atomically; node-exporter exposes it.
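The fail-soft collection and the atomic write can be sketched in a few lines of stdlib Python. This is an illustration of the pattern, not an excerpt from cloud_poller.py: `collect`, `render` and `write_atomic` are hypothetical names, and the temp-file-plus-rename approach is an assumption about how "atomically" is achieved.

```python
import os
import tempfile


def render(metrics):
    """Render {name: value} as Prometheus text exposition lines."""
    return "".join(f"{name} {value}\n" for name, value in sorted(metrics.items()))


def collect(sources):
    """Run every source; a failure sets <name>_scrape_up to 0 instead of raising."""
    metrics = {}
    for name, fetch in sources.items():
        try:
            metrics.update(fetch())
            metrics[f"{name}_scrape_up"] = 1
        except Exception as exc:  # fail-soft: log and keep going
            print(f"{name} scrape failed: {exc}")
            metrics[f"{name}_scrape_up"] = 0
    return metrics


def write_atomic(path, text):
    """Write to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        os.chmod(tmp, 0o644)   # mkstemp creates 0600; the reader needs access
        os.replace(tmp, path)  # atomic on the same filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

The temp file lives next to the target so the rename stays on one filesystem, and its `.tmp` suffix keeps node-exporter's textfile collector, which only reads `*.prom`, from picking up a half-written file.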

| Env | Meaning |
| --- | --- |
| CF_API_TOKEN | Cloudflare Analytics:Read token (zones auto-discovered) |
| CF_ZONE_TAGS | optional explicit zone IDs (skip discovery) |
| STRIPE_API_KEY | Stripe restricted read-only key (rk_live_…) |
| PLAUSIBLE_API_KEY / PLAUSIBLE_SITE_IDS | Plausible (deferred) |
| POLL_INTERVAL_SECONDS | default 300 |

Metrics emitted

| Metric | Labels | Source |
| --- | --- | --- |
| cf_requests_total, cf_bytes_total, cf_cached_requests_total, cf_threats_total, cf_encrypted_requests_total, cf_unique_visitors | zone | Cloudflare |
| cf_scrape_up | | Cloudflare |
| stripe_balance_available_cents, stripe_balance_pending_cents | currency | Stripe |
| stripe_active_subscriptions | | Stripe |
| stripe_mrr_cents | currency | Stripe (monthly-normalized) |
| stripe_charges_today | status | Stripe |
| stripe_charge_volume_today_cents | currency | Stripe |
| stripe_new_customers_today, stripe_open_disputes | | Stripe |
| stripe_scrape_up | | Stripe |
| plausible_*, plausible_scrape_up | site | Plausible (when on) |
| cloud_poller_last_run_timestamp_seconds | | poller heartbeat |

Cloudflare GraphQL gotchas (already fixed in code): the date variable must be typed Date! (not String!), and orderBy:[date_DESC] is rejected — pin the day with filter:{date:$since} + limit:1. A GraphQL error returns HTTP 200 with {"errors":[…]}; the poller surfaces it as cf_scrape_up=0.
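The two gotchas translate into a query shape and a success check. The sketch below is illustrative: the dataset and field names (`httpRequests1dGroups`, `sum { requests bytes }`) and the helper names are assumptions, while the `Date!` typing, the `filter:{date:$since}` + `limit:1` pinning and the HTTP-200-with-errors behaviour come from the text.

```python
import json
from datetime import date

# Dataset and selected fields are assumptions for illustration; the variable
# typing (Date!, not String!) and filter + limit shape follow the text.
QUERY = """
query ($zone: String!, $since: Date!) {
  viewer {
    zones(filter: {zoneTag: $zone}) {
      httpRequests1dGroups(limit: 1, filter: {date: $since}) {
        sum { requests bytes }
      }
    }
  }
}
"""


def build_body(zone_tag, since):
    """Serialize the request body; `since` is a datetime.date (YYYY-MM-DD on the wire)."""
    variables = {"zone": zone_tag, "since": since.isoformat()}
    return json.dumps({"query": QUERY, "variables": variables})


def scrape_up(http_status, payload):
    """Return 1 only when the call succeeded at both the HTTP and GraphQL level."""
    if http_status != 200:
        return 0
    if payload.get("errors"):  # GraphQL failure hidden behind HTTP 200
        return 0
    return 1 if payload.get("data") is not None else 0
```

Checking `payload["errors"]` before trusting `data` is what turns a silent GraphQL failure into cf_scrape_up=0.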

Stripe specifics

  • Auth: Authorization: Bearer <rk_live_…>.
  • Cursor pagination capped at 20 pages (runaway guard); truncation flagged via stripe_*_truncated.
  • MRR normalization: day×365/12, week×52/12, month×1, year÷12, divided by interval_count.
  • Amounts in cents — divide by 100 in dashboards (unit currencyBRL).
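The normalization factors and the pagination guard above can be written down directly. This is a sketch under stated assumptions: `monthly_cents`, `paginate` and the `fetch_page` callable are hypothetical names, and the page shape (`data`, `has_more`, last item `id` as the next cursor) follows Stripe's documented list format rather than the poller's actual code.

```python
MAX_PAGES = 20  # runaway guard from the text

# Factor that converts one billing interval into months.
PER_MONTH = {"day": 365 / 12, "week": 52 / 12, "month": 1.0, "year": 1 / 12}


def monthly_cents(unit_amount, interval, interval_count=1, quantity=1):
    """Normalize a recurring price (in cents) to a monthly amount."""
    return unit_amount * quantity * PER_MONTH[interval] / interval_count


def paginate(fetch_page, max_pages=MAX_PAGES):
    """Collect items via cursor pagination; return (items, truncated).

    fetch_page(cursor) is a caller-supplied function returning a dict with
    "data" (list of objects carrying an "id") and "has_more".
    """
    items, cursor = [], None
    for _ in range(max_pages):
        page = fetch_page(cursor)
        items.extend(page["data"])
        if not page.get("has_more") or not page["data"]:
            return items, False
        cursor = page["data"][-1]["id"]  # next page starts after the last id
    return items, True  # cap reached with more pages pending
```

For example, a yearly plan of 12000 cents contributes about 1000 cents of MRR, and a quarterly plan (`interval="month"`, `interval_count=3`) of 3000 cents also contributes 1000. The `truncated` flag is what a `stripe_*_truncated` gauge would report.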

7. Loki / Tempo / Vector / blackbox

  • Loki (loki/loki-config.yml): single-binary, filesystem, auth_enabled:false (private network), retention 720h (30d), compactor with retention enabled.
  • Tempo (tempo/tempo.yml): OTLP grpc/http receivers, local storage, 168h (7d) retention, metrics-generator (service-graphs, span-metrics) remote-writing to Prometheus.
  • Vector (vector/vector.yaml): http_server source on :9009 (no source-level auth — Firehose uses X-Amz-Firehose-Access-Key, enforced at Caddy), remap transform tags auth/security events, Loki sink with stack/source_app/security labels.
  • blackbox (blackbox/blackbox.yml): http_2xx (strict: 2xx/3xx/401/403) and http_edge_up (also 4xx, for non-public infra like the F2 cell).
  • cAdvisor + container names: this host runs Docker with the containerd snapshotter + cgroup v2, so cAdvisor cannot resolve the friendly name label — its Docker handler needs the legacy layerdb (absent), and on v0.52+ it drops container metrics entirely (failed to identify the read-write layer ID). So we pin cAdvisor v0.49.1 (Docker factory stays unregistered → the raw/systemd factory still exports the docker-<hash>.scope cgroups, keyed by id), and the docker-names sidecar (docker ps over a read-only socket) emits container_name_info{short_id,name} to the textfile. The Infra/USE dashboard joins them: … label_replace(…, "short_id", "$1", "id", ".*docker-(.{12}).*") * on(short_id) group_left(name) container_name_info.

Gotcha: don't "upgrade" cAdvisor to fix names here — v0.52+ makes it worse (drops the container series). The join is the fix.
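To make the join concrete, here is the same logic outside PromQL. It is only a model of what the dashboard expression does, with made-up sample data: `join_names` is a hypothetical helper, the regex is the one from the label_replace call, and the dict inputs stand in for the cAdvisor series and container_name_info.

```python
import re

# Same pattern as the label_replace in the dashboard query; PromQL anchors
# the regex at both ends, hence fullmatch below.
SHORT_ID = re.compile(r".*docker-(.{12}).*")


def join_names(cadvisor_series, name_info):
    """Map {cgroup_id: value} to {container_name: value} via the 12-char short id.

    name_info is {short_id: name}, i.e. the labels of container_name_info.
    Series without a matching short id are dropped, like an inner join.
    """
    joined = {}
    for cgroup_id, value in cadvisor_series.items():
        match = SHORT_ID.fullmatch(cgroup_id)
        if not match:
            continue  # not a docker-<hash>.scope cgroup
        name = name_info.get(match.group(1))
        if name is None:
            continue  # container unknown to the docker-names sidecar
        joined[name] = value
    return joined
```

A series with id `/system.slice/docker-0123456789abcdef….scope` yields short id `0123456789ab`, which is exactly what `docker ps` prints as the container ID.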


8. Network & security model

flowchart TB
  NET["Internet"] -->|"22 / 80 / 443 only<br/>(Hetzner firewall)"| HOST
  subgraph HOST["Hetzner VM"]
    CADDY["caddy :80/:443 (only published ports)"]
    subgraph DKR["docker net: tlsstress-obs_default"]
      ALL["grafana · prometheus · loki · tempo<br/>vector · blackbox · node-exporter · cadvisor · cloud-poller<br/>(expose only — never published)"]
    end
    CADDY --- ALL
  end
  • No service except Caddy binds a host port. Internal services use expose.
  • .env is chmod 600, gitignored; secrets are dedicated read-only keys.
  • Grafana local-admin login remains as break-glass even with SSO/auto_login on.

9. Status page · multi-vantage synthetics · security & predictive alerting

Public status page: status-gen (exporters/status_gen.py, stdlib) queries Prometheus every STATUS_INTERVAL and writes status/status.json (overall, per-surface uptime_{24h,7d,30d} + latency_ms, and an slo block: target 99.9%, 30d uptime, error-budget remaining). status/index.html renders it client-side; Caddy handle_path /status* serves /srv/status. Routes added to the Caddyfile: /status* (static) and /synthetic-push/* (token-gated → pushgateway).
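For the slo block, the arithmetic behind "error-budget remaining" is short enough to spell out. A 99.9% target over 30 days allows 0.1% of 43,200 minutes, i.e. 43.2 minutes of downtime. The sketch below uses the common definition (remaining = 1 - spent / budget); the function names and JSON keys are assumptions, not necessarily what status_gen.py emits.

```python
WINDOW_MINUTES = 30 * 24 * 60  # 30-day SLO window


def error_budget_remaining(uptime_ratio, target=0.999):
    """Fraction of the error budget left: 1.0 untouched, 0.0 spent, negative overspent."""
    budget = 1.0 - target        # allowed failure ratio
    spent = 1.0 - uptime_ratio   # observed failure ratio
    return 1.0 - spent / budget


def slo_block(uptime_30d, target=0.999):
    """Build an slo block; key names here are hypothetical."""
    return {
        "target": target,
        "uptime_30d": uptime_30d,
        "budget_minutes": round(WINDOW_MINUTES * (1.0 - target), 1),
        "error_budget_remaining": round(error_budget_remaining(uptime_30d, target), 4),
    }
```

With a measured 30d uptime of 99.95%, half of the budget (about 21.6 minutes) is still available.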

Multi-vantage synthetics: three independent vantages, two metric families.

  • journey_* (box k6 journey): journey_step_up{step,vantage}, journey_step_challenged{step,vantage}, journey_step_duration_seconds, journey_up, journey_total_duration_seconds, journey_last_run_timestamp_seconds. The k6-journey service runs synthetics/journey.js every JOURNEY_INTERVAL and pushes to the internal pushgateway (group job=synthetic-journey/vantage=<v>).
  • synthetic_probe_* (external vantages): _success, _challenged, _duration_seconds, _status_code, _last_run_timestamp_seconds, labelled by region+target. Pushed via Caddy /synthetic-push (Bearer $SYNTHETIC_PUSH_TOKEN) → pushgateway; Prometheus scrapes it with honor_labels: true. Sources: synthetic-multivantage.yml (Brazil self-hosted runners) and the Cloudflare edge worker (synthetics/edge-worker/).
  • Challenge awareness: datacenter vantages get a CF managed challenge (403) on CF-fronted surfaces → recorded as *_challenged=1 (not down). X-Tls-Synthetic: $SYNTHETIC_BYPASS_TOKEN + a CF WAF allow-rule lets a vantage reach the origin.
  • Dashboard synthetics-vantages; alerts JourneyStepDown, JourneyVantageChallenged, JourneyStale, EdgeVantageDown.
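The challenge-aware classification and the pushed payload can be sketched as follows. Several details are assumptions and labelled as such: detecting the challenge through Cloudflare's `cf-mitigated: challenge` response header, treating 2xx/3xx as success, and putting `region` in the pushgateway grouping path rather than in the sample labels. The helper names are hypothetical.

```python
def classify(status_code, cf_mitigated=None):
    """Return (success, challenged) as 0/1 ints for one probe result.

    Assumption: a managed challenge is a 403 carrying `cf-mitigated: challenge`;
    the real probes may key on a different signal.
    """
    challenged = status_code == 403 and cf_mitigated == "challenge"
    success = 200 <= status_code < 400  # assumed success range
    return int(success), int(challenged)


def render_probe(target, status_code, duration_s, now, cf_mitigated=None):
    """Render one probe as Prometheus text exposition for a pushgateway POST body."""
    success, challenged = classify(status_code, cf_mitigated)
    samples = {
        "success": success,
        "challenged": challenged,
        "duration_seconds": duration_s,
        "status_code": status_code,
        "last_run_timestamp_seconds": now,
    }
    return "".join(
        f'synthetic_probe_{name}{{target="{target}"}} {value}\n'
        for name, value in samples.items()
    )


def grouping_path(region, job="synthetic-probe"):
    """Pushgateway grouping key path; the job name is an assumed placeholder."""
    return f"/metrics/job/{job}/region/{region}"
```

A challenged probe therefore reports success=0 and challenged=1, which lets an alert such as EdgeVantageDown exclude it instead of paging on it.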

Security/log alerting (Loki ruler): the ruler block in loki-config.yml uses local storage at /etc/loki/rules, alertmanager_url: http://alertmanager:9093 and enable_alertmanager_v2. Rules live in loki/rules/fake/security.yml (tenant fake): AuthFailureSpike/Flood, CloudTrailRootActivity, CloudTrailUnauthorizedBurst, CloudTrailSensitiveChange. They fire to the same Slack receiver as Prometheus.

Predictive alerts: group predictive in prometheus/alerts.yml holds HostDiskWillFillSoon / HostMemoryWillExhaust (predict_linear crossing 0 within the horizon) + TLSCertRenewalDue (<30d). TLS posture dashboard tls-posture (cert days-to-expiry, negotiated probe_tls_version_info, chain).
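What "predict_linear crossing 0 within the horizon" means can be shown with a plain least-squares fit. This is a simplified model of the PromQL function, not its implementation: it treats the last sample's timestamp as "now", whereas Prometheus extrapolates from the evaluation time. The function names are illustrative.

```python
def predict_linear(samples, horizon_s):
    """Least-squares extrapolation: value expected horizon_s after the last sample.

    samples is a list of (timestamp_seconds, value) pairs, oldest first.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples share one timestamp")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + horizon_s)


def will_exhaust(samples, horizon_s):
    """True when the fitted line reaches zero or below inside the horizon."""
    return predict_linear(samples, horizon_s) <= 0
```

With free space falling from 100 to 80 units over two minutes, the fit predicts about 30 units five minutes later and a negative value ten minutes later, so a ten-minute horizon would fire while a five-minute one would not.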

Deploy markers: .github/actions/grafana-annotate posts a Grafana annotation (tags deploy,<app>) at the end of customer-app-deploy.yml / admin-console-deploy.yml; token secrets.GRAFANA_ANNOTATION_TOKEN (Editor SA ci-deploy-annotator). No-ops without the token.
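The annotation call itself is a single POST to Grafana's annotations endpoint. The sketch below only builds the request, as an illustration of the payload; the real action may well use curl, and `build_annotation_request` plus the example text are assumptions. Under the sub-path setup the API lives at …/grafana/api/annotations.

```python
import json
import time
import urllib.request


def build_annotation_request(base_url, token, app, text, now_ms=None):
    """Build the POST for a deploy annotation, or None when no token is set."""
    if not token:
        return None  # mirrors the "no-ops without the token" behaviour
    payload = {
        "time": now_ms if now_ms is not None else int(time.time() * 1000),  # epoch ms
        "tags": ["deploy", app],
        "text": text,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/annotations",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending it is then `urllib.request.urlopen(request, timeout=10)`, guarded by the `None` check so a missing secret never fails the deploy workflow.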

10. App DB readiness · AppDBDown · self-healing · multi-DB backup

Born from the 2026-06-16 admin "DB indisponível" ("DB unavailable") incident: the RDS managed master-secret auto-rotation (every 7d) changed the tlsstress_admin password; the admin-console connected as that master via a static secret → auth failed. It was invisible for ~33h because /api/health returns 200 without touching the DB and the app degraded as a graceful 200. Fixes, all live/shipping:

  • Dedicated non-master role: admin now connects as octopus_admin_app (owns the octopus_admin tables), so master rotation can never break it again. (customer-app already uses the dedicated tlsstress_app role.)
  • Deep readiness /api/ready: admin-console (checks admin_db + token_economy_db) and customer-app run SELECT 1 per pool → HTTP 503 if any DB is unreachable (vs the shallow /api/health). On the admin it must be in proxy.ts PUBLIC_PATHS (the proxy gates all /api/*), else it 401s.
  • Detection: Prometheus job blackbox-ready probes both /api/ready URLs; alert AppDBDown = probe_success{job="blackbox-ready"} == 0 (SiteDown is scoped to blackbox-(http|edge) so they don't overlap). blackbox-http also now probes obs.tlsstress.art + status.tlsstress.art.
  • Self-healing: obs-db-selfheal Lambda (observability/cloud/selfheal-lambda/, zip, no VPC) on an EventBridge schedule probes each /api/ready; on 503 (app up, DB down) it triggers an App Runner deployment (re-fetches the DB secret + rebuilds the pool), cooldown-guarded against loops. Heals only on 503 (404/403/timeout → no-op). IAM: apprunner:StartDeployment|ListOperations|DescribeService.
  • Multi-DB RDS backup: backup_aws_r2.py dumps every DB in DB_BACKUP_SECRETS (the managed master → postgres, and admin-database-url → octopus_admin, dumped as its owner role) to aws/<date>/rds-<db>-<HHMMSSZ>.sql.gz. Previously only postgres was dumped, so octopus_admin (admin data) had no backup.
  • Edge alert calibration: EdgeVantageDown excludes managed-challenge results (challenged=1) and the WAF-locked admin surface → no more chronic false-positives.
  • Stripe zero-explicit: cloud_poller.py emits stripe_mrr_cents / stripe_charge_volume_today_cents = 0 pre-revenue (panels show 0, not "No data"); the gated-off Plausible panels were removed from the dashboards.
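The self-healing rule in the list above ("heal only on 503, cooldown-guarded") reduces to a small decision function. This is a sketch of that rule, not the Lambda's code: the 900-second cooldown, the `state` dict and the injected `probe` / `deploy` callables are assumptions, and in the real Lambda `deploy` would wrap the App Runner StartDeployment call.

```python
COOLDOWN_S = 900  # assumed cooldown; the text does not state the value


def should_heal(status_code, last_heal_ts, now_ts, cooldown_s=COOLDOWN_S):
    """Heal only when the app answers 503 and the cooldown has elapsed.

    status_code is None for a timeout; 404/403/timeout are deliberate no-ops.
    """
    if status_code != 503:
        return False
    if last_heal_ts is not None and now_ts - last_heal_ts < cooldown_s:
        return False
    return True


def heal_once(probe, deploy, state, now_ts, cooldown_s=COOLDOWN_S):
    """Probe /api/ready once and trigger a redeploy if the rule allows it.

    probe() returns an HTTP status or None; deploy() starts the deployment;
    state is a mutable dict persisted between invocations by the caller.
    """
    status = probe()
    if not should_heal(status, state.get("last_heal_ts"), now_ts, cooldown_s):
        return False
    deploy()
    state["last_heal_ts"] = now_ts
    return True
```

Restricting the trigger to 503 keeps the Lambda from redeploying an app that is down for unrelated reasons (WAF block, missing route, network timeout), where a fresh deployment would not help.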