LLD: Cloud Observability Platform
Low-Level Design. Per-component implementation: images, ports, volumes, config, env. Source of truth:
observability/cloud/. Architecture & rationale: HLD. Operations: RUNBOOK.
1. Host
| Property | Value |
|---|---|
| Provider | Hetzner Cloud |
| Type | cx23 (cx22 is deprecated) |
| Region | hel1 (Helsinki) |
| Public IP | 89.167.3.1 |
| Hostname | tlsstress-obs-hel1 |
| OS | Ubuntu (GNU userland) |
| Stack path | /opt/obs |
| Engine | Docker + Compose v2, project tlsstress-obs |
| SSH | ssh -i ~/.ssh/tlsstress_f2_hetzner root@89.167.3.1 (always -i) |
| Firewall | Hetzner cloud firewall (Terraform) — inbound 22/80/443 only |
| DNS | obs.tlsstress.art → 89.167.3.1, Cloudflare DNS-only (grey-cloud) |
Grey-cloud is mandatory: Caddy's ACME HTTP-01 challenge must reach the origin directly. Orange-cloud (proxied) breaks issuance unless you switch to the Caddy Cloudflare-DNS plugin.
2. Service inventory
| Service | Image | Host port | Internal | Volumes | Purpose |
|---|---|---|---|---|---|
| caddy | caddy:2.8-alpine | 80, 443 | - | Caddyfile, landing/, caddy_data, caddy_config | TLS + reverse proxy + static landing |
| grafana | grafana/grafana:11.3.1 | - | 3000 | provisioning/, dashboards/, grafana_data | UI, alerting |
| prometheus | prom/prometheus:v2.55.1 | - | 9090 | prometheus/, prometheus_data | metrics TSDB + rules |
| loki | grafana/loki:3.2.1 | - | 3100 | loki-config.yml, loki_data | logs |
| tempo | grafana/tempo:2.6.1 | - | 3200, 4317, 4318 | tempo.yml, tempo_data | traces |
| vector | timberio/vector:0.42.0-alpine | - | 8686, 9009 | vector.yaml | log intake → Loki |
| blackbox | prom/blackbox-exporter:v0.25.0 | - | 9115 | blackbox.yml | synthetics |
| node-exporter | prom/node-exporter:v1.8.2 | - | 9100 | /:/host:ro, textfile | host metrics + textfile |
| cadvisor | gcr.io/cadvisor/cadvisor:v0.49.1 | - | 8080 | host ro mounts | container metrics (by cgroup id) |
| cloud-poller | python:3.12-alpine | - | - | cloud_poller.py, textfile | CF/Stripe/Plausible → textfile |
| alertmanager | prom/alertmanager:v0.27.0 | - | 9093 | alertmanager/, alertmanager_data | alert routing → receivers (webhook/deadman) |
| pyroscope | grafana/pyroscope:1.9.2 | - | 4040 | pyroscope_data | continuous profiling store (4th pillar) |
| alloy | grafana/alloy:v1.5.1 | - | 12345, 12347 | alloy/config.alloy | pprof→Pyroscope + Faro RUM receiver |
| logs-shipper | python:3.12-alpine | - | - | logs_shipper.py, logs_ckpt, pip_cache | CloudWatch Logs + CloudTrail → Loki ($0 pull) |
| aws-cron-monitor | python:3.12-alpine | - | - | aws_cron_monitor.py, aws-cron-checks.json | CloudWatch → healthchecks.io per cron Lambda |
| docker-names | docker:28-cli | - | - | docker_names.sh, docker.sock (ro), textfile | container short-id → name textfile metric |
| pushgateway | prom/pushgateway:v1.10.0 | - | 9091 | pushgateway_data | sink for external-vantage synthetic pushes (scraped honor_labels) |
| status-gen | python:3.12-alpine | - | - | status_gen.py, status/ | writes status.json for the public /status page (+ SLA) |
| k6-journey | grafana/k6:0.54.0 | - | - | synthetics/journey.js | box-vantage multi-step journey → pushgateway (journey_*) |
| backup-r2 | python:3.12-alpine | - | - | backup_r2.py, state vols (ro), textfile, pip_cache | daily off-box backup → Cloudflare R2 (obs/CF/Stripe/AWS/TBI/GitHub/Auth0) + backup_r2_* metrics |
Off-box backup (R2), see DR-RESTORE-RUNBOOK: the backup-r2 service tars the state volumes, exports Cloudflare/Stripe/Auth0 config, and mirrors the S3 TBI images + GitHub bundles to r2://tlsstress-obs-backups/daily, with size-aware retention (≤9 GB, never exceeds the 10 GB free tier) and a healthchecks.io dead-man's switch. It emits backup_r2_* Prometheus metrics (freshness, per-source bytes/objects, per-source success + last-success freshness so that a single failing source, e.g. GitHub, is non-silent, total usage, and the backup_r2_bucket_public + backup_r2_bucket_lock_enabled security gauges) → dashboard backups-dr + alert group backup_dr (incl. R2SourceBackupFailed/R2SourceBackupStale). The RDS pg_dump is the one producer that can't run here (RDS is private), so it runs in a scheduled in-VPC container Lambda, obs-rds-backup-r2 (backup-rds-lambda/). Creds are dedicated read-only (R2 token, obs-backup-ro IAM, GitHub PAT, Stripe RO key, Auth0 M2M); no secret VALUES go to R2.

Anti-ransomware (immutability): ensure_protection() applies an R2 Bucket Lock retention rule via the Cloudflare API (R2's S3 API implements neither versioning nor Object Lock) so every object is immutable (no delete, no overwrite) for LOCK_RETENTION_DAYS (default 7, kept < KEEP_DAILY so prune only removes unlocked objects). A stolen upload credential therefore can't wipe or encrypt recent DR copies (verified: delete/overwrite → ObjectLockedByBucketPolicy). All writes are write-once (dated github bundles + HEAD-skip on same-day re-runs); R2BucketLockDisabled alerts on tampering.
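
The relation between the lock window and daily pruning can be sketched as follows. This is a minimal illustration, not the code in backup_r2.py: the function name `prunable_days` and the value `KEEP_DAILY = 14` are assumptions (the text only states that LOCK_RETENTION_DAYS defaults to 7 and stays below KEEP_DAILY), and size-aware retention is not modelled.

```python
from datetime import date, timedelta

LOCK_RETENTION_DAYS = 7  # default stated in the text
KEEP_DAILY = 14          # hypothetical value; it only has to exceed the lock window


def prunable_days(backup_days, today, keep_daily=KEEP_DAILY, lock_days=LOCK_RETENTION_DAYS):
    """Return the daily backups that are old enough to prune.

    Because lock_days < keep_daily, every day returned here is already
    outside the Bucket Lock window, so its deletion is not rejected.
    """
    if lock_days >= keep_daily:
        raise ValueError("lock window must be shorter than the retention window")
    cutoff = today - timedelta(days=keep_daily)
    return sorted(d for d in backup_days if d < cutoff)


today = date(2026, 1, 31)
days = [today - timedelta(days=n) for n in range(20)]
old = prunable_days(days, today)
```

If the lock window were as long as the retention window, prune would try to delete objects that are still locked, which is why the sketch refuses that configuration.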
Named volumes: caddy_data, caddy_config, grafana_data, prometheus_data,
pushgateway_data, loki_data, tempo_data, textfile, alertmanager_data,
pyroscope_data, logs_ckpt, pip_cache.
Datasources now also include Pyroscope (uid pyroscope) and
Alertmanager (uid alertmanager). Alerting: Prometheus rules
(alerts.yml + slo-rules.yml) → Alertmanager → receivers; webhook/dead-man's
URLs are gated url_files under alertmanager/secrets/ (gitignored).
Logs ($0, no Firehose): logs-shipper pulls CloudWatch FilterLogEvents
(covered by CloudWatchReadOnlyAccess) + CloudTrail LookupEvents
(AWSCloudTrail_ReadOnlyAccess) and pushes to Loki, labelling auth events
security="auth". Profiling: Alloy scrapes the stack's /debug/pprof →
Pyroscope (dogfood; same pattern instruments the SaaS apps). RUM: Alloy
faro.receiver on /faro (CORS-allowed for the app domains) → loki.process →
Loki. The loki.process stage promotes Faro app_name to a Loki label app
(splits customer-app/admin-console/marketing) and emits Web Vitals as Prometheus
histograms — exposed on Alloy /metrics prefixed loki_process_custom_:
loki_process_custom_faro_web_vitals_{lcp_ms,ttfb_ms,cls}_bucket +
loki_process_custom_faro_exceptions_total (labelled by app). Dashboard
rum-frontend and the rum alert group use those names.
3. Caddy (caddy/Caddyfile)
Routes on {$OBS_DOMAIN}:
| Match | Upstream | Notes |
|---|---|---|
| /ingest* | vector:9009 | Firehose / Logpush log delivery |
| /faro* | alloy:12347 | RUM (Faro Web SDK), prefix stripped |
| /otlp* | tempo:4318 | traces (OTLP-HTTP), token-gated ($OTLP_TOKEN), prefix stripped |
| /grafana* | grafana:3000 | prefix preserved (Grafana serve_from_sub_path) |
| / (fallback) | file_server /srv/landing | branded landing, try_files … /index.html |
Security headers: HSTS (preload), X-Content-Type-Options, Referrer-Policy,
X-Frame-Options: SAMEORIGIN, -Server. Compression: zstd gzip.
Gotcha: use handle /grafana* (not handle_path). Grafana with serve_from_sub_path=true and root_url=…/grafana/ expects the /grafana prefix on incoming requests; stripping it breaks assets and redirects.
4. Grafana (env in docker-compose.yml)
| Env | Value | Why |
|---|---|---|
| GF_SERVER_ROOT_URL | https://obs.tlsstress.art/grafana/ | correct link/asset generation under subpath |
| GF_SERVER_SERVE_FROM_SUB_PATH | true | serve under /grafana |
| GF_SERVER_DOMAIN | obs.tlsstress.art | - |
| GF_SERVER_ENFORCE_DOMAIN | false | enforce+no-domain = 301 loop behind proxy |
| GF_AUTH_GENERIC_OAUTH_* | Auth0 (gated) | see SSO-AUTH0 |
| GF_AWS_default_* | RO key (gated) | CloudWatch datasource auth |
| GF_USERS_ALLOW_SIGN_UP | false | no self-registration |
Datasources (grafana/provisioning/datasources/datasources.yml): Prometheus
(default, uid prometheus), Loki (loki), Tempo (tempo), CloudWatch
(cloudwatch). Cross-links wired: Prometheus exemplars → Tempo, Loki derived
fields → Tempo, Tempo → Loki/Prometheus (traces↔logs↔metrics).
CloudWatch auth: authType: keys with secureJsonData.accessKey/secretKey
interpolated from env $AWS_RO_ACCESS_KEY_ID/$AWS_RO_SECRET_ACCESS_KEY and
defaultRegion: $AWS_REGION. Those exact env names are exposed to the Grafana
container in docker-compose.yml. The IAM user is obs-cloudwatch-ro
(CloudWatchReadOnlyAccess).
Gotcha (datasource): Grafana provisioning interpolates $VAR/${VAR} but not the bash ${VAR:-default} form (→ "missing default region"); and authType: keys needs the keys in secureJsonData. The legacy GF_AWS_default_* env is not used by this path.

Gotcha (dashboard panels): CloudWatch panel targets must carry the full builder schema: metricQueryType: 0, metricEditorMode: 0, queryMode: "Metrics" and an explicit region (not "default"). Omit them and Grafana 11 falls into Metrics-Insights (SQL) mode with an empty expression → every panel renders "No data" even though the datasource health is OK. See grafana/dashboards/aws-infra.json for the canonical target.
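
A small lint over a dashboard JSON can catch the panel gotcha before deploy. The sketch below is not part of the repo. It assumes targets (or their panel) declare the datasource as an object with `"type": "cloudwatch"`, and the helper names `iter_panels` and `lint_cloudwatch_targets` are hypothetical.

```python
REQUIRED = {"metricQueryType": 0, "metricEditorMode": 0, "queryMode": "Metrics"}


def iter_panels(panels):
    """Yield every panel, descending into row panels that nest their children."""
    for panel in panels:
        yield panel
        yield from iter_panels(panel.get("panels", []))


def lint_cloudwatch_targets(dashboard):
    """Return a list of problems found in the CloudWatch targets of a dashboard dict."""
    problems = []
    for panel in iter_panels(dashboard.get("panels", [])):
        for target in panel.get("targets", []):
            ds = target.get("datasource") or panel.get("datasource") or {}
            if not isinstance(ds, dict) or ds.get("type") != "cloudwatch":
                continue  # not a CloudWatch target (or a legacy string datasource)
            where = f"panel {panel.get('title', '?')!r} ref {target.get('refId', '?')}"
            for key, want in REQUIRED.items():
                if target.get(key) != want:
                    problems.append(f"{where}: {key} should be {want!r}")
            if target.get("region") in (None, "", "default"):
                problems.append(f"{where}: region must be explicit")
    return problems
```

Loading a file with `json.load` and failing CI when the returned list is non-empty would be one way to use it.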
Dashboards (6) auto-provisioned from grafana/dashboards/*.json via
grafana/provisioning/dashboards/dashboards.yml: cloud-overview,
infra-host-use, slo-synthetics, business-revenue, aws-infra,
rum-frontend (Loki/LogQL — Core Web Vitals p75, JS errors, sessions).
5. Prometheus (prometheus/prometheus.yml)
- scrape_interval: 30s, external label stack=tlsstress-cloud-obs.
- Retention: --storage.tsdb.retention.time=${PROM_RETENTION:-30d}.
- --web.enable-remote-write-receiver: receives Tempo span-metrics.
- Jobs: prometheus, node, cadvisor, blackbox-http (app/admin/marketing), blackbox-edge (F2 cell, http_edge_up module accepting 4xx).
- Rules: prometheus/alerts.yml, with groups synthetics, host, data_sources, business.
6. cloud-poller (exporters/cloud_poller.py)
Stdlib-only Python loop (no third-party deps). Each source is fail-soft:
errors are caught, logged, and flagged via a *_scrape_up gauge; the loop never
dies. Writes /textfile/cloud.prom atomically; node-exporter exposes it.
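
The fail-soft loop and the atomic write described above can be sketched with the standard library only. This is an illustration under assumptions, not the poller's actual code: `write_atomic` and `collect` are hypothetical names, and each source is modelled as a function returning ready-made exposition lines.

```python
import os
import tempfile


def write_atomic(path, text):
    """Write text to path so that readers never observe a half-written file."""
    directory = os.path.dirname(path) or "."
    # The temp file does not end in .prom, so the textfile collector ignores it.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fh:
            fh.write(text)
        os.chmod(tmp, 0o644)   # mkstemp creates the file as 0600
        os.replace(tmp, path)  # atomic rename within the same filesystem
    except BaseException:
        os.unlink(tmp)
        raise


def collect(sources):
    """Run each (name, fn) source; a failure only sets that source's *_scrape_up to 0."""
    lines = []
    for name, fn in sources:
        try:
            lines.extend(fn())
            lines.append(f"{name}_scrape_up 1")
        except Exception as exc:  # fail-soft: log and carry on with the next source
            print(f"{name} scrape failed: {exc}")
            lines.append(f"{name}_scrape_up 0")
    return "\n".join(lines) + "\n"
```

Writing to a temporary file in the same directory and renaming it is what makes the update atomic: node-exporter sees either the old cloud.prom or the new one, never a partial file.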
| Env | Meaning |
|---|---|
| CF_API_TOKEN | Cloudflare Analytics:Read token (zones auto-discovered) |
| CF_ZONE_TAGS | optional explicit zone IDs (skip discovery) |
| STRIPE_API_KEY | Stripe restricted read-only key (rk_live_…) |
| PLAUSIBLE_API_KEY / PLAUSIBLE_SITE_IDS | Plausible (deferred) |
| POLL_INTERVAL_SECONDS | default 300 |
Metrics emitted
| Metric | Labels | Source |
|---|---|---|
| cf_requests_total, cf_bytes_total, cf_cached_requests_total, cf_threats_total, cf_encrypted_requests_total, cf_unique_visitors | zone | Cloudflare |
| cf_scrape_up | - | Cloudflare |
| stripe_balance_available_cents, stripe_balance_pending_cents | currency | Stripe |
| stripe_active_subscriptions | - | Stripe |
| stripe_mrr_cents | currency | Stripe (monthly-normalized) |
| stripe_charges_today | status | Stripe |
| stripe_charge_volume_today_cents | currency | Stripe |
| stripe_new_customers_today, stripe_open_disputes | - | Stripe |
| stripe_scrape_up | - | Stripe |
| plausible_*, plausible_scrape_up | site | Plausible (when on) |
| cloud_poller_last_run_timestamp_seconds | - | poller heartbeat |
Cloudflare GraphQL gotchas (already fixed in code): the date variable must be typed Date! (not String!), and orderBy:[date_DESC] is rejected, so pin the day with filter:{date:$since} + limit:1. A GraphQL error returns HTTP 200 with {"errors":[…]}; the poller surfaces it as cf_scrape_up=0.
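
Because a failed query still arrives as HTTP 200, the status code alone cannot drive cf_scrape_up. A minimal sketch of the check, assuming the response body is already read into a string (the function name `graphql_result` is hypothetical):

```python
import json


def graphql_result(status, body):
    """Return (scrape_up, data) for a GraphQL HTTP response.

    A 200 status is not sufficient: GraphQL reports query errors in the body.
    """
    if status != 200:
        return 0, None
    try:
        payload = json.loads(body)
    except ValueError:
        return 0, None
    if not isinstance(payload, dict) or payload.get("errors") or payload.get("data") is None:
        return 0, None
    return 1, payload["data"]
```

A response with `"errors": null` and a populated `data` object counts as healthy here, since `payload.get("errors")` is then falsy.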
Stripe specifics
- Auth: Authorization: Bearer <rk_live_…>.
- Cursor pagination capped at 20 pages (runaway guard); truncation flagged via stripe_*_truncated.
- MRR normalization: day×365/12, week×52/12, month×1, year÷12, divided by interval_count.
- Amounts in cents: divide by 100 in dashboards (unit currencyBRL).
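
The MRR normalization rule can be written out as a short sketch. The multipliers are the ones listed above; the flat input shape (dicts with `unit_amount`, `quantity`, `interval`, `interval_count`, `currency`) and the function names are assumptions made for illustration, not the poller's actual data model.

```python
from collections import defaultdict

# Factors that convert one billing interval into months, as listed above.
PER_MONTH = {"day": 365 / 12, "week": 52 / 12, "month": 1.0, "year": 1 / 12}


def monthly_cents(unit_amount, quantity, interval, interval_count=1):
    """Normalize one subscription item to cents per month."""
    return unit_amount * quantity * PER_MONTH[interval] / interval_count


def mrr_by_currency(items):
    """Sum the monthly-normalized amounts per currency, rounded to whole cents."""
    totals = defaultdict(float)
    for item in items:
        totals[item["currency"]] += monthly_cents(
            item["unit_amount"],
            item.get("quantity", 1),
            item["interval"],
            item.get("interval_count", 1),
        )
    return {currency: round(total) for currency, total in totals.items()}
```

For example, a yearly plan of 120000 cents contributes 10000 cents per month, and a plan billed 15000 cents every 3 months contributes 5000.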
7. Loki / Tempo / Vector / blackbox
- Loki (loki/loki-config.yml): single-binary, filesystem, auth_enabled:false (private network), retention 720h (30d), compactor with retention enabled.
- Tempo (tempo/tempo.yml): OTLP grpc/http receivers, local storage, 168h (7d) retention, metrics-generator (service-graphs, span-metrics) remote-writing to Prometheus.
- Vector (vector/vector.yaml): http_server source on :9009 (no source-level auth; Firehose uses X-Amz-Firehose-Access-Key, enforced at Caddy), remap transform tags auth/security events, Loki sink with stack/source_app/security labels.
- blackbox (blackbox/blackbox.yml): http_2xx (strict: 2xx/3xx/401/403) and http_edge_up (also 4xx, for non-public infra like the F2 cell).
- cAdvisor + container names: this host runs Docker with the containerd snapshotter + cgroup v2, so cAdvisor cannot resolve the friendly name label: its Docker handler needs the legacy layerdb (absent), and on v0.52+ it drops container metrics entirely (failed to identify the read-write layer ID). So we pin cAdvisor v0.49.1 (Docker factory stays unregistered → the raw/systemd factory still exports the docker-<hash>.scope cgroups, keyed by id), and the docker-names sidecar (docker ps over a read-only socket) emits container_name_info{short_id,name} to the textfile. The Infra/USE dashboard joins them: … label_replace(…, "short_id", "$1", "id", ".*docker-(.{12}).*") * on(short_id) group_left(name) container_name_info.
Gotcha: don't "upgrade" cAdvisor to fix names here — v0.52+ makes it worse (drops the container series). The join is the fix.
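
What the label_replace + join does can be reproduced offline, which helps when checking the regex against real cgroup ids. The sketch below mirrors the PromQL with Python's `re` (fully anchored, like Prometheus). The function names and the sample id are made up for illustration.

```python
import re

# Same pattern as in the dashboard query; fullmatch() anchors it like Prometheus does.
SHORT_ID = re.compile(r".*docker-(.{12}).*")


def short_id(cgroup_id):
    """Extract the 12-character container short id from a cAdvisor cgroup id."""
    match = SHORT_ID.fullmatch(cgroup_id)
    return match.group(1) if match else None


def join_names(series_ids, names):
    """Mimic `* on(short_id) group_left(name)`: keep only ids that have a known name."""
    joined = {}
    for cgroup_id in series_ids:
        sid = short_id(cgroup_id)
        if sid in names:
            joined[cgroup_id] = names[sid]
    return joined


example_id = "/system.slice/docker-" + "0123456789ab" + "cdef" * 13 + ".scope"
```

Series whose id carries no docker-<hash> segment, or whose short id is missing from container_name_info, drop out of the join, just as they do in the PromQL expression.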
8. Network & security model
flowchart TB
NET["Internet"] -->|"22 / 80 / 443 only<br/>(Hetzner firewall)"| HOST
subgraph HOST["Hetzner VM"]
CADDY["caddy :80/:443 (only published ports)"]
subgraph DKR["docker net: tlsstress-obs_default"]
ALL["grafana · prometheus · loki · tempo<br/>vector · blackbox · node-exporter · cadvisor · cloud-poller<br/>(expose only — never published)"]
end
CADDY --- ALL
end
- No service except Caddy binds a host port. Internal services use expose.
- .env is chmod 600, gitignored; secrets are dedicated read-only keys.
- Grafana local-admin login remains as break-glass even with SSO/auto_login on.
9. Status page · multi-vantage synthetics · security & predictive alerting
Public status page — status-gen (exporters/status_gen.py, stdlib) queries
Prometheus every STATUS_INTERVAL and writes status/status.json
(overall, per-surface uptime_{24h,7d,30d} + latency_ms, and an slo block:
target 99.9%, 30d uptime, error-budget remaining). status/index.html renders it
client-side; Caddy handle_path /status* serves /srv/status. Routes added to the
Caddyfile: /status* (static) and /synthetic-push/* (token-gated → pushgateway).
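
The slo block's "error-budget remaining" figure can be derived from the 30d uptime and the 99.9% target. The text does not spell out the formula status_gen.py uses, so the following is one common definition, given as an assumption with hypothetical function names.

```python
def error_budget_remaining(uptime, target=0.999):
    """Fraction of the error budget still unspent (1.0 = untouched, below 0 = overspent)."""
    allowed = 1.0 - target
    if allowed <= 0:
        raise ValueError("target must be below 100%")
    return 1.0 - (1.0 - uptime) / allowed


def budget_minutes(window_days=30, target=0.999):
    """Total downtime, in minutes, that the target tolerates over the window."""
    return window_days * 24 * 60 * (1.0 - target)
```

With a 99.9% target over 30 days the budget is about 43.2 minutes of downtime, and a measured uptime of 99.95% leaves roughly half of it.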
Multi-vantage synthetics — three independent vantages, two metric families:
- journey_* (box k6 journey): journey_step_up{step,vantage},
journey_step_challenged{step,vantage}, journey_step_duration_seconds,
journey_up, journey_total_duration_seconds, journey_last_run_timestamp_seconds.
The k6-journey service runs synthetics/journey.js every JOURNEY_INTERVAL and
pushes to the internal pushgateway (group job=synthetic-journey/vantage=<v>).
- synthetic_probe_* (external vantages): _success, _challenged,
_duration_seconds, _status_code, _last_run_timestamp_seconds, labelled by
region+target. Pushed via Caddy /synthetic-push (Bearer
$SYNTHETIC_PUSH_TOKEN) → pushgateway; Prometheus scrapes it with
honor_labels: true. Sources: synthetic-multivantage.yml (Brazil self-hosted
runners) and the Cloudflare edge worker (synthetics/edge-worker/).
- Challenge awareness: datacenter vantages get a CF managed challenge (403) on
CF-fronted surfaces → recorded as *_challenged=1 (not down). X-Tls-Synthetic:
$SYNTHETIC_BYPASS_TOKEN + a CF WAF allow-rule lets a vantage reach the origin.
- Dashboard synthetics-vantages; alerts JourneyStepDown, JourneyVantageChallenged,
JourneyStale, EdgeVantageDown.
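
A push to the pushgateway is an HTTP PUT or POST of text-exposition lines to a path that encodes the grouping key, /metrics/job/<job>/<label>/<value>. The sketch below only builds that path and body; it sends nothing. The helper names are hypothetical, label values are not escaped, and how Caddy maps /synthetic-push onto this path is not shown here.

```python
from urllib.parse import quote


def grouping_path(job, **labels):
    """Build the Pushgateway grouping path /metrics/job/<job>/<label>/<value>..."""
    parts = ["metrics", "job", job]
    for name, value in labels.items():
        parts += [name, value]
    for part in parts:
        # Values containing "/" need Pushgateway's base64 form, which this sketch omits.
        if part == "" or "/" in part:
            raise ValueError(f"unsupported path segment: {part!r}")
    return "/" + "/".join(quote(part, safe="") for part in parts)


def exposition(samples):
    """Render {(metric, ((label, value), ...)): number} as text-exposition lines."""
    lines = []
    for (metric, labels), value in samples.items():
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{metric}{{{label_str}}} {value}" if label_str else f"{metric} {value}")
    return "\n".join(lines) + "\n"
```

The vantage label travels in the grouping key rather than in each sample, which matches the group job=synthetic-journey/vantage=<v> described above.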
Security/log alerting (Loki ruler) — loki-config.yml ruler: (storage
local → /etc/loki/rules, alertmanager_url: http://alertmanager:9093,
enable_alertmanager_v2). Rules in loki/rules/fake/security.yml (tenant fake):
AuthFailureSpike/Flood, CloudTrailRootActivity, CloudTrailUnauthorizedBurst,
CloudTrailSensitiveChange. They fire to the same Slack receiver as Prometheus.
Predictive alerts — prometheus/alerts.yml group predictive:
HostDiskWillFillSoon / HostMemoryWillExhaust (predict_linear crossing 0
within the horizon) + TLSCertRenewalDue (<30d). TLS posture dashboard
tls-posture (cert days-to-expiry, negotiated probe_tls_version_info, chain).
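
The idea behind the predict_linear alerts is a least-squares line through recent samples, extrapolated over the horizon. The sketch below illustrates that idea in plain Python. It is not Prometheus' implementation, and for simplicity it extrapolates from the last sample's timestamp rather than from the evaluation time.

```python
def predict_linear(samples, horizon_seconds):
    """Fit a least-squares line to (timestamp, value) samples and evaluate it
    horizon_seconds after the last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + horizon_seconds)


# Free disk bytes shrinking by 1 GB per hour (made-up numbers).
free_bytes = [(0, 10e9), (3600, 9e9), (7200, 8e9)]
will_fill = predict_linear(free_bytes, 10 * 3600) < 0
```

An alert of the HostDiskWillFillSoon kind fires when the extrapolated value crosses 0 within the chosen horizon, as `will_fill` does for the sample data.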
Deploy markers — .github/actions/grafana-annotate posts a Grafana annotation
(tags deploy,<app>) at the end of customer-app-deploy.yml /
admin-console-deploy.yml; token secrets.GRAFANA_ANNOTATION_TOKEN (Editor SA
ci-deploy-annotator). No-ops without the token.
10. App DB readiness · AppDBDown · self-healing · multi-DB backup
Born from the 2026-06-16 admin "DB indisponível" ("DB unavailable") incident: the RDS managed
master-secret auto-rotation (every 7d) changed the tlsstress_admin password; the
admin-console connected as that master via a static secret → auth failed. It was
invisible for ~33h because /api/health returns 200 without touching the DB and
the app degraded as a graceful 200. Fixes, all live/shipping:
- Dedicated non-master role: admin now connects as octopus_admin_app (owns the octopus_admin tables), so master rotation can never break it again. (customer-app already uses the dedicated tlsstress_app role.)
- Deep readiness /api/ready: admin-console (checks admin_db + token_economy_db) and customer-app run SELECT 1 per pool → HTTP 503 if any DB is unreachable (vs the shallow /api/health). On the admin it must be in proxy.ts PUBLIC_PATHS (the proxy gates all /api/*), else it 401s.
- Detection: Prometheus job blackbox-ready probes both /api/ready URLs; alert AppDBDown = probe_success{job="blackbox-ready"} == 0 (SiteDown is scoped to blackbox-(http|edge) so they don't overlap). blackbox-http also now probes obs.tlsstress.art + status.tlsstress.art.
- Self-healing: obs-db-selfheal Lambda (observability/cloud/selfheal-lambda/, zip, no VPC) on an EventBridge schedule probes each /api/ready; on 503 (app up, DB down) it triggers an App Runner deployment (re-fetches the DB secret + rebuilds the pool), cooldown-guarded against loops. Heals only on 503 (404/403/timeout → no-op). IAM: apprunner:StartDeployment|ListOperations|DescribeService.
- Multi-DB RDS backup: backup_aws_r2.py dumps every DB in DB_BACKUP_SECRETS (the managed master → postgres, and admin-database-url → octopus_admin, dumped as its owner role) to aws/<date>/rds-<db>-<HHMMSSZ>.sql.gz. Previously only postgres was dumped, so octopus_admin (admin data) had no backup.
- Edge alert calibration: EdgeVantageDown excludes managed-challenge results (challenged=1) and the WAF-locked admin surface → no more chronic false-positives.
- Stripe zero-explicit: cloud_poller.py emits stripe_mrr_cents/stripe_charge_volume_today_cents = 0 pre-revenue (panels show 0, not "No data"); the gated-off Plausible panels were removed from the dashboards.
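
The self-healing decision reduces to a small pure function: act only on 503, and never while a previous heal is still cooling down. The sketch below is an illustration, not the Lambda's code. The name `should_heal` and the 900-second default cooldown are assumptions.

```python
def should_heal(status, now, last_heal_at, cooldown_seconds=900):
    """Decide whether a /api/ready probe result warrants a redeploy.

    status is the HTTP status code, or None when the probe timed out.
    last_heal_at is the epoch time of the previous heal, or None if there was none.
    """
    if status != 503:
        return False  # 200 is healthy; 404/403/timeout do not indicate a DB outage
    if last_heal_at is not None and now - last_heal_at < cooldown_seconds:
        return False  # still cooling down, avoid a redeploy loop
    return True
```

Keeping the decision free of I/O makes it easy to test separately from the probe and from the App Runner call that would follow a True result.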