HLD — Cloud Observability Platform (obs.tlsstress.art)
High-Level Design. Architecture, data flows, design decisions and the "why". For per-component config see LLD; for operations see the RUNBOOK; for the SSO design see SSO-AUTH0. Canonical decision record: ADR-0106.
1. Purpose & scope
A single pane of glass for the tlsstress.art SaaS cloud surfaces
(app.tlsstress.art, admin.tlsstress.art, the marketing site and the F2
provisioning cell), plus the business (Stripe), edge (Cloudflare) and
infrastructure (AWS) layers that back them.
It replaces the previously considered AWS OpenSearch domain (≈ US$700+/mo for a production-grade domain) with a self-hosted, off-AWS stack on a single Hetzner VM at ≈ €5/mo. The stack covers the same five observability signals and adds the SaaS-business telemetry that OpenSearch never covered.
Goals
- Five signals: metrics, logs, traces, synthetics, and pulled SaaS/edge data.
- SRE methodologies: Golden Signals, RED, USE, SLO/error-budget.
- Business visibility: MRR, subscriptions, payments, disputes (Stripe).
- Low cost: filesystem stores, free-tier APIs, one small VM.
- Single sign-on: same identities as admin.tlsstress.art (Auth0), no re-auth.
- Secure by default: only 80/443 public; everything else on the compose network.
Non-goals (today)
- High availability / multi-node (single VM — see §8 Risks).
- Long-term metric warehousing (30d Prometheus / 30d Loki / 7d Tempo).
- APM auto-instrumentation (traces require the apps to emit OTLP).
2. Context diagram (C4 level 1)
```mermaid
flowchart LR
  subgraph SaaS["tlsstress.art SaaS (AWS / Cloudflare)"]
    APP["app.tlsstress.art<br/>(App Runner)"]
    ADM["admin.tlsstress.art<br/>(App Runner)"]
    MKT["tlsstress.art<br/>(Cloudflare Workers)"]
    F2["f2.tlsstress.art<br/>(Hetzner cell-0)"]
  end
  subgraph Providers["SaaS provider APIs (free / paid)"]
    CF["Cloudflare<br/>GraphQL Analytics"]
    ST["Stripe API"]
    PL["Plausible Stats<br/>(ready / off)"]
    CW["AWS CloudWatch"]
  end
  OP(["Operator"])
  subgraph OBS["obs.tlsstress.art — Hetzner VM"]
    GRAF["Grafana"]
  end
  APP & ADM & MKT & F2 -. "synthetics (blackbox)" .-> OBS
  APP & ADM -. "logs (Firehose)¹" .-> OBS
  APP & ADM -. "traces (OTLP)¹" .-> OBS
  CF & ST & PL --> OBS
  CW --> GRAF
  OP -->|"Auth0 SSO"| GRAF
  classDef ready stroke-dasharray:4 3;
  class PL ready;
```
¹ Logs (Firehose) and traces (OTLP) are wired and ready, but they activate only once the AWS subscription filters and the apps' OTLP endpoints are pointed here (see §6).
3. Container diagram (C4 level 2)
```mermaid
flowchart TB
  subgraph VM["Hetzner cx23 · /opt/obs · Docker Compose"]
    direction TB
    CADDY["Caddy 2.8<br/>:80 :443 — auto-TLS"]
    subgraph EDGE["Public routes"]
      LAND["/ → landing (static)"]
      GF["/grafana/* → Grafana"]
      ING["/ingest* → Vector"]
    end
    subgraph COLLECT["Collectors"]
      BB["blackbox<br/>synthetics"]
      NE["node-exporter<br/>+ textfile"]
      CAD["cAdvisor"]
      POLL["cloud-poller<br/>CF·Stripe·Plausible"]
      VEC["Vector<br/>log intake"]
    end
    subgraph STORE["Stores (filesystem)"]
      PROM["Prometheus<br/>TSDB 30d"]
      LOKI["Loki<br/>logs 30d"]
      TEMPO["Tempo<br/>traces 7d"]
    end
    GRAFANA["Grafana 11<br/>dashboards + alerting"]
  end
  CADDY --> LAND & GF & ING
  GF --> GRAFANA
  ING --> VEC
  BB & NE & CAD --> PROM
  POLL -->|textfile| NE
  VEC --> LOKI
  TEMPO -->|span-metrics<br/>remote_write| PROM
  GRAFANA --> PROM & LOKI & TEMPO
```
Key boundary: only Caddy binds host ports (80/443). Every other service
is reachable solely on the internal tlsstress-obs_default Docker network: its
ports are exposed to that network, never published to the host. The host
firewall (Hetzner, Terraform-managed) allows only 22 (SSH), 80 and 443.
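The boundary rule above lends itself to an automated check. The following is a minimal sketch, assuming a simplified in-memory view of the Compose file; the `SERVICES` mapping, the internal port numbers and the helper name are illustrative and not taken from the real /opt/obs configuration. It only understands the short `HOST:CONTAINER` port form.

```python
# Hypothetical, simplified view of a Compose file: service name -> settings.
# "ports" publishes to the host; "expose" is visible only on the Compose network.
SERVICES = {
    "caddy": {"ports": ["80:80", "443:443"]},
    "grafana": {"expose": ["3000"]},
    "prometheus": {"expose": ["9090"]},
    "loki": {"expose": ["3100"]},
    "vector": {"expose": ["8080"]},  # assumed port, for illustration only
}

ALLOWED_PUBLISHERS = {"caddy"}
ALLOWED_HOST_PORTS = {"80", "443"}


def boundary_violations(services):
    """Return readable violations of the 'only Caddy binds host ports' rule."""
    problems = []
    for name, cfg in services.items():
        for mapping in cfg.get("ports", []):
            # Short form "HOST:CONTAINER"; the host port is the first field.
            host_port = mapping.split(":")[0]
            if name not in ALLOWED_PUBLISHERS:
                problems.append(f"{name} publishes host port {host_port}")
            elif host_port not in ALLOWED_HOST_PORTS:
                problems.append(f"{name} publishes unexpected host port {host_port}")
    return problems


if __name__ == "__main__":
    for problem in boundary_violations(SERVICES):
        print(problem)
```

A check like this could run in CI against the parsed Compose file, so that an accidental `ports:` entry on a collector or store fails the build instead of silently widening the attack surface.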
4. The five signals → which component
| Signal | Source | Collector | Store | SRE method |
|---|---|---|---|---|
| Metrics (infra) | host + containers | node-exporter, cAdvisor | Prometheus | USE |
| Synthetics | public URLs | blackbox_exporter | Prometheus | Golden Signals, SLO |
| Logs | CloudWatch/app | Vector (/ingest) | Loki | — |
| Traces | apps (OTLP) | Tempo | Tempo (+ Prom span-metrics) | RED |
| Profiles | Go pprof | Alloy (pyroscope.scrape) | Pyroscope | 4th pillar |
| Logs | CloudWatch (app/admin) + CloudTrail | logs-shipper (pull, $0) | Loki | — / SIEM |
| Edge / Business | Cloudflare, Stripe, Plausible | cloud-poller → textfile | Prometheus | business KPIs |
| AWS | CloudWatch | Grafana datasource (direct) | — (queried live) | USE |
5. Dashboards (the "advanced panels")
| Dashboard | UID | What it answers |
|---|---|---|
| Cloud — Overview | cloud-overview | Golden Signals: are the sites up, fast, secure? + edge + funnel + auth logs |
| Infra — Host & Containers (USE) | infra-host-use | Utilization/Saturation/Errors of the VM and every container |
| SLO — Availability & Error Budget | slo-synthetics | SLI 30d, error-budget burn, p50/p95/p99 latency, TLS expiry |
| Business — Revenue & Growth | business-revenue | MRR, active subs, daily volume, failed payments, disputes, edge, product |
| AWS — Infra (App Runner · RDS · Lambda) | aws-infra | App Runner reqs/5xx/latency/CPU/mem, RDS CPU/conns/storage, Lambda crons |
| RUM — Frontend (Faro) | rum-frontend | real-user Web Vitals (LCP/TTFB/CLS) + JS exceptions, per app |
| Synthetics — Vantages & Journeys | synthetics-vantages | multi-vantage probes (Helsinki box, Brazil, CF edge) + k6 journey, challenge-aware |
| TLS Posture — Certs, Versions, Chain | tls-posture | cert days-to-expiry, negotiated TLS version, handshake duration per surface |
| Backups & DR (R2) | backups-dr | backup freshness/result, R2 usage vs free tier, per-source bytes, bucket-exposure |
These mirror what large orgs run: an executive/business board, an SLO/SRE board, an infra/USE board, a golden-signals overview, a cloud-provider (AWS) board, plus RUM, multi-vantage synthetics, TLS posture, and a backups/DR board — 9 dashboards.
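The SLO board's "SLI 30d" and "error-budget burn" panels rest on simple arithmetic. The sketch below works it through for the 30-second probe interval used by the synthetics. The 99.9% target and the failure count are assumptions chosen for illustration; the actual SLO target is not stated in this document.

```python
PROBE_INTERVAL_S = 30                              # probe cadence from section 6.1
PROBES_30D = 30 * 24 * 3600 // PROBE_INTERVAL_S    # probes in a 30-day window


def error_budget(slo_target, total_probes, failed_probes):
    """Compute SLI, allowed failures and remaining budget for a probe-based SLO.

    slo_target is a fraction (0.999 means 99.9%). It is an assumed value here.
    """
    allowed = (1.0 - slo_target) * total_probes
    sli = 1.0 - failed_probes / total_probes
    remaining = 1.0 - failed_probes / allowed if allowed else 0.0
    return {
        "sli": sli,
        "allowed_failures": allowed,
        "budget_remaining": remaining,
    }


if __name__ == "__main__":
    # Hypothetical month: 43 failed probes against an assumed 99.9% target.
    report = error_budget(0.999, PROBES_30D, 43)
    print(f"SLI: {report['sli']:.5f}")
    print(f"allowed failures: {report['allowed_failures']:.1f}")
    print(f"budget remaining: {report['budget_remaining']:.1%}")
```

Under these assumptions a 30-day window holds 86,400 probes, so a 99.9% target tolerates roughly 86 failed probes, about 43 minutes of detected downtime per target.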
6. Data flows
6.1 Synthetics (active today)
```mermaid
sequenceDiagram
  participant P as Prometheus
  participant B as blackbox
  participant S as Public site
  loop every 30s
    P->>B: /probe?target=https://app.tlsstress.art
    B->>S: HTTP GET (TLS)
    S-->>B: 200 + cert
    B-->>P: probe_success, probe_duration_seconds, probe_ssl_earliest_cert_expiry
  end
```
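To make the probe exchange concrete, here is a minimal sketch of the two halves: building the `/probe` URL that Prometheus requests from blackbox, and reading the three metrics out of the text response. The blackbox base URL, the `http_2xx` module name and the sample response are assumptions for illustration; the parser handles only simple `name value` lines without timestamps.

```python
from urllib.parse import urlencode


def probe_url(blackbox_base, target, module="http_2xx"):
    """Build the blackbox /probe URL for one target (module name is assumed)."""
    return f"{blackbox_base}/probe?{urlencode({'target': target, 'module': module})}"


def parse_samples(exposition):
    """Parse 'name value' lines of the Prometheus text format into a dict."""
    samples = {}
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


def cert_days_left(samples, now):
    """Days until the earliest certificate in the chain expires."""
    return (samples["probe_ssl_earliest_cert_expiry"] - now) / 86400


# Illustrative response body, not captured from the real deployment.
SAMPLE = """\
# HELP probe_success Displays whether or not the probe was a success
probe_success 1
probe_duration_seconds 0.123
probe_ssl_earliest_cert_expiry 1.7e+09
"""

if __name__ == "__main__":
    print(probe_url("http://blackbox:9115", "https://app.tlsstress.art"))
    metrics = parse_samples(SAMPLE)
    print(metrics["probe_success"], cert_days_left(metrics, now=1.7e9 - 10 * 86400))
```

The same three metrics feed the availability SLI (`probe_success`), the latency percentiles (`probe_duration_seconds`) and the TLS expiry panels.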
6.2 Pulled SaaS/edge metrics (Cloudflare active; Stripe/Plausible ready)
```mermaid
sequenceDiagram
  participant C as cloud-poller
  participant CF as Cloudflare GraphQL
  participant ST as Stripe API
  participant TF as /textfile/cloud.prom
  participant NE as node-exporter
  participant P as Prometheus
  loop every 300s
    C->>CF: GraphQL httpRequests1dGroups
    C->>ST: balance, subscriptions, charges, disputes
    C->>TF: write cf_*, stripe_* (atomic)
  end
  P->>NE: scrape :9100
  NE-->>P: textfile metrics
```
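The "(atomic)" write matters because node-exporter may scrape the textfile at any moment, and a half-written file would yield truncated or invalid metrics. A common way to get atomicity is to write a temporary file in the same directory and rename it over the target. The sketch below shows that pattern; it is not the cloud-poller's actual code, and the metric names and values are hypothetical examples of the `cf_*` / `stripe_*` families.

```python
import os
import tempfile


def render_metrics(values):
    """Render a name -> number mapping as Prometheus text-format gauges."""
    lines = []
    for name in sorted(values):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {values[name]}")
    return "\n".join(lines) + "\n"


def write_atomic(path, text):
    """Write text to path so that readers never observe a partial file."""
    directory = os.path.dirname(path) or "."
    # The temp file lives in the same directory so os.replace stays on one
    # filesystem. Its ".tmp" suffix keeps the textfile collector, which only
    # reads *.prom files, from picking it up.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".cloud.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        os.chmod(tmp, 0o644)   # mkstemp creates 0600; the scraper must read it
        os.replace(tmp, path)  # atomic rename over the previous version
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()
    target = os.path.join(out_dir, "cloud.prom")
    # Hypothetical metric names and values.
    write_atomic(target, render_metrics({"cf_requests_24h": 1200, "stripe_active_subscriptions": 3}))
    print(open(target).read())
```

If a poll cycle fails midway, the previous cloud.prom stays in place, so Prometheus keeps serving the last good values instead of gaps.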
6.3 Logs (wired, activates with AWS Firehose)
```mermaid
flowchart LR
  CWL["CloudWatch Logs<br/>(signin/signup/app/admin)"]
    -->|subscription filter| FH["Kinesis Firehose<br/>HTTP delivery"]
    -->|"POST /ingest<br/>X-Amz-Firehose-Access-Key"| CADDY[Caddy]
    --> VEC["Vector<br/>remap: tag auth events"]
    --> LOKI[Loki]
    --> GRAF[Grafana logs panel]
```
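For readers unfamiliar with the Firehose HTTP delivery format, the sketch below unpacks one request body. Firehose posts a JSON envelope whose records carry base64 data, and CloudWatch Logs subscription payloads inside are gzip-compressed JSON. In this design Vector performs the decoding; the Python here only illustrates the data shape. The auth-tagging rule, based on the log group name, is a hypothetical stand-in for the real remap logic.

```python
import base64
import gzip
import json


def decode_firehose_body(body):
    """Return (request_id, events) from a Firehose HTTP delivery request body."""
    envelope = json.loads(body)
    events = []
    for record in envelope["records"]:
        raw = base64.b64decode(record["data"])
        payload = json.loads(gzip.decompress(raw))
        if payload.get("messageType") != "DATA_MESSAGE":
            continue  # CONTROL_MESSAGE records carry no log events
        for event in payload["logEvents"]:
            events.append({
                "log_group": payload["logGroup"],
                "message": event["message"],
                "timestamp_ms": event["timestamp"],
            })
    return envelope["requestId"], events


def tag_auth(event):
    """Hypothetical tagging rule: signin/signup log groups count as auth events."""
    group = event["log_group"].lower()
    is_auth = "signin" in group or "signup" in group
    return {**event, "category": "auth" if is_auth else "app"}


def ack(request_id, now_ms):
    """Response body Firehose expects: the echoed requestId plus a timestamp."""
    return json.dumps({"requestId": request_id, "timestamp": now_ms})


if __name__ == "__main__":
    payload = {
        "messageType": "DATA_MESSAGE",
        "logGroup": "/example/signin",  # illustrative log group name
        "logEvents": [{"id": "1", "timestamp": 1, "message": "login ok"}],
    }
    data = base64.b64encode(gzip.compress(json.dumps(payload).encode())).decode()
    body = json.dumps({"requestId": "r-1", "timestamp": 2, "records": [{"data": data}]})
    request_id, events = decode_firehose_body(body)
    print([tag_auth(e) for e in events], ack(request_id, 3))
```

The shared secret in `X-Amz-Firehose-Access-Key` is checked before any of this decoding happens; that check is omitted here.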
6.4 Traces (wired, activates when apps emit OTLP)
Point each app's OTEL_EXPORTER_OTLP_ENDPOINT at the VM (Tempo: 4317 for OTLP/gRPC, 4318 for OTLP/HTTP).
Tempo generates span-metrics and a service graph and remote_writes them
into Prometheus, so RED metrics appear even before any custom dashboards exist.
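As a rough illustration of the app-side change, the helper below assembles the standard OpenTelemetry environment variables for one service. The host name is a placeholder and the plain `http://` scheme is an assumption: this document does not describe how the OTLP ports are reached from the apps, so treat both as values to be filled in from the LLD.

```python
# Default OTLP ports per protocol, matching the Tempo ports named above.
OTLP_PORTS = {"grpc": 4317, "http/protobuf": 4318}


def otlp_env(service_name, collector_host, protocol="grpc"):
    """Build OTel SDK environment variables for one app.

    collector_host is a placeholder for however the VM's Tempo is addressed.
    """
    if protocol not in OTLP_PORTS:
        raise ValueError(f"unsupported OTLP protocol: {protocol}")
    return {
        "OTEL_SERVICE_NAME": service_name,
        "OTEL_EXPORTER_OTLP_PROTOCOL": protocol,
        # Scheme is an assumption; use https:// if the endpoint terminates TLS.
        "OTEL_EXPORTER_OTLP_ENDPOINT": f"http://{collector_host}:{OTLP_PORTS[protocol]}",
    }


if __name__ == "__main__":
    for key, value in otlp_env("app", "tempo.example.internal").items():
        print(f"{key}={value}")
```

Setting a distinct `OTEL_SERVICE_NAME` per app is what lets Tempo's span-metrics and service graph separate app from admin in the RED views.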
6.5 DB-down detection + self-healing (closed loop)
Hardening from the 2026-06-16 admin "DB indisponível" (DB unavailable) incident, in which a stale DB credential
went unnoticed for ~33h because the shallow /api/health endpoint kept returning a graceful 200:
```mermaid
flowchart LR
  subgraph APP["App Runner apps"]
    R["/api/ready<br/>(SELECT 1 per DB → 503)"]
  end
  BB["blackbox-ready<br/>(Prometheus)"]
  AL["AppDBDown<br/>(Alertmanager → Slack)"]
  SH["obs-db-selfheal<br/>(Lambda, EventBridge 5m)"]
  R -->|probe| BB --> AL
  SH -->|probe /api/ready| R
  SH -->|"503 → StartDeployment"| APP
```
- Detect: blackbox probes the deep /api/ready (runs SELECT 1 on every DB, returns 503 if any is down) → AppDBDown pages the operator. The shallow /api/health liveness check is kept separate.
- Self-heal: the obs-db-selfheal Lambda (no-VPC, scheduled) redeploys the App Runner service on a 503, which re-fetches the DB secret and rebuilds the pool. It is cooldown-guarded and heals only on 503. The alert is the human backstop if the auto-heal can't recover.
- Prevent: apps connect as dedicated non-master roles (so RDS master-secret rotation can't break them); the off-AWS RDS backup now dumps every DB (§ DR-RESTORE).
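The self-heal rule (act only on 503, and never more often than the cooldown allows) can be sketched as a small pure function. This is not the Lambda's source: the 15-minute cooldown is an assumed value, and `probe` / `redeploy` are injected callables so the logic runs without AWS access. In a real handler, `redeploy` would wrap App Runner's StartDeployment call, and the last-heal timestamp would need to be persisted between invocations.

```python
COOLDOWN_SECONDS = 900  # assumed value; the real cooldown is not stated here


def decide(status_code, now, last_heal_at, cooldown=COOLDOWN_SECONDS):
    """Return 'noop', 'cooldown' or 'redeploy' for one /api/ready probe result."""
    if status_code != 503:
        return "noop"      # heal only on 503; other failures go to the alert
    if last_heal_at is not None and now - last_heal_at < cooldown:
        return "cooldown"  # a recent redeploy is still settling
    return "redeploy"


def run_once(probe, redeploy, now, last_heal_at):
    """One scheduled tick: probe, maybe redeploy, return (action, last_heal_at)."""
    action = decide(probe(), now, last_heal_at)
    if action == "redeploy":
        redeploy()
        return action, now
    return action, last_heal_at


if __name__ == "__main__":
    calls = []
    state = None
    # Simulated ticks five minutes apart while the app keeps answering 503.
    for tick in (0, 300, 600, 900):
        action, state = run_once(lambda: 503, lambda: calls.append(tick), tick, state)
        print(tick, action)
```

The cooldown keeps a persistent 503 (for example, a database that is genuinely down) from triggering a redeploy on every 5-minute tick, which is the case the AppDBDown alert exists to cover.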
7. Authentication & SSO (high level)
Grafana delegates login to the same Auth0 tenant as admin.tlsstress.art
via OpenID Connect (Grafana "generic OAuth"). Because Auth0 keeps a
tenant-wide SSO session cookie, a user already authenticated at
admin.tlsstress.art is returned to Grafana silently — no second login.
```mermaid
sequenceDiagram
  actor U as Operator
  participant A as admin.tlsstress.art
  participant Z as Auth0 tenant
  participant G as Grafana (/grafana)
  U->>A: login (user/pass + MFA)
  A->>Z: OIDC
  Z-->>A: session established (SSO cookie set)
  Note over U,G: later, same browser
  U->>G: open /grafana
  G->>Z: /authorize (auto_login)
  Z-->>G: silent assent (SSO cookie present) — NO prompt
  G-->>U: logged in
```
Full design, the 1-time Auth0 app step, and role mapping: SSO-AUTH0.
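The `/authorize` hop in the diagram is an ordinary OIDC authorization-code request. The sketch below builds such a URL to show which parameters travel in the redirect. The tenant domain and client ID are placeholders, and the redirect URI is an assumption based on Grafana's generic OAuth callback path under the /grafana sub-path; the authoritative values live in SSO-AUTH0.

```python
from urllib.parse import urlencode


def authorize_url(tenant_domain, client_id, redirect_uri, state):
    """Build an OIDC authorization-code request against an Auth0 tenant."""
    query = urlencode({
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": "openid profile email",
        "state": state,  # opaque anti-CSRF value, echoed back on the callback
    })
    return f"https://{tenant_domain}/authorize?{query}"


if __name__ == "__main__":
    print(authorize_url(
        tenant_domain="example-tenant.auth0.com",  # placeholder tenant
        client_id="GRAFANA_CLIENT_ID",             # placeholder client ID
        redirect_uri="https://obs.tlsstress.art/grafana/login/generic_oauth",  # assumed
        state="random-state",
    ))
```

When the browser already holds the tenant's SSO session cookie, Auth0 answers this request with an immediate redirect carrying the code, which is why the operator sees no prompt.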
8. Risks & non-functionals
| Concern | Posture | Mitigation / roadmap |
|---|---|---|
| SPOF (single VM) | accepted for cost | off-box daily backup → Cloudflare R2 (backup-r2); HA mirrors ADR-0105 |
| Data durability / DR | filesystem volumes | ✅ daily off-box backup → R2 of obs+CF+Stripe+AWS(RDS dump/tfstate/TBI)+GitHub+Auth0, size-aware ≤10 GB free tier, monitored (backups-dr + backup_dr alerts). Full restore: DR-RESTORE-RUNBOOK |
| Secret exposure | .env 0600, gitignored | dedicated read-only keys per source (Stripe rk_, AWS RO obs-backup-ro, GitHub PAT, Auth0 M2M, R2 token); no secret VALUES in R2 (only names/ARNs) |
| Backup tamper / public exposure | R2 bucket private (no r2.dev/custom domain; anon GET=400) | backup_r2_bucket_public metric + R2BucketPublic critical alert; encrypted at rest + TLS |
| Public attack surface | only 80/443 | Caddy security headers, Hetzner firewall, no published service ports |
| Cost creep | free tiers | CF Free, Plausible deferred until profitable, Stripe API free |
| Alert fatigue | curated rules | golden-signals + business only; see prometheus/alerts.yml |
9. Cost model
| Item | Monthly |
|---|---|
| Hetzner cx23 (VM + IPv4) | ≈ €5 |
| Cloudflare Analytics (Free) | €0 |
| Stripe API | €0 |
| AWS CloudWatch (read API calls) | ≈ €0 (negligible) |
| Plausible | €0 today (deferred; ~US$9/mo when activated) |
| Total | ≈ €5/mo |
Compare: AWS OpenSearch production domain ≈ US$ 700+/mo. See ADR-0106 §alternatives.