HLD — Cloud Observability Platform (`obs.tlsstress.art`)

High-Level Design. Architecture, data flows, design decisions and the "why". For per-component config see LLD; for operations see the RUNBOOK; for the SSO design see SSO-AUTH0. Canonical decision record: ADR-0106.

1. Purpose & scope

A single pane of glass for the tlsstress.art SaaS cloud surfaces (app.tlsstress.art, admin.tlsstress.art, the marketing site, the F2 provisioning cell), plus the business (Stripe), edge (Cloudflare) and infrastructure (AWS) layers that back them.

It replaces the previously considered AWS OpenSearch option (≈ US$700+/mo for a production-grade domain) with a self-hosted, off-AWS stack on a single Hetzner VM at ≈ €5/mo. The stack covers the same five observability signals and adds SaaS-business telemetry that OpenSearch never covered.

Goals

Five signals: metrics, logs, traces, synthetics, and pulled SaaS/edge data.
SRE methodologies: Golden Signals, RED, USE, SLO/error-budget.
Business visibility: MRR, subscriptions, payments, disputes (Stripe).
Low cost: filesystem stores, free-tier APIs, one small VM.
Single sign-on: same identities as admin.tlsstress.art (Auth0), no re-auth.
Secure by default: only 80/443 public; everything else on the compose network.

Non-goals (today)

High availability / multi-node (single VM — see §8 Risks).
Long-term metric warehousing (retention is capped at 30d Prometheus / 30d Loki / 7d Tempo).
APM auto-instrumentation (traces require the apps to emit OTLP).

2. Context diagram (C4 level 1)

flowchart LR
  subgraph SaaS["tlsstress.art SaaS (AWS / Cloudflare)"]
    APP["app.tlsstress.art<br/>(App Runner)"]
    ADM["admin.tlsstress.art<br/>(App Runner)"]
    MKT["tlsstress.art<br/>(Cloudflare Workers)"]
    F2["f2.tlsstress.art<br/>(Hetzner cell-0)"]
  end

  subgraph Providers["SaaS provider APIs (free / paid)"]
    CF["Cloudflare<br/>GraphQL Analytics"]
    ST["Stripe API"]
    PL["Plausible Stats<br/>(ready / off)"]
    CW["AWS CloudWatch"]
  end

  OP(["Operator"])

  subgraph OBS["obs.tlsstress.art — Hetzner VM"]
    GRAF["Grafana"]
  end

  APP & ADM & MKT & F2 -. "synthetics (blackbox)" .-> OBS
  APP & ADM -. "logs (Firehose)¹" .-> OBS
  APP & ADM -. "traces (OTLP)¹" .-> OBS
  CF & ST & PL --> OBS
  CW --> GRAF
  OP -->|"Auth0 SSO"| GRAF

  classDef ready stroke-dasharray:4 3;
  class PL ready;

¹ Logs (Firehose) and traces (OTLP) are wired and ready, but they only start flowing once the AWS subscription filters and the apps' OTLP endpoints are pointed at this VM (see §6).

3. Container diagram (C4 level 2)

flowchart TB
  subgraph VM["Hetzner cx23 · /opt/obs · Docker Compose"]
    direction TB
    CADDY["Caddy 2.8<br/>:80 :443 — auto-TLS"]

    subgraph EDGE["Public routes"]
      LAND["/ → landing (static)"]
      GF["/grafana/* → Grafana"]
      ING["/ingest* → Vector"]
    end

    subgraph COLLECT["Collectors"]
      BB["blackbox<br/>synthetics"]
      NE["node-exporter<br/>+ textfile"]
      CAD["cAdvisor"]
      POLL["cloud-poller<br/>CF·Stripe·Plausible"]
      VEC["Vector<br/>log intake"]
    end

    subgraph STORE["Stores (filesystem)"]
      PROM["Prometheus<br/>TSDB 30d"]
      LOKI["Loki<br/>logs 30d"]
      TEMPO["Tempo<br/>traces 7d"]
    end

    GRAFANA["Grafana 11<br/>dashboards + alerting"]
  end

  CADDY --> LAND & GF & ING
  GF --> GRAFANA
  ING --> VEC
  BB & NE & CAD --> PROM
  POLL -->|textfile| NE
  VEC --> LOKI
  TEMPO -->|span-metrics<br/>remote_write| PROM
  GRAFANA --> PROM & LOKI & TEMPO

Key boundary: only Caddy binds host ports (80/443). Every other service is reachable solely on the internal tlsstress-obs_default Docker network: its ports are exposed to that network but never published to the host. The Hetzner firewall (Terraform-managed) allows only 22 (SSH), 80 and 443.
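The boundary rule above lends itself to an automated check. The following is a minimal sketch, assuming the Compose file has already been loaded into a dict (for example with a YAML parser) and that the proxy service is named `caddy`; both the service name and the helper names are hypothetical. It handles only the short `HOST:CONTAINER` port syntax, not port ranges or the long mapping form.

```python
from typing import Optional


def host_port(mapping: str) -> Optional[int]:
    """Return the host port of a short-syntax mapping such as '80:80',
    '127.0.0.1:9090:9090' or '443:443/tcp'. A bare container port
    ('9090') has no fixed host port, so None is returned."""
    spec = mapping.split("/")[0]          # drop an optional '/tcp' or '/udp'
    parts = spec.split(":")
    return int(parts[-2]) if len(parts) >= 2 else None


def boundary_violations(compose: dict,
                        allowed_service: str = "caddy",
                        allowed_ports: frozenset = frozenset({80, 443})) -> list:
    """List (service, mapping) pairs that publish a host port in breach of
    the rule 'only the proxy publishes, and only 80/443'."""
    violations = []
    for name, service in compose.get("services", {}).items():
        for mapping in service.get("ports", []):
            port = host_port(str(mapping))
            if name != allowed_service or port not in allowed_ports:
                violations.append((name, str(mapping)))
    return violations
```

Services that only declare `expose` never appear in the result, which matches the "exposed, never published" posture.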

4. The five signals → which component

Signal	Source	Collector	Store	SRE method
Metrics (infra)	host + containers	node-exporter, cAdvisor	Prometheus	USE
Synthetics	public URLs	blackbox_exporter	Prometheus	Golden Signals, SLO
Logs	CloudWatch/app	Vector (`/ingest`)	Loki	—
Traces	apps (OTLP)	Tempo	Tempo (+ Prom span-metrics)	RED
Profiles	Go pprof	Alloy (`pyroscope.scrape`)	Pyroscope	4th pillar
Logs	CloudWatch (app/admin) + CloudTrail	logs-shipper (pull, $0)	Loki	— / SIEM
Edge / Business	Cloudflare, Stripe, Plausible	cloud-poller → textfile	Prometheus	business KPIs
AWS	CloudWatch	Grafana datasource (direct)	— (queried live)	USE

5. Dashboards (the "advanced panels")

Dashboard	UID	What it answers
Cloud — Overview	`cloud-overview`	Golden Signals: are the sites up, fast, secure? + edge + funnel + auth logs
Infra — Host & Containers (USE)	`infra-host-use`	Utilization/Saturation/Errors of the VM and every container
SLO — Availability & Error Budget	`slo-synthetics`	SLI 30d, error-budget burn, p50/p95/p99 latency, TLS expiry
Business — Revenue & Growth	`business-revenue`	MRR, active subs, daily volume, failed payments, disputes, edge, product
AWS — Infra (App Runner · RDS · Lambda)	`aws-infra`	App Runner reqs/5xx/latency/CPU/mem, RDS CPU/conns/storage, Lambda crons
RUM — Frontend (Faro)	`rum-frontend`	real-user Web Vitals (LCP/TTFB/CLS) + JS exceptions, per app
Synthetics — Vantages & Journeys	`synthetics-vantages`	multi-vantage probes (Helsinki box, Brazil, CF edge) + k6 journey, challenge-aware
TLS Posture — Certs, Versions, Chain	`tls-posture`	cert days-to-expiry, negotiated TLS version, handshake duration per surface
Backups & DR (R2)	`backups-dr`	backup freshness/result, R2 usage vs free tier, per-source bytes, bucket-exposure

These mirror what large orgs run: an executive/business board, an SLO/SRE board, an infra/USE board, a golden-signals overview, a cloud-provider (AWS) board, plus RUM, multi-vantage synthetics, TLS posture, and a backups/DR board — 9 dashboards.
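The error-budget figures on the SLO board come down to simple arithmetic. The sketch below assumes an availability target of 99.9% over the 30-day SLI window; the target value is an illustrative assumption, as this document does not state one.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Total allowed downtime in the window, in minutes."""
    return window_days * 24 * 60 * (1.0 - slo_target)


def budget_consumed(sli: float, slo_target: float) -> float:
    """Fraction of the error budget already spent (1.0 means exhausted)."""
    return (1.0 - sli) / (1.0 - slo_target)
```

With a 99.9% target the 30-day budget is about 43.2 minutes, and a measured SLI of 99.95% means roughly half of that budget has been spent.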

6. Data flows

6.1 Synthetics (active today)

sequenceDiagram
  participant P as Prometheus
  participant B as blackbox
  participant S as Public site
  loop every 30s
    P->>B: /probe?target=https://app.tlsstress.art
    B->>S: HTTP GET (TLS)
    S-->>B: 200 + cert
    B-->>P: probe_success, probe_duration_seconds, probe_ssl_earliest_cert_expiry
  end
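To make the probe round-trip above concrete, here is a minimal sketch of the two ends of that exchange: building the `/probe` URL and reading the three metrics from the text response. The blackbox address `http://blackbox:9115` and the module name `http_2xx` are assumptions (they are blackbox_exporter's usual defaults, not values confirmed by this design), and the helper names are hypothetical.

```python
from urllib.parse import urlencode


def probe_url(blackbox_base: str, target: str, module: str = "http_2xx") -> str:
    """Build the URL Prometheus requests from blackbox for one target."""
    return f"{blackbox_base}/probe?{urlencode({'target': target, 'module': module})}"


def parse_probe_metrics(exposition: str) -> dict:
    """Parse label-free samples from Prometheus text exposition output."""
    samples = {}
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip HELP/TYPE comments and blanks
        name, _, value = line.rpartition(" ")
        if "{" in name:
            continue                      # labelled series are out of scope here
        samples[name] = float(value)
    return samples


def days_until_cert_expiry(samples: dict, now_epoch: float) -> float:
    """probe_ssl_earliest_cert_expiry is a Unix timestamp in seconds."""
    return (samples["probe_ssl_earliest_cert_expiry"] - now_epoch) / 86400
```

The same subtraction against the current time is how a TLS-expiry panel or alert would turn the raw timestamp into "days remaining".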

6.2 Pulled SaaS/edge metrics (Cloudflare active; Stripe/Plausible ready)

sequenceDiagram
  participant C as cloud-poller
  participant CF as Cloudflare GraphQL
  participant ST as Stripe API
  participant TF as /textfile/cloud.prom
  participant NE as node-exporter
  participant P as Prometheus
  loop every 300s
    C->>CF: GraphQL httpRequests1dGroups
    C->>ST: balance, subscriptions, charges, disputes
    C->>TF: write cf_*, stripe_* (atomic)
  end
  P->>NE: scrape :9100
  NE-->>P: textfile metrics
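The "(atomic)" write in the loop above matters because node-exporter may read the file at any moment. A minimal sketch of one way to do it follows, assuming the poller writes a temporary file in the same directory and renames it into place; the metric names used in the usage note are hypothetical examples of the `cf_*` / `stripe_*` families, and this is not necessarily how cloud-poller itself is implemented.

```python
import os
import tempfile


def render_metrics(metrics: dict) -> str:
    """Render label-free gauges in Prometheus text exposition format."""
    return "".join(f"{name} {value}\n" for name, value in sorted(metrics.items()))


def write_textfile_atomic(path: str, metrics: dict) -> None:
    """Write to a temp file beside `path`, then rename over it.
    os.replace is atomic on POSIX when both paths share a filesystem,
    so a concurrent scrape sees either the old or the new file, never a
    half-written one. The '.tmp' suffix keeps the temp file out of the
    textfile collector, which only reads '*.prom'."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(render_metrics(metrics))
            tmp.flush()
            os.fsync(tmp.fileno())
        os.chmod(tmp_path, 0o644)         # mkstemp defaults to 0600
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A call such as `write_textfile_atomic("/textfile/cloud.prom", {"cf_requests_total": 1200.0})` would then be made once per polling cycle.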

6.3 Logs (wired, activates with AWS Firehose)

flowchart LR
  CWL["CloudWatch Logs<br/>(signin/signup/app/admin)"]
  -->|subscription filter| FH["Kinesis Firehose<br/>HTTP delivery"]
  -->|"POST /ingest<br/>X-Amz-Firehose-Access-Key"| CADDY[Caddy]
  --> VEC["Vector<br/>remap: tag auth events"]
  --> LOKI[Loki]
  --> GRAF[Grafana logs panel]
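In this design Vector does the decoding, but the payload shape is easier to see in a few lines of code. The sketch below assumes the standard Firehose HTTP-endpoint request (a JSON body with `requestId`, `timestamp` and base64 `records`) carrying gzip-compressed CloudWatch Logs subscription data. The function names and the rule used to tag auth events (log group name contains `signin` or `signup`) are hypothetical.

```python
import base64
import gzip
import hmac
import json


def access_key_is_valid(headers: dict, expected_key: str) -> bool:
    """Constant-time check of the X-Amz-Firehose-Access-Key header."""
    lowered = {k.lower(): v for k, v in headers.items()}
    supplied = lowered.get("x-amz-firehose-access-key", "")
    return hmac.compare_digest(supplied.encode(), expected_key.encode())


def decode_firehose_body(body: dict) -> list:
    """Flatten a Firehose delivery request into one dict per log event."""
    events = []
    for record in body["records"]:
        payload = json.loads(gzip.decompress(base64.b64decode(record["data"])))
        if payload.get("messageType") != "DATA_MESSAGE":
            continue                      # skip CloudWatch control messages
        group = payload["logGroup"]
        for event in payload["logEvents"]:
            events.append({
                "log_group": group,
                "log_stream": payload["logStream"],
                "timestamp_ms": event["timestamp"],
                "message": event["message"],
                # Hypothetical tagging rule, standing in for the Vector remap.
                "auth_event": "signin" in group or "signup" in group,
            })
    return events


def ack(request_id: str, now_ms: int) -> dict:
    """Firehose expects the request id echoed back with a timestamp."""
    return {"requestId": request_id, "timestamp": now_ms}
```

Rejecting requests whose access key does not match keeps the public `/ingest` route from accepting arbitrary log lines.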

6.4 Traces (wired, activates when apps emit OTLP)

Point each app's OTEL_EXPORTER_OTLP_ENDPOINT at the VM (Tempo :4317/:4318). Tempo generates span-metrics + a service graph and remote_writes them into Prometheus, so RED metrics appear even before custom dashboards.
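As an illustration of that wiring, the sketch below builds the standard OpenTelemetry environment variables for one app. The collector host, the URL scheme and the service name are placeholders, and how the apps actually reach Tempo over the network is an assumption not settled here; only the variable names and the 4317 (gRPC) / 4318 (HTTP) port convention are standard.

```python
def otlp_env(service_name: str, collector_host: str,
             protocol: str = "grpc", scheme: str = "http") -> dict:
    """Return the OTel SDK environment variables for one service."""
    ports = {"grpc": 4317, "http/protobuf": 4318}
    if protocol not in ports:
        raise ValueError(f"unsupported OTLP protocol: {protocol}")
    return {
        "OTEL_SERVICE_NAME": service_name,
        "OTEL_EXPORTER_OTLP_PROTOCOL": protocol,
        "OTEL_EXPORTER_OTLP_ENDPOINT": f"{scheme}://{collector_host}:{ports[protocol]}",
    }
```

For example, `otlp_env("admin", "tempo.example.internal", "http/protobuf")` yields an endpoint on port 4318; the hostname there is purely hypothetical.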

6.5 DB-down detection + self-healing (closed loop)

Hardening from the 2026-06-16 admin "DB indisponível" ("DB unavailable") incident: a stale DB credential caused the shallow /api/health endpoint to serve a graceful 200, so the outage went unnoticed for ~33h.

flowchart LR
  subgraph APP["App Runner apps"]
    R["/api/ready<br/>(SELECT 1 per DB → 503)"]
  end
  BB["blackbox-ready<br/>(Prometheus)"]
  AL["AppDBDown<br/>(Alertmanager → Slack)"]
  SH["obs-db-selfheal<br/>(Lambda, EventBridge 5m)"]
  R -->|probe| BB --> AL
  SH -->|probe /api/ready| R
  SH -->|"503 → StartDeployment"| APP

Detect: blackbox probes the deep /api/ready (runs SELECT 1 on every DB, returns 503 if any is down) → AppDBDown pages the operator. The shallow /api/health liveness is kept separate.
Self-heal: the obs-db-selfheal Lambda (no-VPC, scheduled) redeploys the App Runner service on a 503, which re-fetches the DB secret and rebuilds the connection pool. It is cooldown-guarded and acts only on a 503. The alert is the human backstop if the auto-heal can't recover.
Prevent: apps connect as dedicated non-master roles (so RDS master-secret rotation can't break them); the off-AWS RDS backup now dumps every DB (§ DR-RESTORE).
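The self-heal rule ("heal only on 503, and not again within the cooldown") can be sketched as a small pure function. This is an illustration rather than the Lambda's actual code: the 30-minute cooldown, the in-memory `HealState` and the function names are assumptions, and the probe and redeploy actions are injected so the logic runs without AWS access.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class HealState:
    last_heal_at: Optional[float] = None  # epoch seconds of the last redeploy


def should_heal(status_code: int, now: float, state: HealState,
                cooldown_s: float) -> bool:
    """Heal only on 503, and only when the cooldown has elapsed."""
    if status_code != 503:
        return False
    in_cooldown = (state.last_heal_at is not None
                   and now - state.last_heal_at < cooldown_s)
    return not in_cooldown


def run_once(probe: Callable[[], int], redeploy: Callable[[], None],
             now: float, state: HealState, cooldown_s: float = 1800.0) -> str:
    """One scheduled tick: probe /api/ready, then redeploy if warranted."""
    status = probe()
    if status != 503:
        return "no-action"
    if not should_heal(status, now, state, cooldown_s):
        return "cooldown"
    redeploy()
    state.last_heal_at = now
    return "redeployed"
```

In a real Lambda, `redeploy` would wrap the App Runner StartDeployment call, and the last-heal timestamp would need to live somewhere durable, since Lambda invocations do not share memory.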

7. Authentication & SSO (high level)

Grafana delegates login to the same Auth0 tenant as admin.tlsstress.art via OpenID Connect (Grafana "generic OAuth"). Because Auth0 keeps a tenant-wide SSO session cookie, a user already authenticated at admin.tlsstress.art is returned to Grafana silently — no second login.

sequenceDiagram
  actor U as Operator
  participant A as admin.tlsstress.art
  participant Z as Auth0 tenant
  participant G as Grafana (/grafana)
  U->>A: login (user/pass + MFA)
  A->>Z: OIDC
  Z-->>A: session established (SSO cookie set)
  Note over U,G: later, same browser
  U->>G: open /grafana
  G->>Z: /authorize (auto_login)
  Z-->>G: silent assent (SSO cookie present) — NO prompt
  G-->>U: logged in

Full design, the 1-time Auth0 app step, and role mapping: SSO-AUTH0.
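For orientation only, the sketch below assembles the Grafana environment variables that a generic-OAuth setup against Auth0 typically needs. The tenant domain and client ID are placeholders, the helper name is hypothetical, and the client secret and role mapping are deliberately left out. SSO-AUTH0 remains the authoritative source for the real values.

```python
def grafana_auth0_env(auth0_domain: str, client_id: str, root_url: str) -> dict:
    """Build GF_* overrides for Grafana's [server] and [auth.generic_oauth]
    sections, using Auth0's standard OIDC endpoints."""
    base = f"https://{auth0_domain}"
    return {
        "GF_SERVER_ROOT_URL": root_url,
        "GF_AUTH_GENERIC_OAUTH_ENABLED": "true",
        "GF_AUTH_GENERIC_OAUTH_NAME": "Auth0",
        "GF_AUTH_GENERIC_OAUTH_CLIENT_ID": client_id,
        "GF_AUTH_GENERIC_OAUTH_SCOPES": "openid profile email",
        "GF_AUTH_GENERIC_OAUTH_AUTH_URL": f"{base}/authorize",
        "GF_AUTH_GENERIC_OAUTH_TOKEN_URL": f"{base}/oauth/token",
        "GF_AUTH_GENERIC_OAUTH_API_URL": f"{base}/userinfo",
        # auto_login skips Grafana's own login page and goes straight to
        # Auth0, which is what makes the return trip silent.
        "GF_AUTH_GENERIC_OAUTH_AUTO_LOGIN": "true",
    }
```

The client secret belongs in the `.env` file described in §8, not in code.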

8. Risks & non-functionals

Concern	Posture	Mitigation / roadmap
SPOF (single VM)	accepted for cost	off-box daily backup → Cloudflare R2 (`backup-r2`); HA mirrors ADR-0105
Data durability / DR	filesystem volumes	✅ daily off-box backup → R2 of obs+CF+Stripe+AWS(RDS dump/tfstate/TBI)+GitHub+Auth0, size-aware ≤10 GB free tier, monitored (`backups-dr` + `backup_dr` alerts). Full restore: DR-RESTORE-RUNBOOK
Secret exposure	`.env` 0600, gitignored	dedicated read-only keys per source (Stripe `rk_`, AWS RO `obs-backup-ro`, GitHub PAT, Auth0 M2M, R2 token); no secret VALUES in R2 (only names/ARNs)
Backup tamper / public exposure	R2 bucket private (no r2.dev/custom domain; anon GET=400)	`backup_r2_bucket_public` metric + `R2BucketPublic` critical alert; encrypted at rest + TLS
Public attack surface	only 80/443	Caddy security headers, Hetzner firewall, no published service ports
Cost creep	free tiers	CF Free, Plausible deferred until profitable, Stripe API free
Alert fatigue	curated rules	golden-signals + business only; see `prometheus/alerts.yml`

9. Cost model

Item	Monthly
Hetzner cx23 (VM + IPv4)	≈ €5
Cloudflare Analytics (Free)	€0
Stripe API	€0
AWS CloudWatch (read API calls)	≈ €0 (negligible)
Plausible	€0 today (deferred; ~US$9/mo when activated)
Total	≈ €5/mo

Compare: AWS OpenSearch production domain ≈ US$ 700+/mo. See ADR-0106 §alternatives.
