
HLD — Cloud Observability Platform (obs.tlsstress.art)

High-Level Design. Architecture, data flows, design decisions and the "why". For per-component config see LLD; for operations see the RUNBOOK; for the SSO design see SSO-AUTH0. Canonical decision record: ADR-0106.


1. Purpose & scope

A single pane of glass for the tlsstress.art SaaS cloud surfaces (app.tlsstress.art, admin.tlsstress.art, the marketing site, the F2 provisioning cell), plus the business (Stripe), edge (Cloudflare) and infrastructure (AWS) layers that back them.

It replaces the previously-considered AWS OpenSearch (≈US$ 700+/mo for a production-grade domain) with a self-hosted, off-AWS stack on a single Hetzner VM at ≈ €5/mo, with the same five observability signals plus SaaS-business telemetry that OpenSearch never covered.

Goals

  • Five signals: metrics, logs, traces, synthetics, and pulled SaaS/edge data.
  • SRE methodologies: Golden Signals, RED, USE, SLO/error-budget.
  • Business visibility: MRR, subscriptions, payments, disputes (Stripe).
  • Low cost: filesystem stores, free-tier APIs, one small VM.
  • Single sign-on: same identities as admin.tlsstress.art (Auth0), no re-auth.
  • Secure by default: only 80/443 public; everything else on the compose network.

Non-goals (today)

  • High availability / multi-node (single VM — see §8 Risks).
  • Long-term metric warehousing (30d Prometheus / 30d Loki / 7d Tempo).
  • APM auto-instrumentation (traces require the apps to emit OTLP).

2. Context diagram (C4 level 1)

flowchart LR
  subgraph SaaS["tlsstress.art SaaS (AWS / Cloudflare)"]
    APP["app.tlsstress.art<br/>(App Runner)"]
    ADM["admin.tlsstress.art<br/>(App Runner)"]
    MKT["tlsstress.art<br/>(Cloudflare Workers)"]
    F2["f2.tlsstress.art<br/>(Hetzner cell-0)"]
  end

  subgraph Providers["SaaS provider APIs (free / paid)"]
    CF["Cloudflare<br/>GraphQL Analytics"]
    ST["Stripe API"]
    PL["Plausible Stats<br/>(ready / off)"]
    CW["AWS CloudWatch"]
  end

  OP(["Operator"])

  subgraph OBS["obs.tlsstress.art — Hetzner VM"]
    GRAF["Grafana"]
  end

  APP & ADM & MKT & F2 -. "synthetics (blackbox)" .-> OBS
  APP & ADM -. "logs (Firehose)¹" .-> OBS
  APP & ADM -. "traces (OTLP)¹" .-> OBS
  CF & ST & PL --> OBS
  CW --> GRAF
  OP -->|"Auth0 SSO"| GRAF

  classDef ready stroke-dasharray:4 3;
  class PL ready;

¹ Logs (Firehose) and traces (OTLP) are wired but not yet flowing; they activate once the AWS subscription filters and the apps' OTLP endpoints are pointed at this VM (see §6).


3. Container diagram (C4 level 2)

flowchart TB
  subgraph VM["Hetzner cx23 · /opt/obs · Docker Compose"]
    direction TB
    CADDY["Caddy 2.8<br/>:80 :443 — auto-TLS"]

    subgraph EDGE["Public routes"]
      LAND["/ → landing (static)"]
      GF["/grafana/* → Grafana"]
      ING["/ingest* → Vector"]
    end

    subgraph COLLECT["Collectors"]
      BB["blackbox<br/>synthetics"]
      NE["node-exporter<br/>+ textfile"]
      CAD["cAdvisor"]
      POLL["cloud-poller<br/>CF·Stripe·Plausible"]
      VEC["Vector<br/>log intake"]
    end

    subgraph STORE["Stores (filesystem)"]
      PROM["Prometheus<br/>TSDB 30d"]
      LOKI["Loki<br/>logs 30d"]
      TEMPO["Tempo<br/>traces 7d"]
    end

    GRAFANA["Grafana 11<br/>dashboards + alerting"]
  end

  CADDY --> LAND & GF & ING
  GF --> GRAFANA
  ING --> VEC
  BB & NE & CAD --> PROM
  POLL -->|textfile| NE
  VEC --> LOKI
  TEMPO -->|span-metrics<br/>remote_write| PROM
  GRAFANA --> PROM & LOKI & TEMPO

Key boundary: only Caddy binds host ports (80/443). Every other service is reachable solely on the internal tlsstress-obs_default Docker network; its ports are exposed to that network and never published to the host. The host firewall (Hetzner, Terraform-managed) allows only 22/80/443.


4. The five signals → which component

| Signal | Source | Collector | Store | SRE method |
|---|---|---|---|---|
| Metrics (infra) | host + containers | node-exporter, cAdvisor | Prometheus | USE |
| Synthetics | public URLs | blackbox_exporter | Prometheus | Golden Signals, SLO |
| Logs | CloudWatch/app | Vector (/ingest) | Loki | |
| Traces | apps (OTLP) | Tempo | Tempo (+ Prom span-metrics) | RED |
| Profiles | Go pprof | Alloy (pyroscope.scrape) | Pyroscope | 4th pillar |
| Logs | CloudWatch (app/admin) + CloudTrail | logs-shipper (pull, $0) | Loki | — / SIEM |
| Edge / Business | Cloudflare, Stripe, Plausible | cloud-poller → textfile | Prometheus | business KPIs |
| AWS | CloudWatch | Grafana datasource (direct) | — (queried live) | USE |

5. Dashboards (the "advanced panels")

| Dashboard | UID | What it answers |
|---|---|---|
| Cloud — Overview | cloud-overview | Golden Signals: are the sites up, fast, secure? + edge + funnel + auth logs |
| Infra — Host & Containers (USE) | infra-host-use | Utilization/Saturation/Errors of the VM and every container |
| SLO — Availability & Error Budget | slo-synthetics | SLI 30d, error-budget burn, p50/p95/p99 latency, TLS expiry |
| Business — Revenue & Growth | business-revenue | MRR, active subs, daily volume, failed payments, disputes, edge, product |
| AWS — Infra (App Runner · RDS · Lambda) | aws-infra | App Runner reqs/5xx/latency/CPU/mem, RDS CPU/conns/storage, Lambda crons |
| RUM — Frontend (Faro) | rum-frontend | real-user Web Vitals (LCP/TTFB/CLS) + JS exceptions, per app |
| Synthetics — Vantages & Journeys | synthetics-vantages | multi-vantage probes (Helsinki box, Brazil, CF edge) + k6 journey, challenge-aware |
| TLS Posture — Certs, Versions, Chain | tls-posture | cert days-to-expiry, negotiated TLS version, handshake duration per surface |
| Backups & DR (R2) | backups-dr | backup freshness/result, R2 usage vs free tier, per-source bytes, bucket-exposure |

These mirror what large orgs run: an executive/business board, an SLO/SRE board, an infra/USE board, a golden-signals overview, a cloud-provider (AWS) board, plus RUM, multi-vantage synthetics, TLS posture, and a backups/DR board — 9 dashboards.
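
The 30-day SLI and error-budget figures on the SLO board reduce to simple arithmetic over probe_success samples. The sketch below illustrates that arithmetic only; it is not the dashboard's actual query. The 99.9% target, the `error_budget` helper and the sample data are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetReport:
    sli: float              # observed availability over the window, 0..1
    budget_total: float     # allowed failure ratio, i.e. 1 - target
    budget_consumed: float  # fraction of the budget already spent (can exceed 1)


def error_budget(probe_results: list[int], slo_target: float = 0.999) -> BudgetReport:
    """Summarise a window of probe_success samples (1 = up, 0 = down).

    slo_target defaults to 99.9%, a hypothetical value chosen for this example.
    """
    if not probe_results:
        raise ValueError("no samples in window")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    sli = sum(probe_results) / len(probe_results)
    budget_total = 1.0 - slo_target
    budget_consumed = (1.0 - sli) / budget_total
    return BudgetReport(sli, budget_total, budget_consumed)


# 30 days of 30-second probes is 86,400 samples; 43 failed probes is about 21.5 minutes.
samples = [1] * (86_400 - 43) + [0] * 43
report = error_budget(samples)
print(f"SLI={report.sli:.5f} budget consumed={report.budget_consumed:.1%}")
```

With these made-up numbers roughly half of the monthly budget is spent, which is the kind of reading the error-budget burn panel is meant to surface.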


6. Data flows

6.1 Synthetics (active today)

sequenceDiagram
  participant P as Prometheus
  participant B as blackbox
  participant S as Public site
  loop every 30s
    P->>B: /probe?target=https://app.tlsstress.art
    B->>S: HTTP GET (TLS)
    S-->>B: 200 + cert
    B-->>P: probe_success, probe_duration_seconds, probe_ssl_earliest_cert_expiry
  end

6.2 Pulled SaaS/edge metrics (Cloudflare active; Stripe/Plausible ready)

sequenceDiagram
  participant C as cloud-poller
  participant CF as Cloudflare GraphQL
  participant ST as Stripe API
  participant TF as /textfile/cloud.prom
  participant NE as node-exporter
  participant P as Prometheus
  loop every 300s
    C->>CF: GraphQL httpRequests1dGroups
    C->>ST: balance, subscriptions, charges, disputes
    C->>TF: write cf_*, stripe_* (atomic)
  end
  P->>NE: scrape :9100
  NE-->>P: textfile metrics
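
The "(atomic)" write matters because node-exporter may read the textfile while the poller is still writing it. A common way to get that guarantee is to write a temporary file in the same directory and rename it over the target. The sketch below assumes that approach; the `write_textfile_atomic` helper and the two metric names are hypothetical and not taken from the cloud-poller source.

```python
import os
import tempfile
from pathlib import Path


def write_textfile_atomic(path: Path, metrics: dict[str, float]) -> None:
    """Render gauges in Prometheus text format and rename them into place."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    body = "\n".join(lines) + "\n"

    # Same directory keeps the rename on one filesystem; the suffix is not
    # ".prom", so the textfile collector ignores the half-written file.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=".cloud-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(body)
            fh.flush()
            os.fsync(fh.fileno())
        os.chmod(tmp_name, 0o644)   # mkstemp creates 0600; the reader may be another user
        os.replace(tmp_name, path)  # atomic swap: readers see the old or the new file
    except BaseException:
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


# Demo in a throwaway directory with hypothetical cf_* / stripe_* metric names.
demo_dir = Path(tempfile.mkdtemp())
target = demo_dir / "cloud.prom"
write_textfile_atomic(target, {"cf_requests_24h": 1234.0, "stripe_active_subscriptions": 7.0})
print(target.read_text(), end="")
```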

6.3 Logs (wired, activates with AWS Firehose)

flowchart LR
  CWL["CloudWatch Logs<br/>(signin/signup/app/admin)"]
  -->|subscription filter| FH["Kinesis Firehose<br/>HTTP delivery"]
  -->|"POST /ingest<br/>X-Amz-Firehose-Access-Key"| CADDY[Caddy]
  --> VEC["Vector<br/>remap: tag auth events"]
  --> LOKI[Loki]
  --> GRAF[Grafana logs panel]
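
In this design Vector does the decoding, so nothing like the following needs to be written by hand. The sketch only illustrates the wire format that arrives at /ingest: Firehose sends a JSON envelope whose records are base64-encoded, and CloudWatch Logs subscription payloads inside them are gzip-compressed JSON. The `decode_firehose_request` helper, the key value and the log group name are assumptions made for the example.

```python
import base64
import gzip
import hmac
import json


def decode_firehose_request(body: bytes, presented_key: str, expected_key: str) -> list[dict]:
    """Check the access key, then unpack CloudWatch log events from a Firehose POST."""
    # Constant-time comparison of the X-Amz-Firehose-Access-Key header value.
    if not hmac.compare_digest(presented_key.encode(), expected_key.encode()):
        raise PermissionError("bad Firehose access key")

    envelope = json.loads(body)
    events = []
    for record in envelope["records"]:
        payload = json.loads(gzip.decompress(base64.b64decode(record["data"])))
        if payload.get("messageType") != "DATA_MESSAGE":
            continue  # skip CONTROL_MESSAGE health checks
        for event in payload["logEvents"]:
            events.append({
                "log_group": payload["logGroup"],
                "timestamp_ms": event["timestamp"],
                "message": event["message"],
            })
    return events


# Build a sample request the way CloudWatch Logs and Firehose would (values are made up).
cwl_payload = {
    "messageType": "DATA_MESSAGE",
    "logGroup": "/hypothetical/admin",
    "logStream": "stream-1",
    "logEvents": [{"id": "1", "timestamp": 1700000000000, "message": "signin ok"}],
}
record = base64.b64encode(gzip.compress(json.dumps(cwl_payload).encode())).decode()
body = json.dumps(
    {"requestId": "req-1", "timestamp": 1700000000123, "records": [{"data": record}]}
).encode()
events = decode_firehose_request(body, "s3cret", "s3cret")
print(events)
```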

6.4 Traces (wired, activates when apps emit OTLP)

Point each app's OTEL_EXPORTER_OTLP_ENDPOINT at the VM (Tempo :4317/:4318). Tempo generates span-metrics + a service graph and remote_writes them into Prometheus, so RED metrics appear even before custom dashboards.
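
To make "RED metrics from spans" concrete, the sketch below derives rate, error ratio and a p95 duration from a list of finished spans. It is a plain illustration of the idea, not how Tempo's metrics generator is implemented (Tempo emits histogram and counter series rather than exact percentiles). The `Span` record, `red_summary` helper and sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    service: str
    duration_ms: float
    is_error: bool


def red_summary(spans: list[Span], window_seconds: float) -> dict[str, dict[str, float]]:
    """Per-service Rate, Errors and Duration (p95, nearest-rank) over one window."""
    by_service: dict[str, list[Span]] = {}
    for span in spans:
        by_service.setdefault(span.service, []).append(span)

    summary = {}
    for service, items in by_service.items():
        durations = sorted(s.duration_ms for s in items)
        # Nearest-rank percentile: rank = ceil(0.95 * n), computed with integers.
        rank = (95 * len(durations) + 99) // 100
        summary[service] = {
            "rate_per_s": len(items) / window_seconds,
            "error_ratio": sum(s.is_error for s in items) / len(items),
            "p95_ms": durations[max(0, rank - 1)],
        }
    return summary


# Twenty spans in one minute for a hypothetical "admin" service, one of them failed.
spans = [Span("admin", float(10 * i), is_error=(i == 20)) for i in range(1, 21)]
red = red_summary(spans, window_seconds=60)
print(red["admin"])
```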

6.5 DB-down detection + self-healing (closed loop)

Hardening that followed the 2026-06-16 admin "DB indisponível" ("DB unavailable") incident, in which a stale DB credential broke database access while the shallow /api/health endpoint kept returning a graceful 200, so the outage stayed invisible for ~33h:

flowchart LR
  subgraph APP["App Runner apps"]
    R["/api/ready<br/>(SELECT 1 per DB → 503)"]
  end
  BB["blackbox-ready<br/>(Prometheus)"]
  AL["AppDBDown<br/>(Alertmanager → Slack)"]
  SH["obs-db-selfheal<br/>(Lambda, EventBridge 5m)"]
  R -->|probe| BB --> AL
  SH -->|probe /api/ready| R
  SH -->|"503 → StartDeployment"| APP

  • Detect: blackbox probes the deep /api/ready endpoint, which runs SELECT 1 on every DB and returns 503 if any is down; AppDBDown then pages the operator. The shallow /api/health liveness check is kept separate.
  • Self-heal: the obs-db-selfheal Lambda (no-VPC, scheduled) redeploys the App Runner service when it sees a 503, which re-fetches the DB secret and rebuilds the pool. It is cooldown-guarded and acts only on a 503. The alert is the human backstop if the auto-heal cannot recover.
  • Prevent: apps connect as dedicated non-master roles, so RDS master-secret rotation cannot break them; the off-AWS RDS backup now dumps every DB (§ DR-RESTORE).
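
The self-heal rule above (act only on a 503, and never more often than the cooldown allows) can be sketched as a small pure function plus a thin driver. This is a minimal sketch, not the obs-db-selfheal source: the 15-minute cooldown, the function names and the injected `probe` / `redeploy` callables are assumptions. In a real Lambda, `redeploy` would typically wrap the boto3 App Runner client's `start_deployment(ServiceArn=...)` call.

```python
from typing import Callable, Optional

COOLDOWN_SECONDS = 900  # hypothetical: at most one redeploy every 15 minutes


def should_heal(
    status_code: int,
    last_heal_epoch: Optional[float],
    now_epoch: float,
    cooldown_seconds: float = COOLDOWN_SECONDS,
) -> bool:
    """Heal only on a 503 from /api/ready, and only outside the cooldown window."""
    if status_code != 503:
        return False  # 200 is healthy; anything else is left to the alert path
    if last_heal_epoch is not None and now_epoch - last_heal_epoch < cooldown_seconds:
        return False  # a redeploy was triggered recently; give it time to finish
    return True


def run_once(
    probe: Callable[[], int],
    redeploy: Callable[[], None],
    last_heal_epoch: Optional[float],
    now_epoch: float,
) -> Optional[float]:
    """One scheduled tick; returns the (possibly updated) last-heal timestamp."""
    if should_heal(probe(), last_heal_epoch, now_epoch):
        redeploy()
        return now_epoch
    return last_heal_epoch


# Simulated ticks five minutes apart while /api/ready keeps answering 503.
calls: list[str] = []
state: Optional[float] = None
for tick in (0.0, 300.0, 600.0, 900.0):
    state = run_once(lambda: 503, lambda: calls.append("redeploy"), state, tick)
print(calls, state)
```

Keeping the decision in a pure function makes the cooldown behaviour easy to unit-test without touching AWS.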

7. Authentication & SSO (high level)

Grafana delegates login to the same Auth0 tenant as admin.tlsstress.art via OpenID Connect (Grafana "generic OAuth"). Because Auth0 keeps a tenant-wide SSO session cookie, a user already authenticated at admin.tlsstress.art is returned to Grafana silently — no second login.

sequenceDiagram
  actor U as Operator
  participant A as admin.tlsstress.art
  participant Z as Auth0 tenant
  participant G as Grafana (/grafana)
  U->>A: login (user/pass + MFA)
  A->>Z: OIDC
  Z-->>A: session established (SSO cookie set)
  Note over U,G: later, same browser
  U->>G: open /grafana
  G->>Z: /authorize (auto_login)
  Z-->>G: silent assent (SSO cookie present) — NO prompt
  G-->>U: logged in

Full design, the 1-time Auth0 app step, and role mapping: SSO-AUTH0.


8. Risks & non-functionals

| Concern | Posture | Mitigation / roadmap |
|---|---|---|
| SPOF (single VM) | accepted for cost | off-box daily backup → Cloudflare R2 (backup-r2); HA mirrors ADR-0105 |
| Data durability / DR | filesystem volumes | daily off-box backup → R2 of obs+CF+Stripe+AWS(RDS dump/tfstate/TBI)+GitHub+Auth0, size-aware ≤10 GB free tier, monitored (backups-dr + backup_dr alerts). Full restore: DR-RESTORE-RUNBOOK |
| Secret exposure | .env 0600, gitignored | dedicated read-only keys per source (Stripe rk_, AWS RO obs-backup-ro, GitHub PAT, Auth0 M2M, R2 token); no secret VALUES in R2 (only names/ARNs) |
| Backup tamper / public exposure | R2 bucket private (no r2.dev/custom domain; anon GET=400) | backup_r2_bucket_public metric + R2BucketPublic critical alert; encrypted at rest + TLS |
| Public attack surface | only 80/443 | Caddy security headers, Hetzner firewall, no published service ports |
| Cost creep | free tiers | CF Free, Plausible deferred until profitable, Stripe API free |
| Alert fatigue | curated rules | golden-signals + business only; see prometheus/alerts.yml |

9. Cost model

| Item | Monthly |
|---|---|
| Hetzner cx23 (VM + IPv4) | ≈ €5 |
| Cloudflare Analytics (Free) | €0 |
| Stripe API | €0 |
| AWS CloudWatch (read API calls) | ≈ €0 (negligible) |
| Plausible | €0 today (deferred; ~US$9/mo when activated) |
| Total | ≈ €5/mo |

Compare: AWS OpenSearch production domain ≈ US$ 700+/mo. See ADR-0106 §alternatives.