GUIDES — How-to for the Observability Platform

Task recipes. Architecture: HLD · components: LLD · operations: RUNBOOK · SSO: SSO-AUTH0. Edit files under observability/cloud/ in the repo, then deploy per RUNBOOK §1.

G1. Add a synthetic target (monitor a new URL)

Edit observability/cloud/prometheus/prometheus.yml, job blackbox-http (public sites, strict 2xx) or blackbox-edge (infra, allows 4xx):

    static_configs:
      - targets:
          - https://app.tlsstress.art
          - https://your-new-url.tlsstress.art   # ← add

Deploy and reload: scp … && docker compose kill -s SIGHUP prometheus. The new target appears in the Cloud — Overview and SLO dashboards automatically, because they template by instance.

G2. Add an alert rule

Append to observability/cloud/prometheus/alerts.yml under a group:

      - alert: MyAlert
        expr: <promql> > <threshold>
        for: 10m
        labels: { severity: warning }
        annotations: { summary: "..." }

Deploy, then reload: docker compose kill -s SIGHUP prometheus. Check the rule in …/grafana → Alerting, or at http://prometheus:9090/rules (internal). Wire a contact point (Slack/email/webhook) in Grafana → Alerting → Contact points.

G3. Add a dashboard

Drop a Grafana dashboard JSON into observability/cloud/grafana/dashboards/. Rules: unique uid, datasource refs by uid (prometheus/loki/tempo/cloudwatch), schemaVersion: 39. Validate locally: python3 -c "import json;json.load(open('x.json'))". scp to /opt/obs/grafana/dashboards/ — provisioner picks it up in ~10s.

To export one you built in the UI: Dashboard → Share → Export → Save to file, then commit it (set a stable uid).

CloudWatch panels: each target needs the full builder schema — metricQueryType:0, metricEditorMode:0, queryMode:"Metrics", explicit region (not "default"), dimensions:{Key:["*"]}, matchExact:false. Without them Grafana 11 renders "No data" (it defaults to Metrics-Insights SQL mode). Copy a target from aws-infra.json.
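The rules above lend themselves to a pre-commit check. The following is a minimal Python sketch, not part of the repo: `lint_dashboard`, `lint_file` and the helper names are hypothetical, and it assumes panels reference datasources as `{"uid": ...}` objects. It covers the uid, schemaVersion and CloudWatch builder fields listed above; uid uniqueness across files would need a second pass over the whole dashboards directory.

```python
import json

# Datasource uids named in this guide.
ALLOWED_DATASOURCE_UIDS = {"prometheus", "loki", "tempo", "cloudwatch"}

# Builder-mode fields this guide requires on every CloudWatch target.
REQUIRED_CLOUDWATCH_FIELDS = {
    "metricQueryType": 0,
    "metricEditorMode": 0,
    "queryMode": "Metrics",
    "matchExact": False,
}


def same(got, want):
    # bool is a subclass of int in Python, so compare types as well as values.
    return type(got) is type(want) and got == want


def iter_panels(panels):
    """Yield every panel, including panels nested inside collapsed rows."""
    for panel in panels:
        yield panel
        yield from iter_panels(panel.get("panels", []))


def lint_dashboard(dashboard):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if not dashboard.get("uid"):
        problems.append("missing uid")
    if dashboard.get("schemaVersion") != 39:
        problems.append("schemaVersion is not 39")
    for panel in iter_panels(dashboard.get("panels", [])):
        title = panel.get("title", "<untitled>")
        ds = panel.get("datasource")
        uid = ds.get("uid") if isinstance(ds, dict) else None
        if ds is not None and uid not in ALLOWED_DATASOURCE_UIDS:
            problems.append(f"{title}: datasource not referenced by a known uid")
        if uid != "cloudwatch":
            continue
        for target in panel.get("targets", []):
            for key, want in REQUIRED_CLOUDWATCH_FIELDS.items():
                if not same(target.get(key), want):
                    problems.append(f"{title}: {key} should be {want!r}")
            if target.get("region") in (None, "", "default"):
                problems.append(f"{title}: region must be explicit")
            if not isinstance(target.get("dimensions"), dict):
                problems.append(f"{title}: dimensions must be an object")
    return problems


def lint_file(path):
    """Parse a dashboard JSON file and lint it."""
    with open(path, encoding="utf-8") as handle:
        return lint_dashboard(json.load(handle))
```

Calling `lint_file("x.json")` before the scp step would surface a missing builder field locally instead of as a "No data" panel in Grafana.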

G4. Add a data source

Edit observability/cloud/grafana/provisioning/datasources/datasources.yml, add an entry, scp, docker compose up -d --force-recreate grafana. Secrets via secureJsonData sourced from env (never inline).

G5. Activate Plausible (when profitable)

Plausible is wired but off (its API needs a paid plan). When tlsstress.art is profitable:

Upgrade the Plausible plan; create a Stats API key.

In the .env on the box:

PLAUSIBLE_API_KEY=<key>
PLAUSIBLE_SITE_IDS=tlsstress.art,app.tlsstress.art

docker compose up -d --force-recreate cloud-poller.
The Business — Revenue & Growth + Cloud — Overview "Product/funnel" panels light up; plausible_scrape_up → 1.

G6. Activate Stripe monitoring

Stripe Dashboard → Developers → API keys → Create restricted key → set every resource to Read (no write). Copy rk_live_….
.env: STRIPE_API_KEY=rk_live_… → docker compose up -d --force-recreate cloud-poller.
Business — Revenue & Growth populates (MRR, subs, volume, disputes); stripe_scrape_up → 1.

Use a dedicated read-only key, not the app's live secret key. Amounts are in cents (dashboards divide by 100).

G7. Activate AWS CloudWatch — ✅ LIVE

Done in production. IAM user obs-cloudwatch-ro (CloudWatchReadOnlyAccess), keys in Secrets Manager tlsstress-obs/aws-ro, datasource health OK. The AWS — Infra dashboard (aws-infra) shows App Runner (reqs/5xx/latency/CPU/mem), RDS (CPU/conns/storage) and Lambda crons (invocations/errors/duration).

To re-create from scratch:

1. IAM → user with CloudWatchReadOnlyAccess only → access key.
2. .env: AWS_RO_ACCESS_KEY_ID / AWS_RO_SECRET_ACCESS_KEY / AWS_REGION.
3. docker compose up -d --force-recreate grafana.

The datasource uses authType: keys + secureJsonData + defaultRegion: $AWS_REGION. Don't use ${VAR:-default} in provisioning (not interpolated → "missing default region").

G8. Wire logs (AWS → Loki)

CloudWatch Logs → log group → Subscription filter → Kinesis Firehose.
Firehose HTTP endpoint destination: https://obs.tlsstress.art/ingest, access key = INGEST_TOKEN (sent as X-Amz-Firehose-Access-Key).
Add the token check in caddy/Caddyfile /ingest handler before going live.
Logs land in Loki; the Cloud — Overview "auth stream" panel filters {stack="cloud", security="auth"} (signin/signup/admin.login).
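To make the token check and the payload shape concrete, here is a hedged Python sketch of what a receiver on /ingest has to do. It is illustrative only: the real check belongs in the Caddyfile as noted above, the function names are hypothetical, and the header lookup assumes the header name arrives with exactly this casing. The request and response shapes follow the Firehose HTTP endpoint delivery format, and CloudWatch Logs subscription records are gzip-compressed JSON.

```python
import base64
import gzip
import hmac
import json


def is_authorized(headers, expected_token):
    """Constant-time comparison of the Firehose access-key header."""
    supplied = headers.get("X-Amz-Firehose-Access-Key", "")
    return hmac.compare_digest(supplied.encode(), expected_token.encode())


def decode_log_messages(body):
    """Return (requestId, [log lines]) from a Firehose delivery request body."""
    request = json.loads(body)
    messages = []
    for record in request.get("records", []):
        raw = base64.b64decode(record["data"])
        # CloudWatch Logs subscription payloads are gzip-compressed JSON.
        payload = json.loads(gzip.decompress(raw))
        if payload.get("messageType") != "DATA_MESSAGE":
            continue  # skip CONTROL_MESSAGE health probes
        messages.extend(event["message"] for event in payload.get("logEvents", []))
    return request.get("requestId"), messages


def ack(request_id, now_ms):
    """Firehose expects the requestId echoed back with a timestamp."""
    return json.dumps({"requestId": request_id, "timestamp": now_ms})
```

A request that fails `is_authorized` should be rejected before any decoding, so that an unauthenticated caller cannot make the box decompress arbitrary payloads.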

G9. Wire traces (apps → Tempo)

The OTLP ingress is live and token-gated at https://obs.tlsstress.art/otlp (Caddy → Tempo OTLP-HTTP; 401 without the bearer). Token is in Secrets Manager tlsstress-obs/otlp-token. The apps already export OTLP when OTEL_EXPORTER_OTLP_ENDPOINT is set (customer-app/src/lib/observability/otel.ts).

To activate, set on each App Runner service (runtime env — gated):

OTEL_EXPORTER_OTLP_ENDPOINT = https://obs.tlsstress.art/otlp
OTEL_EXPORTER_OTLP_HEADERS  = Authorization=Bearer <tlsstress-obs/otlp-token>

Then redeploy. Tempo auto-generates RED span-metrics + a service graph into Prometheus, and traces correlate with logs (Loki) and metrics (exemplars).

Open-sink protection: the ingress requires the bearer; never expose Tempo 4317/4318 publicly without it.

G10. PromQL / LogQL cookbook

# availability SLI (30d) for a site
avg_over_time(probe_success{instance="https://app.tlsstress.art"}[30d]) * 100
# error-budget burn (1h) vs 99.9% SLO
(1 - avg_over_time(probe_success[1h])) / (1 - 0.999)
# p95 latency
quantile_over_time(0.95, probe_duration_seconds[1h])
# container CPU cores
sum by (name) (rate(container_cpu_usage_seconds_total{name!=""}[5m]))
# MRR (BRL)
sum(stripe_mrr_cents)/100
# cache hit ratio
cf_cached_requests_total / clamp_min(cf_requests_total,1)

# auth events (signin/signup/admin)
{stack="cloud", security="auth"}
# errors across the cloud stream
{stack="cloud"} |= "error" | json
# CloudTrail audit (SIEM)
{source_app="cloudtrail"}
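The burn-rate query above is easier to read with the arithmetic spelled out. This small Python sketch (function names are illustrative) mirrors the PromQL expression: the observed error ratio divided by the error ratio the SLO allows.

```python
def burn_rate(availability, slo=0.999):
    """Observed error ratio divided by the error ratio the SLO allows."""
    return (1.0 - availability) / (1.0 - slo)


def allowed_downtime_minutes(slo=0.999, window_days=30):
    """Total error budget for the window, expressed in minutes."""
    return window_days * 24 * 60 * (1.0 - slo)


# 99.8% availability over the last hour burns budget at about twice the
# sustainable rate; 99.9% over 30 days allows about 43.2 minutes of downtime.
print(round(burn_rate(0.998), 3))
print(round(allowed_downtime_minutes(), 1))
```

A burn rate of 1 means the budget lasts exactly the SLO window; anything above 1 exhausts it early.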

G11. RUM — frontend (Faro) — ✅ LIVE

Shipped: a FaroInit client component (dynamic import → lazy chunk) is wired into customer-app and admin-console (deployed) and marketing (merged, pending a Cloudflare deploy). It auto-points at the prod collector on *.tlsstress.art hosts; data lands in Loki {source_app="rum"}. CSP connect-src allows https://obs.tlsstress.art in each app. Reference pattern (src/components/FaroInit.tsx):

import { initializeFaro } from '@grafana/faro-web-sdk';
initializeFaro({ url: 'https://obs.tlsstress.art/faro/collect', app: { name: 'customer-app' } });

JS errors, Core Web Vitals and sessions then land in Loki (source_app="rum"). CORS is already allowed for the app domains (see alloy/config.alloy).

G12. Paging (Alertmanager → Slack) — ✅ LIVE

Prometheus alerts route to Slack via the default receiver's slack_configs (rich format, colour by severity). The incoming-webhook URL lives in alertmanager/secrets/slack_webhook (gitignored). To (re)point it:

printf '%s' 'https://hooks.slack.com/services/XXX' > /opt/obs/alertmanager/secrets/slack_webhook
chmod 644 /opt/obs/alertmanager/secrets/slack_webhook   # container is non-root
cd /opt/obs && docker compose restart alertmanager

⚠️ A Slack incoming webhook needs the {"text":…} shape, so use slack_configs (Alertmanager renders it), not the generic webhook_configs (which posts raw JSON that Slack rejects). Test path: POST /api/v2/alerts, then check that alertmanager_notifications_total{integration="slack"} climbs while alertmanager_notifications_failed_total{integration="slack"} stays flat.
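One way to exercise that test path is to post a short-lived synthetic alert. The sketch below is an assumption-laden example rather than a repo script: the alert name, summary text and function names are made up, and the base URL reuses the internal http://alertmanager:9093 address used elsewhere in this guide. The payload follows the Alertmanager v2 API, which takes a JSON array of alerts.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(name="SlackRouteTest", severity="warning", minutes=5, now=None):
    """Build a one-element alert list that expires after `minutes`."""
    now = now or datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": name, "severity": severity},
        "annotations": {"summary": "Manual test of the Slack route"},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def post_alerts(alerts, base_url="http://alertmanager:9093"):
    """POST the alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Run `post_alerts(build_test_alert())` from a container on the same Docker network, then watch the two counters named above.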

Routing lives in routes: (split critical→pager, warning→chat later). For PagerDuty/Opsgenie use a pagerduty_configs/opsgenie_configs receiver; for Discord, discord_configs. The Watchdog stays on the deadman route (healthchecks.io), not Slack.

G13. Dead-man's-switch — ✅ LIVE

The Watchdog alert always fires and is routed (active deadman route in alertmanager/alertmanager.yml) to a healthchecks.io heartbeat every ~1 min; if the pings stop (whole box down), healthchecks.io pages you. To (re)point it:

1. Create a free check at healthchecks.io (Period 5m / Grace 5m) → copy its ping URL.
2. printf '%s' '<ping-url>' > /opt/obs/alertmanager/secrets/deadman_url
3. chmod 644 /opt/obs/alertmanager/secrets/* (⚠️ the Alertmanager container runs as non-root; url_files at mode 600 (root-only) fail with read url_file: permission denied. 644 is fine on this single-tenant box.)
4. docker compose restart alertmanager.

Verify: curl -s http://alertmanager:9093/metrics | grep 'notifications_.*webhook'. The notifications_total counter should climb while failed_total stays flat.
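The same verification can be scripted by comparing two scrapes of /metrics taken a few minutes apart. This is a rough Python sketch with hypothetical function names; it uses a deliberately simple line parser that assumes label values contain no `}` character, and it sums across any extra labels on the counters.

```python
import re

# One exposition-format sample line: name, optional {labels}, value.
SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)


def sum_metric(text, name, integration):
    """Sum all samples of `name` carrying the given integration label."""
    total = 0.0
    for line in text.splitlines():
        match = SAMPLE.match(line)
        if not match or match.group("name") != name:
            continue
        if f'integration="{integration}"' not in (match.group("labels") or ""):
            continue
        total += float(match.group("value"))
    return total


def deadman_healthy(before, after, integration="webhook"):
    """True when deliveries climbed and failures stayed flat between scrapes."""
    sent = "alertmanager_notifications_total"
    failed = "alertmanager_notifications_failed_total"
    climbed = sum_metric(after, sent, integration) > sum_metric(before, sent, integration)
    flat = sum_metric(after, failed, integration) == sum_metric(before, failed, integration)
    return climbed and flat
```

Feed it the text of two `curl -s http://alertmanager:9093/metrics` outputs; a `False` result points at either a stalled Watchdog or failing deliveries.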

G15. AWS cron-Lambda monitoring (healthchecks.io, $0, no Lambda change)

aws-cron-monitor (obs box, boto3 + RO key) reads each cron Lambda's CloudWatch Invocations/Errors and pings a per-cron healthchecks.io check: ran clean → ping; errors → /fail; didn't run → no ping → healthchecks alerts on the missed schedule. 8 crons monitored (schedules from EventBridge Scheduler). Config (function → ping_url + lookback) is exporters/aws-cron-checks.json (box-only, gitignored because it contains ping URLs; created via the Management API).

- Add a cron: create a check (POST /api/v1/checks/ with the EventBridge cron as schedule), add {function: {url, lookback}} to the config, restart the service.
- Healthchecks status badges (signed public SVGs from GET /api/v1/badges/) are embedded on the landing (landing/index.html).
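The three-way decision described above fits in a few lines. The following Python sketch only illustrates that rule; `ping_target` is a hypothetical name, not the function used in aws-cron-monitor, and the metric sums are assumed to be already fetched for the lookback window.

```python
def ping_target(ping_url, invocations, errors):
    """Return the URL to ping, or None when no ping should be sent."""
    if invocations <= 0:
        # The cron did not run: stay silent so healthchecks.io flags the miss.
        return None
    if errors > 0:
        # The cron ran but failed: signal failure explicitly.
        return ping_url.rstrip("/") + "/fail"
    # The cron ran clean: plain success ping.
    return ping_url
```

Staying silent on a missed run is what lets the schedule configured on the healthchecks.io side do the alerting, with no change to the Lambda itself.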

G14. Continuous profiling (Pyroscope)

Live: Alloy scrapes the stack's own pprof → Pyroscope (Grafana → Explore → Pyroscope datasource). To profile the SaaS Go services, either push with the Pyroscope Go SDK or add their pprof endpoints to alloy/config.alloy's pyroscope.scrape targets.

G16. Public status page (`/status`) — ✅ LIVE

obs.tlsstress.art/status is a branded, customer-facing page: per-surface uptime (24h/7d/30d), the 30d SLA + error budget, and live status. It is derived from the same blackbox/SLO data as the operator dashboards (single source of truth).

exporters/status_gen.py (stdlib, the status-gen service) queries Prometheus every STATUS_INTERVAL and writes status/status.json.
status/index.html renders it client-side (auto-refresh 30s); Caddy serves it at the vanity domain status.tlsstress.art (own LE cert) and at obs.tlsstress.art/status.
Add a surface: edit the SERVICES list in status_gen.py, docker compose restart status-gen.
Vanity domain (status.tlsstress.art) — live: a Cloudflare DNS-only (grey-cloud) A record status → the box IP (so Caddy can ACME), plus a dedicated status.tlsstress.art site block in the Caddyfile that serves /srv/status at its apex (the brand mark is served from /srv/landing).
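As a rough picture of what a generator like status_gen.py has to compute, the Python sketch below builds the uptime query for one surface, reads the value out of a Prometheus instant-query response, and derives the remaining error budget. Everything here is an assumption for illustration: the function names, the output field names and the 99.9% target (taken from the cookbook in G10) are not confirmed to match the real script or status.json.

```python
WINDOWS = ("24h", "7d", "30d")


def uptime_query(instance, window):
    """PromQL for percentage uptime of one probed instance."""
    return f'avg_over_time(probe_success{{instance="{instance}"}}[{window}]) * 100'


def first_value(response):
    """First sample value of an instant-query response, or None."""
    if response.get("status") != "success":
        return None
    result = response["data"]["result"]
    if not result:
        return None
    # Each vector sample is [timestamp, "value-as-string"].
    return float(result[0]["value"][1])


def surface_status(name, uptime_by_window, slo_pct=99.9):
    """Summarise one surface; field names are hypothetical."""
    budget = 100.0 - slo_pct
    spent = 100.0 - uptime_by_window["30d"]
    remaining = max(0.0, 1.0 - spent / budget) * 100
    return {
        "name": name,
        "uptime": uptime_by_window,
        "error_budget_remaining_pct": round(remaining, 1),
    }
```

Each query string would be sent to the Prometheus HTTP API (`/api/v1/query`) once per window in `WINDOWS`, and the resulting dictionaries serialised into the JSON file the page renders.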

G17. Multi-vantage synthetics — ✅ box live; 🟢/🟡 external one-step

No single-VM blind spot: the same surfaces are probed from several vantages, results land under job=synthetic-edge (external) and synthetic-journey (box), viewable in Synthetics — Vantages.

| Vantage | What | Activate |
| --- | --- | --- |
| `hel1-box` | k6 multi-step journey (`synthetics/journey.js`, `k6-journey` service) | live |
| `br-selfhosted` | GitHub self-hosted runners (Brazil, residential → reaches origin) push via `synthetic-multivantage.yml` | set repo secret `SYNTHETIC_PUSH_TOKEN` (done) |
| `GRU` (colo) | Cloudflare edge worker (`synthetics/edge-worker/`); runs inside CF, reaches the CF-fronted origin | live (deployed; repo secrets `CLOUDFLARE_API_TOKEN`/`_ACCOUNT_ID` drive `edge-synthetic-deploy.yml`) |

Cloudflare managed challenge: the box is a datacenter IP and gets a 403 "Just a moment" (cf-mitigated: challenge) on CF-fronted surfaces. This is Bot Fight Mode, which on the free plan cannot be skipped by a WAF rule. So the box journey classifies CF surfaces as challenged (not down) and verifies non-CF surfaces (F2) for real. The edge worker is the answer: it runs inside Cloudflare, so it reaches the origin (verified: app/tlsstress → 200 from colo GRU). To also let the datacenter box reach the origin, add a CF IP Access Rule → Allow for the box IP (Dashboard → Security → WAF → Tools, or a token with Firewall Services:Edit); Allow rules skip Bot Fight Mode.

Push ingress: Caddy /synthetic-push → Pushgateway, gated by Authorization: Bearer $SYNTHETIC_PUSH_TOKEN; Prometheus scrapes with honor_labels so region/target survive.
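The challenged-versus-down rule is worth pinning down, because a naive probe would report every CF-fronted surface as an outage. The real journey is a k6 script (synthetics/journey.js); the Python below is only a sketch of the classification logic, with a hypothetical function name and a strict-2xx definition of "up" borrowed from the blackbox-http job in G1.

```python
def classify_probe(status, headers):
    """Classify one HTTP probe result as 'up', 'challenged' or 'down'."""
    # Header names are case-insensitive, so normalise before lookup.
    normalized = {key.lower(): value.lower() for key, value in headers.items()}
    if status == 403 and normalized.get("cf-mitigated") == "challenge":
        # Cloudflare blocked the datacenter IP; the origin was never reached.
        return "challenged"
    if 200 <= status < 300:
        return "up"
    return "down"
```

A plain 403 without the `cf-mitigated: challenge` header still counts as down, so a genuine origin-side authorization failure is not masked.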

G18. Security/log alerting (Loki ruler) — ✅ LIVE

The Loki ruler evaluates LogQL alert rules over the SIEM stream and fires to the same Alertmanager → Slack. Rules: loki/rules/fake/security.yml (the tenant is fake because auth_enabled: false). Live: auth-failure spike/flood, AWS root activity, unauthorized burst, IAM/access-key/security-group change.

- Add a rule: append to security.yml (Prometheus-style alert rule with a LogQL expr), docker compose restart loki. Confirm: curl -H "X-Scope-OrgID: fake" loki:3100/loki/api/v1/rules.

G19. Deploy markers (CI → Grafana annotations) — ✅ LIVE

Each customer-app / admin-console App Runner deploy posts a Grafana annotation (tags deploy,<app>) via the composite action .github/actions/grafana-annotate, so a latency/error regression lines up with the release that caused it. Token: repo secret GRAFANA_ANNOTATION_TOKEN (a Grafana Editor service-account token).

- Annotate from another workflow: add a final step uses: ./.github/actions/grafana-annotate with app, version, token.
- Rotate the token: Grafana → Administration → Service accounts → ci-deploy-annotator → add token → update the repo secret.
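For context on what such a deploy marker amounts to, here is a minimal Python sketch of an equivalent call against Grafana's annotations HTTP API. It is not the composite action's implementation: the function names, the annotation text and the `grafana_url` parameter are assumptions, while the tags mirror the deploy,<app> convention above.

```python
import json
import time
import urllib.request


def build_annotation(app, version, when_ms=None):
    """Annotation body: epoch milliseconds, tags and free text."""
    return {
        "time": when_ms if when_ms is not None else int(time.time() * 1000),
        "tags": ["deploy", app],
        "text": f"Deployed {app} {version}",
    }


def post_annotation(grafana_url, token, annotation):
    """POST the annotation with a service-account bearer token."""
    request = urllib.request.Request(
        f"{grafana_url.rstrip('/')}/api/annotations",
        data=json.dumps(annotation).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Because the annotation carries no dashboard or panel id, it is organisation-wide and any dashboard can display it by filtering on the `deploy` tag.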

G20. Predictive alerts — ✅ LIVE

The predictive group (prometheus/alerts.yml) alerts on the trend before the threshold: HostDiskWillFillSoon (predict_linear → 0 within ~4h), HostMemoryWillExhaust (~2h), TLSCertRenewalDue (<30d, gentler than the 14d critical path). Add one with predict_linear(<gauge>[<window>], <horizon_s>) < 0.
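To see why predict_linear(...) < 0 works as a "will run out" test, it helps to reproduce the idea outside Prometheus. The sketch below fits an ordinary least-squares line through (timestamp, value) samples and extrapolates it; it is a simplified stand-in, assuming the evaluation time equals the last sample's timestamp, and the sample numbers are invented.

```python
def predict_linear(samples, horizon_s):
    """Extrapolate a least-squares line `horizon_s` seconds past the last sample.

    `samples` is a list of (timestamp_seconds, value) pairs with at least
    two distinct timestamps.
    """
    count = len(samples)
    mean_t = sum(t for t, _ in samples) / count
    mean_v = sum(v for _, v in samples) / count
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    evaluation_time = samples[-1][0]  # assumption: evaluate at the last sample
    return intercept + slope * (evaluation_time + horizon_s)


# Free disk (GB) falling by 6 GB per minute: still positive 10 minutes out,
# negative (so the alert condition holds) about 17 minutes out.
free_gb = [(0, 100.0), (60, 94.0), (120, 88.0)]
print(predict_linear(free_gb, 600) > 0)
print(predict_linear(free_gb, 1000) < 0)
```

The window controls how much history shapes the slope: a short window reacts to bursts, while a long one smooths them out but is slower to notice a new trend.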

G21. Off-box backup → Cloudflare R2 ($0, 10 GB free) — ✅ LIVE

Off-site DR copy of everything not in git/Secrets Manager. The backup-r2 service (exporters/backup_r2.py) runs daily and uploads to r2://$R2_BUCKET/: obs-box/ (Hetzner state volumes) · cloudflare/ (DNS/WAF/Workers/page-rules) · stripe/ (full RO export) · aws/ (Terraform state + secrets manifest) · tbi/ (client images mirrored from S3) · github/ (3 repos as full-history bundles) · auth0/ (tenant config).

Size-aware retention (age cap KEEP_DAILY + hard R2_MAX_BYTES ≈ 9 GB + a R2_TBI_MAX_BYTES TBI cap) means it can never exceed the free tier (~1.2 GB used). A healthchecks.io dead-man's switch pages on a missed/failed run.

The RDS pg_dump can't run on the box (RDS is private) → it runs as a scheduled in-VPC container Lambda obs-rds-backup-r2 (backup-rds-lambda/ + exporters/backup_aws_r2.py, EventBridge every 6h) → aws/<date>/rds-<db>-<HHMMSSZ>.sql.gz for every DB (postgres + octopus_admin).

The whole R2 bucket is immutable (R2 Bucket Lock, 7d; anti-ransomware: a stolen upload credential can't delete/encrypt backups), and per-source success/freshness is tracked (backup_r2_source_* + R2SourceBackup* alerts). Restore: DR-RESTORE-RUNBOOK.
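The size-aware retention described above combines an age cap with a byte cap. How backup_r2.py implements it is not shown here, so the following Python sketch is one plausible reading with hypothetical names: drop snapshots older than the age cap, then drop the oldest survivors until the total fits, always keeping the newest. It ignores the separate TBI cap and the 7-day Bucket Lock, under which deletes of younger objects would be refused.

```python
from datetime import date, timedelta


def select_for_deletion(snapshots, today, keep_daily, max_bytes):
    """Return the (snapshot_date, size_bytes) entries to delete, oldest first."""
    ordered = sorted(snapshots)  # oldest first
    cutoff = today - timedelta(days=keep_daily)
    # Age cap: everything older than the cutoff goes.
    doomed = [snap for snap in ordered if snap[0] < cutoff]
    kept = [snap for snap in ordered if snap[0] >= cutoff]
    total = sum(size for _, size in kept)
    # Byte cap: evict the oldest survivors, but never the newest snapshot.
    while len(kept) > 1 and total > max_bytes:
        oldest = kept.pop(0)
        doomed.append(oldest)
        total -= oldest[1]
    return doomed
```

Keeping the newest snapshot even when it alone exceeds the cap is a deliberate choice in this sketch: an oversized backup is a better failure mode than no backup.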

Credential scopes (least-privilege). Each source uses a dedicated read-only credential: R2 (derived from the CF token), a read-only AWS IAM user (obs-backup-ro) for S3/SM, a read-only GitHub PAT, a read-only Stripe restricted key, a read-only Auth0 M2M app, and the CF token for DNS/WAF/Workers. Two items need broader read scopes for full coverage (the backup is fail-soft, so the rest still runs): the CF token needs Zone Settings:Read for zone_settings, and the Auth0 M2M app needs read:clients, read:tenant_settings, read:organizations, read:branding, read:prompts (today it has connections/resource-servers/actions/rules/roles).

Activation (one time):

1. Enable R2 on the Cloudflare account (dashboard → R2 → Get Started; free tier = $0 but requires a card on file). Until enabled, the per-account S3 endpoint <account>.r2.cloudflarestorage.com does not serve TLS.
2. Create an R2 API Token (Admin Read & Write) → Access Key ID + Secret. (For a CF API token with R2 perms, the S3 creds are derivable: access_key_id = token id, secret_access_key = SHA-256 of the token value.)
3. .env: R2_ENDPOINT (https://<account>.r2.cloudflarestorage.com), R2_ACCESS_KEY_ID, R2_SECRET_ACCESS_KEY, R2_BUCKET, BACKUP_HEALTHCHECK_URL, CF_BACKUP_TOKEN/CF_ACCOUNT_ID/CF_ZONE_ID.
4. docker compose up -d --force-recreate backup-r2, then a one-shot dry run: docker compose exec -T backup-r2 python /app/backup_r2.py --once.
5. Verify objects: aws --endpoint-url $R2_ENDPOINT --region auto s3 ls --recursive s3://$R2_BUCKET/.

For BACKUP_HEALTHCHECK_URL, create a healthchecks.io check (Period 1d / Grace 6h) and use its ping URL.
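The credential derivation mentioned in step 2 can be done with the standard library. This is a small sketch with a hypothetical function name; it assumes the secret is the lowercase hex encoding of the SHA-256 digest, and the dictionary keys simply reuse the .env variable names from step 3.

```python
import hashlib


def r2_s3_credentials(token_id, token_value):
    """Derive S3-style credentials from a Cloudflare API token."""
    return {
        "R2_ACCESS_KEY_ID": token_id,
        # Assumption: hex-encoded SHA-256 of the raw token value.
        "R2_SECRET_ACCESS_KEY": hashlib.sha256(token_value.encode()).hexdigest(),
    }
```

Since the secret is a pure function of the token value, rotating the Cloudflare token also rotates the S3 secret, and both .env entries need updating together.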

boto3 ≥1.36 adds request checksums some S3-compatibles reject — if uploads fail with a checksum error, set AWS_REQUEST_CHECKSUM_CALCULATION=when_required.

G22. App DB readiness + self-healing — ✅ LIVE

Closes the 2026-06-16 blind spot where a DB outage hid behind a graceful /api/health 200 (see RUNBOOK §6.1). Three layers:

Deep readiness — each app exposes /api/ready (SELECT 1 per DB → 503 if any is unreachable), distinct from the shallow /api/health liveness. On the admin it must be in proxy.ts PUBLIC_PATHS (the proxy gates all /api/*).
Detect — Prometheus blackbox-ready probes both /api/ready URLs; alert AppDBDown (probe_success{job="blackbox-ready"}==0) → Slack.
Self-heal — obs-db-selfheal Lambda (selfheal-lambda/, no-VPC, EventBridge schedule) redeploys the App Runner service on a 503 → re-fetches the DB secret + rebuilds the pool; heals only on 503 (404/403/timeout → no-op), cooldown-guarded.

Test it: point an app at a bad DB credential (or stop RDS in a test env) → curl -s https://<app>.tlsstress.art/api/ready returns 503 → within a few minutes AppDBDown fires and the obs-db-selfheal Lambda triggers a redeploy. Prevention: apps connect as dedicated non-master roles (octopus_admin_app / tlsstress_app), so RDS master-secret rotation never breaks them. Never set an app's DATABASE_URL to the master credentials.
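The self-heal guard (heal only on 503, never inside the cooldown) can be expressed as a pure function, which also makes it easy to unit-test. This Python sketch is illustrative, not the code in selfheal-lambda/: the function name is hypothetical, the 15-minute default cooldown is an invented value, and a timeout is represented as a status of `None`.

```python
def should_redeploy(ready_status, now_s, last_heal_s, cooldown_s=900):
    """Decide whether to redeploy, given the /api/ready probe result.

    `ready_status` is the HTTP status code, or None on timeout.
    `last_heal_s` is the epoch time of the previous heal, or None.
    """
    if ready_status != 503:
        # 200 is healthy; 404/403/timeout are ambiguous, so do nothing.
        return False
    if last_heal_s is not None and now_s - last_heal_s < cooldown_s:
        # A redeploy is probably still in flight; avoid a redeploy loop.
        return False
    return True
```

Restricting the trigger to 503 matters because that status is the only one the readiness endpoint uses to say "a database is unreachable"; a redeploy would not fix a 404 or a proxy-level 403.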