GUIDES — How-to for the Observability Platform
Task recipes. Architecture: HLD · components: LLD · operations: RUNBOOK · SSO: SSO-AUTH0. Edit files under
observability/cloud/ in the repo, then deploy per RUNBOOK §1.
G1. Add a synthetic target (monitor a new URL)
Edit observability/cloud/prometheus/prometheus.yml, job blackbox-http
(public sites, strict 2xx) or blackbox-edge (infra, allows 4xx):
static_configs:
  - targets:
      - https://app.tlsstress.art
      - https://your-new-url.tlsstress.art   # ← add
Deploy + reload: scp … && docker compose kill -s SIGHUP prometheus. The new
target appears in the Cloud — Overview and SLO dashboards automatically (they
template by instance).
G2. Add an alert rule
Append to observability/cloud/prometheus/alerts.yml under a group:
- alert: MyAlert
  expr: <promql> > <threshold>
  for: 10m
  labels: { severity: warning }
  annotations: { summary: "..." }
Deploy + docker compose kill -s SIGHUP prometheus. Check
…/grafana → Alerting, or http://prometheus:9090/rules (internal). Wire a
contact point (Slack/email/webhook) in Grafana → Alerting → Contact points.
G3. Add a dashboard
Drop a Grafana dashboard JSON into observability/cloud/grafana/dashboards/.
Rules: unique uid, datasource refs by uid (prometheus/loki/tempo/cloudwatch),
schemaVersion: 39. Validate locally: python3 -c "import json;json.load(open('x.json'))".
scp to /opt/obs/grafana/dashboards/ — provisioner picks it up in ~10s.
To export one you built in the UI: Dashboard → Share → Export → Save to file,
then commit it (set a stable uid).
CloudWatch panels: each target needs the full builder schema:
metricQueryType: 0, metricEditorMode: 0, queryMode: "Metrics", an explicit region (not "default"), dimensions: {Key: ["*"]}, matchExact: false. Without them Grafana 11 renders "No data" (it defaults to Metrics-Insights SQL mode). Copy a target from aws-infra.json.
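The dashboard rules above (a uid, datasource refs by uid, schemaVersion: 39) can be checked before the scp with something a little stricter than a bare json.load. The following is a minimal sketch, not part of the repo: the function names are hypothetical, and the set of known datasource uids is simply the four named in this guide.

```python
import json

# Datasource uids named in this guide; extend the set if you add one (see G4).
KNOWN_DATASOURCE_UIDS = {"prometheus", "loki", "tempo", "cloudwatch"}
EXPECTED_SCHEMA_VERSION = 39


def check_dashboard(dash: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not dash.get("uid"):
        problems.append("missing uid")
    if dash.get("schemaVersion") != EXPECTED_SCHEMA_VERSION:
        problems.append(
            f"schemaVersion is {dash.get('schemaVersion')!r}, "
            f"expected {EXPECTED_SCHEMA_VERSION}"
        )
    for panel in dash.get("panels", []):
        title = panel.get("title", "<untitled>")
        ds = panel.get("datasource")
        # Only object-style refs are checked; each must carry a known uid.
        if isinstance(ds, dict) and ds.get("uid") not in KNOWN_DATASOURCE_UIDS:
            problems.append(
                f"panel {title!r}: unknown datasource uid {ds.get('uid')!r}"
            )
    return problems


def check_file(path: str) -> list:
    """Parse a dashboard JSON file (raises on invalid JSON) and check it."""
    with open(path, encoding="utf-8") as fh:
        return check_dashboard(json.load(fh))
```

Usage would be along the lines of `print(check_file("x.json"))`. The sketch does not look inside row panels or per-target datasources, and it does not verify that the uid is unique across the dashboards directory.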
G4. Add a data source
Edit observability/cloud/grafana/provisioning/datasources/datasources.yml,
add an entry, scp, docker compose up -d --force-recreate grafana. Secrets via
secureJsonData sourced from env (never inline).
G5. Activate Plausible (when profitable)
Plausible is wired but off (its API needs a paid plan). When tlsstress.art
is profitable:
- Upgrade the Plausible plan; create a Stats API key.
- On the box, in .env: PLAUSIBLE_API_KEY=<key> and PLAUSIBLE_SITE_IDS=tlsstress.art,app.tlsstress.art, then docker compose up -d --force-recreate cloud-poller.
- The Business — Revenue & Growth + Cloud — Overview "Product/funnel" panels light up; plausible_scrape_up → 1.
G6. Activate Stripe monitoring
- Stripe Dashboard → Developers → API keys → Create restricted key → set every resource to Read (no write). Copy rk_live_….
- .env: STRIPE_API_KEY=rk_live_… → up -d --force-recreate cloud-poller.
- Business — Revenue & Growth populates (MRR, subs, volume, disputes); stripe_scrape_up → 1.
Use a dedicated read-only key, not the app's live secret key. Amounts are in cents (dashboards divide by 100).
G7. Activate AWS CloudWatch — ✅ LIVE
Done in production. IAM user obs-cloudwatch-ro (CloudWatchReadOnlyAccess),
keys in Secrets Manager tlsstress-obs/aws-ro, datasource health OK. The
AWS — Infra dashboard (aws-infra) shows App Runner (reqs/5xx/latency/CPU/
mem), RDS (CPU/conns/storage) and Lambda crons (invocations/errors/duration).
To re-create from scratch:
1. IAM → user with CloudWatchReadOnlyAccess only → access key.
2. .env: AWS_RO_ACCESS_KEY_ID / AWS_RO_SECRET_ACCESS_KEY / AWS_REGION.
3. up -d --force-recreate grafana.
The datasource uses authType: keys + secureJsonData + defaultRegion: $AWS_REGION. Don't use ${VAR:-default} in provisioning (not interpolated → "missing default region").
G8. Wire logs (AWS → Loki)
- CloudWatch Logs → log group → Subscription filter → Kinesis Firehose.
- Firehose HTTP endpoint destination: https://obs.tlsstress.art/ingest, access key = INGEST_TOKEN (sent as X-Amz-Firehose-Access-Key).
- Add the token check in the caddy/Caddyfile /ingest handler before going live.
- Logs land in Loki; the Cloud — Overview "auth stream" panel filters {stack="cloud", security="auth"} (signin/signup/admin.login).
G9. Wire traces (apps → Tempo)
The OTLP ingress is live and token-gated at https://obs.tlsstress.art/otlp
(Caddy → Tempo OTLP-HTTP; 401 without the bearer). Token is in Secrets Manager
tlsstress-obs/otlp-token. The apps already export OTLP when
OTEL_EXPORTER_OTLP_ENDPOINT is set (customer-app/src/lib/observability/otel.ts).
To activate, set on each App Runner service (runtime env — gated):
OTEL_EXPORTER_OTLP_ENDPOINT = https://obs.tlsstress.art/otlp
OTEL_EXPORTER_OTLP_HEADERS = Authorization=Bearer <tlsstress-obs/otlp-token>
Open-sink protection: the ingress requires the bearer; never expose Tempo 4317/4318 publicly without it.
G10. PromQL / LogQL cookbook
# availability SLI (30d) for a site
avg_over_time(probe_success{instance="https://app.tlsstress.art"}[30d]) * 100
# error-budget burn (1h) vs 99.9% SLO
(1 - avg_over_time(probe_success[1h])) / (1 - 0.999)
# p95 latency
quantile_over_time(0.95, probe_duration_seconds[1h])
# container CPU cores
sum by (name) (rate(container_cpu_usage_seconds_total{name!=""}[5m]))
# MRR (BRL)
sum(stripe_mrr_cents)/100
# cache hit ratio
cf_cached_requests_total / clamp_min(cf_requests_total,1)
# auth events (signin/signup/admin)
{stack="cloud", security="auth"}
# errors across the cloud stream
{stack="cloud"} |= "error" | json
# CloudTrail audit (SIEM)
{source_app="cloudtrail"}
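To make the error-budget burn query in this cookbook concrete, here is the same arithmetic in plain Python. It is a sketch for intuition only: the function names are hypothetical, and the 30-day window is taken from the availability SLI query above.

```python
def burn_rate(success_ratio: float, slo: float = 0.999) -> float:
    """Observed error ratio divided by the error ratio the SLO allows.

    Mirrors: (1 - avg_over_time(probe_success[1h])) / (1 - 0.999)
    """
    if not 0 <= slo < 1:
        raise ValueError("slo must be in [0, 1)")
    return (1 - success_ratio) / (1 - slo)


def days_until_budget_exhausted(rate: float, window_days: float = 30) -> float:
    """At a constant burn rate, how long a full budget lasts."""
    if rate <= 0:
        return float("inf")  # nothing is being burned
    return window_days / rate
```

A burn rate of 1 spends the budget exactly over the SLO window; 99% success over the last hour against a 99.9% SLO is a burn rate of about 10, which would exhaust a full 30-day budget in roughly 3 days.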
G11. RUM — frontend (Faro) — ✅ LIVE
Shipped: a FaroInit client component (dynamic import → lazy chunk) is wired into
customer-app and admin-console (deployed) and marketing (merged,
pending a Cloudflare deploy). It auto-points at the prod collector on
*.tlsstress.art hosts; data lands in Loki {source_app="rum"}. CSP
connect-src allows https://obs.tlsstress.art in each app. Reference pattern
(src/components/FaroInit.tsx):
import { initializeFaro } from '@grafana/faro-web-sdk';
initializeFaro({ url: 'https://obs.tlsstress.art/faro/collect', app: { name: 'customer-app' } });
CORS is already allowed for the app domains (see alloy/config.alloy).
G12. Paging (Alertmanager → Slack) — ✅ LIVE
Prometheus alerts route to Slack via the default receiver's slack_configs
(rich format, colour by severity). The incoming-webhook URL lives in
alertmanager/secrets/slack_webhook (gitignored). To (re)point it:
printf '%s' 'https://hooks.slack.com/services/XXX' > /opt/obs/alertmanager/secrets/slack_webhook
chmod 644 /opt/obs/alertmanager/secrets/slack_webhook # container is non-root
cd /opt/obs && docker compose restart alertmanager
⚠️ A Slack incoming webhook needs the {"text":…} shape: use slack_configs (Alertmanager renders it), not the generic webhook_configs (posts raw JSON Slack rejects). Test path: POST /api/v2/alerts, then check that alertmanager_notifications_total{integration="slack"} climbs while alertmanager_notifications_failed_total stays flat.
Routing lives in routes: (split critical→pager, warning→chat later). For
PagerDuty/Opsgenie use a pagerduty_configs/opsgenie_configs receiver; for
Discord, discord_configs. The Watchdog stays on the deadman route
(healthchecks.io), not Slack.
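The Slack test path mentioned above (POST /api/v2/alerts) can be exercised with a few lines of stdlib Python. This is a hedged sketch: the alert name, summary text and 5-minute lifetime are made up for illustration, and the base URL is whatever reaches Alertmanager from where you run it (http://alertmanager:9093 on the internal network, as in G13).

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(name="SlackRouteTest", severity="warning", minutes=5, now=None):
    """Build the JSON array the Alertmanager v2 API expects."""
    now = now or datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": name, "severity": severity},
        "annotations": {"summary": "Manual test of the Slack route"},
        "startsAt": now.isoformat(),
        # A short endsAt lets the test alert resolve on its own.
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def post_alerts(base_url: str, alerts: list) -> int:
    """POST the alerts; returns the HTTP status (200 on success)."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

Calling `post_alerts("http://alertmanager:9093", build_test_alert())` should produce one Slack message after the route's group_wait, assuming the labels fall through to the default receiver.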
G13. Dead-man's-switch — ✅ LIVE
The Watchdog alert always fires and is routed (active deadman route in
alertmanager/alertmanager.yml) to a healthchecks.io heartbeat every ~1 min;
if the pings stop (whole box down), healthchecks.io pages you. To (re)point it:
1. Create a free check at healthchecks.io (Period 5m / Grace 5m) → copy its ping URL.
2. printf '%s' '<ping-url>' > /opt/obs/alertmanager/secrets/deadman_url
3. chmod 644 /opt/obs/alertmanager/secrets/* — ⚠️ the Alertmanager container
runs as non-root; url_files at mode 600 (root-only) fail with
read url_file: permission denied. 644 is fine (single-tenant box).
4. docker compose restart alertmanager.
Verify: curl -s http://alertmanager:9093/metrics | grep notifications_.*webhook
— notifications_total should climb while failed_total stays flat.
G15. AWS cron-Lambda monitoring (healthchecks.io, $0, no Lambda change)
aws-cron-monitor (obs box, boto3 + RO key) reads each cron Lambda's CloudWatch
Invocations/Errors and pings a per-cron healthchecks.io check: ran clean → ping;
errors → /fail; didn't run → no ping → healthchecks alerts on the missed
schedule. 8 crons monitored (schedules from EventBridge Scheduler). Config
(function → ping_url + lookback) is exporters/aws-cron-checks.json — box-only,
gitignored (contains ping URLs); created via the Management API.
- Add a cron: create a check (POST /api/v1/checks/ with the EventBridge cron as
schedule), add {function: {url, lookback}} to the config, restart the service.
- Healthchecks status badges (signed public SVGs from GET /api/v1/badges/)
are embedded on the landing (landing/index.html).
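The ping decision described at the top of this section (ran clean → ping, errors → /fail, didn't run → no ping) is small enough to state as code. This is an illustrative sketch, not the aws-cron-monitor source: it takes the CloudWatch Invocations/Errors sums as plain numbers instead of calling boto3, and the function names are hypothetical.

```python
from typing import Optional


def ping_target(ping_url: str, invocations: float, errors: float) -> Optional[str]:
    """Return the URL to hit for this cron, or None to stay silent."""
    if invocations <= 0:
        # Didn't run in the lookback: send nothing, so healthchecks.io
        # raises the alert itself when the schedule is missed.
        return None
    if errors > 0:
        return ping_url.rstrip("/") + "/fail"
    return ping_url


def plan_pings(config: dict, stats: dict) -> dict:
    """config: {function: {"url": ..., "lookback": ...}} as in the guide.
    stats: {function: (invocations, errors)} gathered over each lookback."""
    return {
        fn: ping_target(entry["url"], *stats.get(fn, (0, 0)))
        for fn, entry in config.items()
    }
```

Keeping the "didn't run" case silent is the point of the design: the absence of a ping is the signal, so a dead monitor and a dead cron both surface as a missed check.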
G14. Continuous profiling (Pyroscope)
Live: Alloy scrapes the stack's own pprof → Pyroscope (Grafana → Explore →
Pyroscope datasource). To profile the SaaS Go services, either push with the
Pyroscope Go SDK or add their pprof endpoints to alloy/config.alloy's
pyroscope.scrape targets.
G16. Public status page (/status) — ✅ LIVE
obs.tlsstress.art/status is a branded, customer-facing page: per-surface uptime
(24h/7d/30d), the 30d SLA + error budget, and live status. It is derived from
the same blackbox/SLO data as the operator dashboards (single source of truth).
- exporters/status_gen.py (stdlib, the status-gen service) queries Prometheus every STATUS_INTERVAL and writes status/status.json. status/index.html renders it client-side (auto-refresh 30s); Caddy serves it at the vanity domain status.tlsstress.art (own LE cert) and at obs.tlsstress.art/status.
- Add a surface: edit the SERVICES list in status_gen.py, docker compose restart status-gen.
- Vanity domain (status.tlsstress.art), live: a Cloudflare DNS-only (grey-cloud) A record status → the box IP (so Caddy can ACME), plus a dedicated status.tlsstress.art site block in the Caddyfile that serves /srv/status at its apex (the brand mark is served from /srv/landing).
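For orientation, the per-surface numbers on the status page can be derived roughly as follows. This is a sketch under assumptions, not the contents of status_gen.py: it reuses the availability query from the G10 cookbook, assumes the 99.9% SLO used there, and all function names are hypothetical.

```python
WINDOWS = ("24h", "7d", "30d")


def uptime_query(instance: str, window: str) -> str:
    """PromQL for percentage uptime of one probed surface over a window."""
    return f'avg_over_time(probe_success{{instance="{instance}"}}[{window}]) * 100'


def first_value(api_response: dict):
    """Pull the number out of a Prometheus instant-query (vector) response."""
    if api_response.get("status") != "success":
        return None
    result = api_response.get("data", {}).get("result", [])
    if not result:
        return None
    # Each sample is [unix_timestamp, "value-as-string"].
    return float(result[0]["value"][1])


def error_budget_remaining(uptime_pct: float, slo_pct: float = 99.9) -> float:
    """Fraction of the error budget left, clamped at 0."""
    allowed = 100 - slo_pct
    used = 100 - uptime_pct
    return max(0.0, 1 - used / allowed)
```

Each query string would be sent to Prometheus's /api/v1/query endpoint; 99.95% uptime against a 99.9% SLO leaves about half of the budget.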
G17. Multi-vantage synthetics — ✅ box live; 🟢/🟡 external one-step
No single-VM blind spot: the same surfaces are probed from several vantages,
results land under job=synthetic-edge (external) and synthetic-journey (box),
viewable in Synthetics — Vantages.
| Vantage | What | Activate |
|---|---|---|
| hel1-box | k6 multi-step journey (synthetics/journey.js, k6-journey service) | live |
| br-selfhosted | GitHub self-hosted runners (Brazil, residential → reaches origin) push via synthetic-multivantage.yml | set repo secret SYNTHETIC_PUSH_TOKEN (done) |
| GRU (colo) | Cloudflare edge worker (synthetics/edge-worker/), runs inside CF, reaches the CF-fronted origin | live (deployed; repo secrets CLOUDFLARE_API_TOKEN/_ACCOUNT_ID drive edge-synthetic-deploy.yml) |
Cloudflare managed challenge: the box is a datacenter IP and gets a 403 "Just a moment" (cf-mitigated: challenge) on CF-fronted surfaces. This is Bot Fight Mode (free plan), which cannot be skipped by a WAF rule on free. So the box journey classifies CF surfaces as challenged (not down) and verifies non-CF surfaces (F2) for real. The edge worker is the answer: it runs inside Cloudflare, so it reaches the origin (verified: app/tlsstress → 200 from colo GRU). To ALSO let the datacenter box reach the origin, add a CF IP Access Rule → Allow for the box IP (Dashboard → Security → WAF → Tools, or a token with Firewall Services:Edit); Allow rules skip Bot Fight Mode.
Push ingress: Caddy /synthetic-push → Pushgateway, gated by Authorization: Bearer $SYNTHETIC_PUSH_TOKEN; Prometheus scrapes with honor_labels so region/target survive.
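The "challenged, not down" rule is the part most worth getting right when adding a new vantage. The real journey is a k6 script (synthetics/journey.js); the sketch below only restates the classification in Python, with a hypothetical function name, and treating any 2xx/3xx as "up" is an assumption of this example.

```python
def classify(status: int, headers: dict) -> str:
    """Classify a probe result as 'up', 'challenged' or 'down'."""
    # HTTP header names are case-insensitive, so normalise them first.
    h = {k.lower(): v for k, v in headers.items()}
    if status == 403 and h.get("cf-mitigated", "").lower() == "challenge":
        # Cloudflare challenged the datacenter IP; the origin may be fine.
        return "challenged"
    if 200 <= status < 400:
        return "up"
    return "down"
```

A plain 403 without the cf-mitigated header still counts as down, so a genuine origin-side denial is not masked by the challenge handling.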
G18. Security/log alerting (Loki ruler) — ✅ LIVE
The Loki ruler evaluates LogQL alert rules over the SIEM stream and fires to the
same Alertmanager → Slack. Rules: loki/rules/fake/security.yml (tenant is
fake — auth_enabled: false). Live: auth-failure spike/flood, AWS root
activity, unauthorized burst, IAM/access-key/security-group change.
- Add a rule: append to security.yml (Prometheus-style alert rule with a
LogQL expr), docker compose restart loki. Confirm:
curl -H "X-Scope-OrgID: fake" loki:3100/loki/api/v1/rules.
G19. Deploy markers (CI → Grafana annotations) — ✅ LIVE
Each customer-app / admin-console App Runner deploy posts a Grafana annotation
(tags deploy,<app>) via the composite action .github/actions/grafana-annotate,
so a latency/error regression lines up with the release that caused it. Token:
repo secret GRAFANA_ANNOTATION_TOKEN (a Grafana Editor service-account token).
- Annotate from another workflow: add a final step
uses: ./.github/actions/grafana-annotate with app, version, token.
- Rotate the token: Grafana → Administration → Service accounts →
ci-deploy-annotator → add token → update the repo secret.
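For debugging a missing marker outside CI, the same kind of annotation can be posted by hand through Grafana's annotations HTTP API (POST /api/annotations). The sketch below is not the composite action's code: the text format is made up, the function names are hypothetical, and the Grafana base URL is left to the caller.

```python
import json
import time
import urllib.request


def build_annotation(app: str, version: str, now_ms=None) -> dict:
    """Payload for a deploy marker tagged deploy,<app>."""
    return {
        # Grafana expects epoch milliseconds.
        "time": now_ms if now_ms is not None else int(time.time() * 1000),
        "tags": ["deploy", app],
        "text": f"Deployed {app} {version}",
    }


def post_annotation(grafana_url: str, token: str, payload: dict) -> dict:
    """Send the annotation using a service-account token."""
    req = urllib.request.Request(
        grafana_url.rstrip("/") + "/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)
```

Without a dashboardUID in the payload the annotation is organisation-wide, so any dashboard with an annotation query on the deploy tag will show it.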
G20. Predictive alerts — ✅ LIVE
The predictive group (prometheus/alerts.yml) alerts on the trend before the
threshold: HostDiskWillFillSoon (predict_linear → 0 within ~4h),
HostMemoryWillExhaust (~2h), TLSCertRenewalDue (<30d, gentler than the 14d
critical path). Add one with predict_linear(<gauge>[<window>], <horizon_s>) < 0.
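To build intuition for what predict_linear(...) < 0 is testing, here is the underlying idea in plain Python: fit a least-squares line through the samples in the window and extrapolate it horizon_s seconds ahead. This is a simplified re-statement for reasoning about windows and horizons, not Prometheus's implementation, and the sample numbers in the usage note are invented.

```python
def predict_linear(samples, horizon_s, now=None):
    """samples: list of (timestamp_s, value) pairs inside the window.

    Returns the fitted value horizon_s seconds after `now`
    (default: the timestamp of the last sample).
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    now = samples[-1][0] if now is None else now
    n = len(samples)
    # Centre timestamps on `now` so the intercept is the fitted value at `now`.
    xs = [t - now for t, _ in samples]
    ys = [v for _, v in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("all samples share one timestamp")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    intercept = mean_y - slope * mean_x
    return intercept + slope * horizon_s
```

With free space falling from 1000 to 400 units over 60 seconds, `predict_linear([(0, 1000), (30, 700), (60, 400)], 60)` is negative, so a "< 0" rule would fire. A longer window smooths out bursts; a longer horizon warns earlier at the cost of more false positives.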
G21. Off-box backup → Cloudflare R2 ($0, 10 GB free) — ✅ LIVE
Off-site DR copy of everything not in git/Secrets Manager. The backup-r2
service (exporters/backup_r2.py) runs daily and uploads to r2://$R2_BUCKET/:
obs-box/ (Hetzner state volumes) · cloudflare/ (DNS/WAF/Workers/page-rules) ·
stripe/ (full RO export) · aws/ (Terraform state + secrets manifest) · tbi/
(client images mirrored from S3) · github/ (3 repos as full-history bundles) ·
auth0/ (tenant config). Size-aware retention (age cap KEEP_DAILY + hard
R2_MAX_BYTES ≈ 9 GB + a R2_TBI_MAX_BYTES TBI cap) means it can never exceed the
free tier (~1.2 GB used). A healthchecks.io dead-man's switch pages on a missed/failed
run. The RDS pg_dump can't run on the box (RDS is private) → it runs as a
scheduled in-VPC container Lambda obs-rds-backup-r2 (backup-rds-lambda/ +
exporters/backup_aws_r2.py, EventBridge every 6h) → aws/<date>/rds-<db>-<HHMMSSZ>.sql.gz
for every DB (postgres + octopus_admin). The whole R2 bucket is immutable
(R2 Bucket Lock, 7d — anti-ransomware: a stolen upload credential can't delete/encrypt
backups), and per-source success/freshness is tracked (backup_r2_source_* +
R2SourceBackup* alerts). Restore: DR-RESTORE-RUNBOOK.
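The size-aware retention described above (an age cap plus a hard byte cap) can be pictured as a two-pass selection. The sketch below is an illustration of that idea only, not backup_r2.py: it assumes KEEP_DAILY is a number of days, ignores the separate TBI cap, and uses a hypothetical Obj record in place of real R2 listings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    key: str
    size: int        # bytes
    age_days: float


def select_for_deletion(objects, keep_daily: float, max_bytes: int):
    """Pass 1: everything older than the age cap.
    Pass 2: oldest survivors, until the remainder fits max_bytes."""
    doomed = [o for o in objects if o.age_days > keep_daily]
    # Newest first, so pop() always removes the oldest survivor.
    kept = sorted(
        (o for o in objects if o.age_days <= keep_daily),
        key=lambda o: o.age_days,
    )
    total = sum(o.size for o in kept)
    while kept and total > max_bytes:
        victim = kept.pop()
        total -= victim.size
        doomed.append(victim)
    return doomed
```

Because the byte cap is enforced after the age cap, the newest backups are the last to go even when a single run is unusually large.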
Credential scopes (least-privilege). Each source uses a dedicated read-only credential: R2 (derived from the CF token), a read-only AWS IAM user (obs-backup-ro) for S3/SM, a read-only GitHub PAT, a read-only Stripe restricted key, a read-only Auth0 M2M app, and the CF token for DNS/WAF/Workers. Two items need broader read scopes for full coverage (the backup is fail-soft, so the rest still runs): the CF token needs Zone Settings:Read for zone_settings, and the Auth0 M2M app needs read:clients, read:tenant_settings, read:organizations, read:branding, read:prompts (today it has connections/resource-servers/actions/rules/roles).
Activation (one time):
1. Enable R2 on the Cloudflare account (dashboard → R2 → Get Started; free tier =
$0 but requires a card on file). Until enabled, the per-account S3 endpoint
<account>.r2.cloudflarestorage.com does not serve TLS.
2. Create an R2 API Token (Admin Read & Write) → Access Key ID + Secret. (For a
CF API token with R2 perms, the S3 creds are derivable: access_key_id = token
id, secret_access_key = SHA-256 of the token value.)
3. .env: R2_ENDPOINT (https://<account>.r2.cloudflarestorage.com),
R2_ACCESS_KEY_ID, R2_SECRET_ACCESS_KEY, R2_BUCKET, BACKUP_HEALTHCHECK_URL,
CF_BACKUP_TOKEN/CF_ACCOUNT_ID/CF_ZONE_ID.
4. docker compose up -d --force-recreate backup-r2, then a one-shot dry run:
docker compose exec -T backup-r2 python /app/backup_r2.py --once.
5. Verify objects: aws --endpoint-url $R2_ENDPOINT --region auto s3 ls --recursive
s3://$R2_BUCKET/. Create a healthchecks.io check (Period 1d / Grace 6h) and set
BACKUP_HEALTHCHECK_URL.
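The derivation mentioned in step 2 (access key id = token id, secret = SHA-256 of the token value) is a one-liner with hashlib. In this sketch the function name is hypothetical, and the lowercase hex encoding of the digest is an assumption; check it against a credential pair issued by the dashboard before relying on it.

```python
import hashlib


def r2_s3_credentials(token_id: str, token_value: str) -> dict:
    """Derive the S3-style credential pair from a Cloudflare API token."""
    return {
        "R2_ACCESS_KEY_ID": token_id,
        # Assumption: the secret is the hex-encoded SHA-256 of the token value.
        "R2_SECRET_ACCESS_KEY": hashlib.sha256(
            token_value.encode("utf-8")
        ).hexdigest(),
    }
```

The two values map directly onto the R2_ACCESS_KEY_ID and R2_SECRET_ACCESS_KEY entries in .env from step 3.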
boto3 ≥1.36 adds request checksums some S3-compatibles reject — if uploads fail with a checksum error, set
AWS_REQUEST_CHECKSUM_CALCULATION=when_required.
G22. App DB readiness + self-healing — ✅ LIVE
Closes the 2026-06-16 blind spot where a DB outage hid behind a graceful /api/health
200 (see RUNBOOK §6.1). Three layers:
- Deep readiness: each app exposes /api/ready (SELECT 1 per DB → 503 if any is unreachable), distinct from the shallow /api/health liveness. On the admin it must be in proxy.ts PUBLIC_PATHS (the proxy gates all /api/*).
- Detect: Prometheus blackbox-ready probes both /api/ready URLs; alert AppDBDown (probe_success{job="blackbox-ready"} == 0) → Slack.
- Self-heal: obs-db-selfheal Lambda (selfheal-lambda/, no-VPC, EventBridge schedule) redeploys the App Runner service on a 503 → re-fetches the DB secret + rebuilds the pool; heals only on 503 (404/403/timeout → no-op), cooldown-guarded.
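The self-heal guard in the third layer (redeploy only on 503, and not again within a cooldown) reduces to a small pure function. This is a hedged sketch rather than the Lambda's code: the function name is hypothetical, the 15-minute cooldown is an invented default, and a timeout is modelled as a status of None.

```python
from typing import Optional


def should_redeploy(
    status: Optional[int],
    now_s: float,
    last_heal_s: Optional[float],
    cooldown_s: float = 900,
) -> bool:
    """Decide whether a /api/ready probe result warrants a redeploy."""
    if status != 503:
        # 200 is healthy; 404/403/timeout (None) are deliberately no-ops,
        # since a redeploy would not fix routing or auth problems.
        return False
    if last_heal_s is not None and now_s - last_heal_s < cooldown_s:
        # A heal is already in flight or just finished; avoid a redeploy loop.
        return False
    return True
```

Keeping the decision separate from the App Runner call makes the policy easy to test without touching AWS.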
Test it: point an app at a bad DB credential (or stop RDS in a test env) →
curl -s https://<app>.tlsstress.art/api/ready returns 503 → within a few minutes
AppDBDown fires and the reconciler triggers a redeploy. Prevention: apps connect as
dedicated non-master roles (octopus_admin_app / tlsstress_app), so RDS master-secret
rotation never breaks them — never set an app DATABASE_URL to the master.