Incident Response Runbook
The single, authoritative procedure for handling production incidents on tlsstress.art, from the first page to the postmortem. It pairs with the observability operations manual (docs/observability/RUNBOOK.md) and the alert rules in observability/cloud/prometheus/alerts.yml, observability/cloud/alertmanager/alertmanager.yml, and k8s/74-token-ledger-prometheus-rules.yaml.
Scope. This runbook covers the money / provisioning / data path: customer
signup → Stripe payment → tenant provisioning → token-economy ledger →
on-prem operator dashboard, plus the platform that observes them. It does not
replace component-specific chaos drills (chaos-*.md) or the DR drill
(dr-drill.md); it tells you how to classify, triage, mitigate, communicate
and learn from an incident across all of them.
Golden rule. Mitigate the customer-facing impact first, find root cause second. A paid customer with no tenant, a corrupted ledger, or a database the app cannot reach are all SEV-1 until proven otherwise.
1. Severity classification
Severity is set by customer/financial impact, not by which component broke. Set it within the first 5 minutes; it can be raised or lowered as you learn more.
| Severity | Definition | Examples | Page? | Comms cadence |
|---|---|---|---|---|
| SEV-1 | Money lost/at risk, data corruption, or a paid customer materially broken | ProvisioningJobsStuck (paid, no tenant), TokenLedgerOverspentUtxo (ledger corruption), AppDBDown (DB unreachable), RDSStorageCritical, R2BucketPublic | Yes (critical pager) | Every 30 min |
| SEV-2 | Degraded but no money lost yet; a leading indicator of a SEV-1 | RDSConnectionsHigh, RDSStorageLow, TokenLedgerStaleBillingWebhooks, StripeFailedPaymentsSpike, DashboardDBDown | Yes if overnight | Every 2 h |
| SEV-3 | Localized / cosmetic / single-vantage; no customer impact | JourneyVantageChallenged, CloudflareThreatSpike (info), RUMPoorLCP, a single synthetic vantage flaky | No | At resolution |
Severity is a label on the alert too. severity="critical" in Prometheus
routes to the pager receiver (hourly re-page + Slack @channel);
warning/info land in the #tlsstress-observability firehose. See
alertmanager.yml.
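The classification rule can be sketched as a small decision function. This is an illustrative sketch only: the `Impact` fields and the `classify` helper are hypothetical names, not part of the platform; the paging and cadence values are copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impact:
    """Hypothetical summary of what triage has established so far."""
    money_at_risk: bool = False         # money lost or at risk
    data_corrupted: bool = False        # e.g. ledger corruption
    paid_customer_broken: bool = False  # a paid customer is materially broken
    degraded: bool = False              # degraded; a leading indicator of a SEV-1


# Paging and comms cadence per severity, as listed in the table above.
POLICY = {
    "SEV-1": {"page": "yes (critical pager)", "comms": "every 30 min"},
    "SEV-2": {"page": "yes if overnight", "comms": "every 2 h"},
    "SEV-3": {"page": "no", "comms": "at resolution"},
}


def classify(impact: Impact) -> str:
    """Classify by customer/financial impact, not by which component broke."""
    if impact.money_at_risk or impact.data_corrupted or impact.paid_customer_broken:
        return "SEV-1"
    if impact.degraded:
        return "SEV-2"
    return "SEV-3"
```

Because severity can be raised or lowered as you learn more, a function like this would be re-evaluated whenever the impact picture changes rather than called once.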
2. The on-call flow (triage → mitigate → resolve → postmortem)
```mermaid
flowchart TD
    A([Alert fires / report received]) --> B{Acknowledge<br/>within 5 min}
    B -->|critical pager| C[Open the incident:<br/>set SEV, start a thread]
    B -->|Slack firehose| C
    C --> D{Classify severity<br/>by money/customer impact}
    D -->|SEV-1| E[Declare incident:<br/>IC + scribe + comms]
    D -->|SEV-2/3| F[Single responder]
    E --> G[TRIAGE: read the alert's<br/>runbook_url + golden dashboards]
    F --> G
    G --> H{Root cause known<br/>or mitigation obvious?}
    H -->|No| I[Escalate to domain owner<br/>see §6 escalation]
    I --> G
    H -->|Yes| J[MITIGATE: stop the bleeding<br/>self-heal / rollback / scale]
    J --> K{Customer impact<br/>stopped?}
    K -->|No| L[Raise severity,<br/>widen escalation]
    L --> J
    K -->|Yes| M[RESOLVE: verify alert cleared,<br/>data reconciled, no backlog]
    M --> N{SEV-1 or SEV-2?}
    N -->|Yes| O[POSTMORTEM<br/>within 5 business days]
    N -->|No| P([Close incident])
    O --> Q[Action items tracked<br/>to completion]
    Q --> P
```
2.1 Triage (first 10 minutes)
- Acknowledge the page so the team knows it is owned (silence the pager re-page).
- Read the alert. Every critical money/DB alert carries a runbook_url; follow it. The summary and description annotations are written to tell you what broke and the first action.
- Open the golden dashboards (…/grafana): cloud-overview, business-revenue, aws-infra, and the operator cockpit for DUT-side incidents. Confirm the alert is real (not a single flaky vantage) and find the blast radius.
- Check for a louder signal. A critical often comes with correlated warnings (e.g. RDSConnectionsHigh → AppDBDown). The inhibit_rules in Alertmanager suppress the derived warnings, so trust the critical.
2.2 Mitigate
Stop customer impact with the fastest safe lever, even if it is temporary:
self-heal (let obs-db-selfheal redeploy), roll back the last deploy, scale the
resource, or feature-flag the broken path. Document every action in the thread —
the postmortem needs the timeline.
2.3 Resolve
An incident is resolved only when all of these hold:
- The firing alert has cleared (and stays clear through one repeat_interval).
- Customer-facing impact is gone (re-run the relevant synthetic / journey).
- Data is reconciled — no stuck provisioning job, no over-spent UTXO, no
unprocessed billing webhook backlog.
- The mitigation is durable, or a follow-up to make it durable is filed.
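The four criteria read naturally as a checklist that must be fully satisfied before closing. A minimal sketch, assuming a hypothetical `ResolutionState` record filled in by the responder (none of these field names exist in the platform):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResolutionState:
    """Hypothetical snapshot of the four resolution criteria."""
    alert_clear_for_repeat_interval: bool
    customer_impact_gone: bool
    stuck_provisioning_jobs: int
    overspent_utxos: int
    unprocessed_billing_webhooks: int
    mitigation_durable: bool
    followup_filed: bool = False


def unmet_criteria(state: ResolutionState) -> list:
    """Return the criteria that still block resolution (empty list = resolved)."""
    unmet = []
    if not state.alert_clear_for_repeat_interval:
        unmet.append("alert not clear through one repeat_interval")
    if not state.customer_impact_gone:
        unmet.append("customer-facing impact remains")
    if (state.stuck_provisioning_jobs
            or state.overspent_utxos
            or state.unprocessed_billing_webhooks):
        unmet.append("data not reconciled")
    if not (state.mitigation_durable or state.followup_filed):
        unmet.append("mitigation not durable and no follow-up filed")
    return unmet
```

Returning the list of unmet criteria, rather than a single boolean, gives the incident thread something concrete to post when someone asks why the incident is still open.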
2.4 Postmortem
Every SEV-1 and SEV-2 gets a blameless postmortem within 5 business days (template in §8). SEV-3s get a one-line note in the incident thread.
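The 5-business-day deadline can be computed mechanically. The sketch below makes two assumptions the runbook does not state: the clock starts on the incident date, and only Saturdays and Sundays are skipped (public holidays are ignored). `postmortem_due` is a hypothetical helper name.

```python
from datetime import date, timedelta


def postmortem_due(incident_date: date, business_days: int = 5) -> date:
    """Return the date `business_days` working days after `incident_date`.

    Assumption: weekends are Saturday/Sunday; public holidays are ignored.
    """
    due = incident_date
    remaining = business_days
    while remaining > 0:
        due += timedelta(days=1)
        if due.weekday() < 5:  # Monday=0 .. Friday=4
            remaining -= 1
    return due
```

For an incident on Tuesday 2026-06-16 this yields the following Tuesday, 2026-06-23.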
3. Key money / provisioning / data alerts
These are the alerts that, when they fire, mean a customer is or may be paying
for nothing — or that the financial ledger is wrong. Treat them as SEV-1 unless
triage proves otherwise. Source of truth for the ledger alerts:
k8s/74-token-ledger-prometheus-rules.yaml.
3.1 ProvisioningJobsStuck: paid customer, no tenant (SEV-1)
- Expr: tlsstress_provisioning_stuck > 0 for 30m · severity: critical.
- Means: one or more provisioning jobs are stuck (>15 min without progress, or ≥8 attempts). A customer who paid did not receive their tenant. This is the worst class of incident: silent money loss with an unhappy customer.
- First actions:
  - Read the alert's runbook_url (RUNBOOK#provisioning-stuck).
  - Check the Temporal orchestrator for the failed workflow(s) and the reconcile-provisioning cron: is it running and making progress?
  - Identify the affected customer(s) and the failing saga step. Do not blindly re-drive: a naive re-drive can mint a second dpl_deployment or zero out a paid quota. Re-drive only the failed step.
  - If you cannot provision within the SLA, send comms to the customer (§7) and prepare a refund/credit path with finance.
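The "stuck" definition and the re-drive rule can be sketched as follows. This is not the platform's reconcile code: `ProvisioningJob`, `is_stuck`, `redrive_plan`, and the step name used in the usage note are hypothetical; only the two thresholds come from the definition above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

STUCK_AFTER = timedelta(minutes=15)  # ">15 min without progress"
MAX_ATTEMPTS = 8                     # "≥8 attempts"


@dataclass(frozen=True)
class ProvisioningJob:
    """Hypothetical view of one provisioning job."""
    job_id: str
    last_progress_at: datetime
    attempts: int
    failed_step: Optional[str] = None


def is_stuck(job: ProvisioningJob, now: datetime) -> bool:
    """A job is stuck if it has not progressed recently or has retried too often."""
    return (now - job.last_progress_at) > STUCK_AFTER or job.attempts >= MAX_ATTEMPTS


def redrive_plan(job: ProvisioningJob) -> list:
    """Re-drive only the failed step; refuse when the failing step is unknown."""
    if job.failed_step is None:
        raise ValueError(f"{job.job_id}: identify the failing saga step before re-driving")
    return [job.failed_step]
```

Raising on an unknown step is a deliberate choice in this sketch: it forces the responder to identify the failing step first, instead of falling back to a full re-drive that could mint a second dpl_deployment. For example, a job whose `failed_step` is a hypothetical `"create_tenant"` produces a plan containing only that step.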
3.2 AppDBDown: the app cannot reach its database (SEV-1)
- Expr: probe_success{job="blackbox-ready"} == 0 for 2m · severity: critical.
- Means: the deep /api/ready probe (which runs SELECT 1 on every pool) is failing → the app is up but its DB is unreachable, or the app is down. This is the alert the 2026-06-16 admin outage lacked: the app served a graceful HTTP 200 from the shallow /api/health while its DB was unreachable, so nothing fired for ~33 hours.
- First actions (full RCA in RUNBOOK #61-db-unreachable):
  - curl -s https://<app>.tlsstress.art/api/ready | jq to see which DB check is ok:false.
  - obs-db-selfheal already redeploys the App Runner service on a 503 (re-fetches the secret + rebuilds the pool). Watch the deployment; if it recovers, you are done.
  - Persists? Verify RDS is available (aws rds describe-db-instances) and that the app's DATABASE_URL authenticates as its dedicated non-master role (octopus_admin_app / tlsstress_app), never the rotated master tlsstress_admin. RDS rotates the managed master secret every ~7 days; an app pointed at the master will break on every rotation.
The 2026-06-16 lesson, encoded. Apps now connect as dedicated non-master roles, /api/ready is a deep probe, AppDBDown fires in ~2 min, and obs-db-selfheal redeploys automatically. If you ever see "DB indisponível" in an app but blackbox is green, you are looking at a shallow health check: probe /api/ready, not /api/health.
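When the /api/ready response is large, the failing checks can be picked out programmatically. The runbook does not document the response schema, so the sketch below assumes only that each check is a JSON object carrying an `ok` field somewhere in the payload; the sample payload and the `failing_checks` helper are hypothetical.

```python
import json


def failing_checks(payload, path=""):
    """Return dotted paths of every nested object whose "ok" field is false.

    The top-level object is skipped so that an overall "ok": false does not
    hide which individual check failed.
    """
    failures = []
    if isinstance(payload, dict):
        if payload.get("ok") is False and path:
            failures.append(path)
        for key, value in payload.items():
            child = f"{path}.{key}" if path else key
            failures.extend(failing_checks(value, child))
    elif isinstance(payload, list):
        for index, value in enumerate(payload):
            failures.extend(failing_checks(value, f"{path}[{index}]"))
    return failures


# Hypothetical response body; the real /api/ready shape may differ.
sample = json.loads(
    '{"ok": false, "checks": {"app_db": {"ok": false}, "ledger_db": {"ok": true}}}'
)
print(failing_checks(sample))
```

Walking the structure recursively keeps the helper usable even if the real payload nests its checks differently from the sample.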
3.3 Ledger integrity: the token-economy alerts (mixed SEV)
The token-economy ledger is the financial source of truth. Its alerts live in
k8s/74-token-ledger-prometheus-rules.yaml and scrape
app.tlsstress.art/api/metrics (bearer from AWS Secrets Manager
tlsstress-phase0/metrics-token).
| Alert | Expr | Sev | Means + first action |
|---|---|---|---|
| TokenLedgerOverspentUtxo | tlsstress_utxo_overspent > 0 (1m) | critical | Ledger corruption: a note with spent_amount > amount. The DB CHECK makes this impossible, so a non-zero value is a deep bug. Freeze redemptions, investigate immediately, do not let it propagate. |
| ProvisioningJobsStuck | tlsstress_provisioning_stuck > 0 (30m) | critical | See §3.1: paid customer with no tenant. |
| TokenLedgerScrapeDown | absent(tlsstress_tsu_circulating) (10m) | warning | No ledger metrics for 10m: METRICS_TOKEN rotated without updating the mounted secret, or /api/metrics is down. You are blind to all ledger alerts until fixed. |
| TokenLedgerStaleBillingWebhooks | tlsstress_webhooks_stale_billing > 0 (30m) | warning | Stripe billing webhooks stuck non-terminal >1h: a credit may not have applied (customer paid, balance not topped up). Check the webhook processor + Stripe dashboard. |
| TokenLedgerStaleActiveTickets | tlsstress_tickets_stale_active > 0 (30m) | warning | Tickets ACTIVE past expiry+24h: the tlsstress-expire-tickets-hourly sweep is sick. |
| TokenLedgerOutboxBacklog | tlsstress_outbox_pending > 100 (30m) | warning | Webhook outbox piling up (>100 undelivered): check deliver-webhooks + receiver health. |
| TokenLedgerUsageWithoutLiveness | tlsstress_usage_without_liveness > 0 (1h) | warning | A license reported usage in 24h without a heartbeat: possible token tamper / clock-skew / copied-token replay. Billing is covered by L2; this is an integrity signal. |
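For a quick manual check against a metrics snapshot, the thresholds in the table can be evaluated in a few lines. This is a sketch, not the Prometheus rules themselves: it ignores the `for` durations, and `breached` plus the dict-shaped snapshot are hypothetical conveniences.

```python
# (metric, threshold, alert, severity); thresholds copied from the table above.
RULES = [
    ("tlsstress_utxo_overspent", 0, "TokenLedgerOverspentUtxo", "critical"),
    ("tlsstress_provisioning_stuck", 0, "ProvisioningJobsStuck", "critical"),
    ("tlsstress_webhooks_stale_billing", 0, "TokenLedgerStaleBillingWebhooks", "warning"),
    ("tlsstress_tickets_stale_active", 0, "TokenLedgerStaleActiveTickets", "warning"),
    ("tlsstress_outbox_pending", 100, "TokenLedgerOutboxBacklog", "warning"),
    ("tlsstress_usage_without_liveness", 0, "TokenLedgerUsageWithoutLiveness", "warning"),
]


def breached(snapshot: dict) -> list:
    """Return (alert, severity) pairs whose threshold is exceeded in `snapshot`.

    If tlsstress_tsu_circulating is absent, the scrape is down and every other
    ledger signal is unknown, so only TokenLedgerScrapeDown is reported.
    """
    if "tlsstress_tsu_circulating" not in snapshot:
        return [("TokenLedgerScrapeDown", "warning")]
    return [
        (alert, severity)
        for metric, threshold, alert, severity in RULES
        if snapshot.get(metric, 0) > threshold
    ]
```

An empty result only means something when the scrape is alive, which is why the absence check runs before any threshold is evaluated.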
3.4 RDS saturation: page before the 503 (SEV-2 → SEV-1)
AppDBDown only fires once the DB is fully unreachable. The aws-rds-metrics
exporter pages on saturation first (thresholds tuned for db.t4g.micro,
~112 max connections, 1 GB RAM — re-tune on resize):
- RDSConnectionsHigh (> 90, warning): pool exhaustion approaching; new connections are refused before /api/ready flips. Check PgBouncer / a connection leak.
- RDSStorageLow (< 2 GB, warning) / RDSStorageCritical (< 1 GB, critical): RDS hard-stops writes when storage runs out. Grow allocated storage now: aws rds modify-db-instance --allocated-storage.
- RDSCPUHigh (> 85%, warning) / RDSMemoryLow (< 100 MB, warning): load pressure / OOM risk.
- RDSMetricsBlind (rds_metrics_scrape_ok == 0, warning): the exporter cannot read CloudWatch (expired RO key / IAM / throttling). All RDS alerts above are silent until this recovers.
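The saturation thresholds can be sanity-checked by hand with a small helper. This sketch uses the db.t4g.micro numbers quoted above and hypothetical function and parameter names; unlike the real rules, it reports only the more severe storage tier when both would match.

```python
MAX_CONNECTIONS = 112  # approximate limit quoted above for db.t4g.micro


def rds_alerts(connections, free_storage_gb, cpu_percent, freeable_memory_mb,
               scrape_ok=True):
    """Return the names of the RDS alerts these readings would breach."""
    if not scrape_ok:
        # Exporter cannot read CloudWatch: every other RDS alert is silent.
        return ["RDSMetricsBlind"]
    alerts = []
    if connections > 90:
        alerts.append("RDSConnectionsHigh")
    if free_storage_gb < 1:
        alerts.append("RDSStorageCritical")
    elif free_storage_gb < 2:
        alerts.append("RDSStorageLow")
    if cpu_percent > 85:
        alerts.append("RDSCPUHigh")
    if freeable_memory_mb < 100:
        alerts.append("RDSMemoryLow")
    return alerts


def connection_headroom(connections, max_connections=MAX_CONNECTIONS):
    """Connections left before the instance limit is reached."""
    return max_connections - connections
```

At the warning threshold of 90 connections, roughly 22 remain before the instance limit, which is the window in which to find a leak or fix pooling. After a resize, both `MAX_CONNECTIONS` and the thresholds would need re-tuning.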
3.5 DUT-side operator cockpit: DashboardDown / DashboardDBDown (SEV-2)
These fire on the on-prem Prometheus, not the cloud box
(observability/prometheus/alerts/web-agent-alerts.yml):
- DashboardDown (up{job="dashboard"} == 0, critical): the operator cockpit process/scrape is unreachable. Per CLAUDE.md, the dashboard is the only operator interface, so this blinds the operator.
- DashboardDBDown (dashboard_db_up == 0, critical): the cockpit is up but its Postgres is unreachable (the 2026-06-16 silent-DB failure class, mirrored DUT-side). The cockpit shows frozen data and a red "sem dados (DB)" live badge. Check the on-prem PgBouncer / Postgres; the dashboard /api/ready is the deep probe.
3.6 Business / payment signals (SEV-2/3)
From the cloud business group (Stripe RO poller):
- StripeFailedPaymentsSpike (> 5 today, warning) — leading indicator of a
card/processor issue or fraud.
- StripeOpenDisputes (> 0, warning) — chargebacks need a human response
inside Stripe's window.
4. The dead-man's-switch (who watches the watcher)
Two meta-alerts keep the platform honest:
- Watchdog (vector(1), always firing) routes to the deadman receiver → an external healthchecks.io heartbeat. When the heartbeat stops arriving, the external service pages you, i.e. when the whole obs stack is down and can no longer alert. This is the only alert you want to keep firing.
- ObsComponentDown (up{job=~"grafana|loki|tempo|vector|prometheus|…"} == 0, critical): an obs component itself is down; the platform may be partially blind.
If you stop getting any alerts and the box looks fine, suspect the alerting path (Slack webhook rotated, Alertmanager wedged) before assuming all is well.
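The dead-man's-switch inverts the usual alert: silence is the failure signal. A minimal sketch of the check an external heartbeat monitor performs, where the period and grace values are illustrative assumptions rather than the configured ones:

```python
from datetime import datetime, timedelta, timezone


def heartbeat_overdue(last_ping: datetime, now: datetime,
                      period: timedelta = timedelta(minutes=1),
                      grace: timedelta = timedelta(minutes=5)) -> bool:
    """True when no heartbeat arrived within the expected period plus grace.

    `period` and `grace` are illustrative defaults, not the real settings.
    """
    return now - last_ping > period + grace
```

The grace window trades detection speed for noise: a short one pages on a single delayed heartbeat, a long one extends the time the platform can be down unnoticed.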
5. First-response cheat sheet
```bash
# Cloud obs box (Hetzner)
ssh -i ~/.ssh/tlsstress_f2_hetzner root@89.167.3.1
cd /opt/obs && docker compose ps        # is the stack healthy?
docker compose logs -f alertmanager     # is alerting flowing?

# Is an app's DB actually reachable? (deep probe; the 2026-06-16 lesson)
curl -s https://app.tlsstress.art/api/ready | jq
curl -s https://admin.tlsstress.art/api/ready | jq

# Ledger metrics alive? (bearer in AWS SM tlsstress-phase0/metrics-token)
#   tlsstress_tsu_circulating absent  → TokenLedgerScrapeDown
#   tlsstress_provisioning_stuck > 0  → ProvisioningJobsStuck
#   tlsstress_utxo_overspent > 0      → TokenLedgerOverspentUtxo (corruption)

# RDS state when AppDBDown / RDS* fires
aws rds describe-db-instances --query 'DBInstances[].DBInstanceStatus'
```
6. Escalation
Escalate when: you cannot mitigate within the first 30 minutes of a SEV-1, the blast radius is growing, the fix needs a privileged action you do not own, or money/data integrity is in doubt.
| Domain | Signal | Owner / path |
|---|---|---|
| Provisioning / saga | ProvisioningJobsStuck | Temporal orchestrator owner; reconcile-provisioning cron; do not blindly re-drive |
| Database / RDS | AppDBDown, RDS*, DashboardDBDown | DB on-call; in-VPC DDL via the NAT instance (SSM): open the temp SG ingress to sg-02bf33572b96f2855 + temp secret read, remove both after |
| Money / ledger | TokenLedger*, Stripe* | Finance + ledger owner; freeze redemptions on corruption |
| Edge / Cloudflare | EdgeVantageDown, CloudflareThreatSpike | Edge owner; WAF / DNS |
| Backups / DR | R2Backup*, R2BucketPublic, R2BucketLockDisabled | DR owner; R2BucketPublic is a security SEV-1, close public access now |
The pager. critical alerts re-page hourly and @channel Slack. For guaranteed overnight phone paging, wire the pre-built slot in alertmanager.yml (ntfy.sh $0, PagerDuty free tier, or SMTP); see RUNBOOK §6.3. Until then a 3 a.m. critical relies on someone watching Slack.
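The escalation table is effectively a pattern-to-owner lookup. A sketch using glob-style patterns, with alert names spelled as they appear in the alert sections of this runbook; `escalation_domain` is a hypothetical helper, not part of Alertmanager:

```python
from fnmatch import fnmatchcase

# (alert-name pattern, escalation domain), mirroring the table above.
ESCALATION = [
    ("ProvisioningJobsStuck", "Provisioning / saga"),
    ("AppDBDown", "Database / RDS"),
    ("RDS*", "Database / RDS"),
    ("DashboardDBDown", "Database / RDS"),
    ("TokenLedger*", "Money / ledger"),
    ("Stripe*", "Money / ledger"),
    ("EdgeVantageDown", "Edge / Cloudflare"),
    ("CloudflareThreatSpike", "Edge / Cloudflare"),
    ("R2Backup*", "Backups / DR"),
    ("R2BucketPublic", "Backups / DR"),
    ("R2BucketLockDisabled", "Backups / DR"),
]


def escalation_domain(alert_name: str):
    """Return the owning domain for an alert, or None if no row matches."""
    for pattern, domain in ESCALATION:
        if fnmatchcase(alert_name, pattern):
            return domain
    return None
```

A `None` result is worth treating as a gap in the table: an alert with no owner has no escalation path.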
7. Communications
| Audience | When | Channel | Content |
|---|---|---|---|
| Internal (incident thread) | At declaration, then every cadence (§1) | #tlsstress-observability | SEV, impact, current hypothesis, next update time |
| Affected customer(s) | SEV-1 with customer impact (e.g. failed provisioning), as soon as scoped | Email from @tlsstress.art | What is affected, that you are on it, ETA or next update; no internal jargon, no blame |
| Status page | Customer-visible outage | https://status.tlsstress.art | Plain-language component status; update on mitigate + resolve |
Rules of thumb: one incident channel (no parallel threads); the Incident Commander owns the comms cadence; never promise a root cause before you have one; under-promise on ETAs. For provisioning/refund cases, loop in finance early.
8. Postmortem template
Copy this into docs/postmortems/YYYY-MM-DD-<slug>.md. Blameless — focus on
systems and signals, never individuals.
```markdown
# Postmortem: <short title> (<YYYY-MM-DD>)

- **Severity:** SEV-_ · **Duration:** <detect→resolve> · **Customer impact:** <who/what/$>
- **Detected by:** <alert name | customer report | manual> · **Detection lag:** <impact-start → first alert>
- **Incident Commander:** <name> · **Scribe:** <name>

## Summary
<2–3 sentences: what broke, who it affected, how it was resolved.>

## Timeline (UTC)
| Time | Event |
|---|---|
| 03:28 | <e.g. RDS master secret rotated> |
| ... | <first alert / page> |
| ... | <mitigation applied> |
| ... | <resolved> |

## Root cause
<The actual chain of causation. Use "5 whys". Distinguish the trigger from the
underlying condition that made it possible.>

## Detection
<Did the right alert fire? How fast? If detection lagged (the 2026-06-16
~33h-blind case: shallow /api/health, graceful 200, no DB alert), that gap is
itself an action item.>

## Resolution & recovery
<What stopped the bleeding. Was data reconciled (stuck jobs cleared, ledger
consistent, webhook backlog drained)?>

## What went well / what went poorly
- Went well: <e.g. self-heal redeployed automatically>
- Went poorly: <e.g. no phone pager → 4h to acknowledge>

## Action items (owner · due · tracking)
| Action | Owner | Due | Link |
|---|---|---|---|
| <e.g. add deep /api/ready alert> | | | |

## Lessons / preventions
<What systemic change prevents this *class* of incident, not just this instance.>
```
9. Related runbooks & references
- Observability operations: docs/observability/RUNBOOK.md (§6 incidents, §6.1 DB-unreachable, §6.2 RDS, §6.3 pager)
- Ledger alerts: k8s/74-token-ledger-prometheus-rules.yaml
- Cloud alert rules + routing: observability/cloud/prometheus/alerts.yml, observability/cloud/alertmanager/alertmanager.yml
- DUT-side dashboard alerts: observability/prometheus/alerts/web-agent-alerts.yml
- DR drill: docs/runbooks/dr-drill.md · Restore: docs/runbooks/restore-from-backup.md
- Architecture: docs/ARCHITECTURE.md