
Incident Response Runbook

The single, authoritative procedure for handling production incidents on tlsstress.art — from the first page to the postmortem. It pairs with the observability operations manual (docs/observability/RUNBOOK.md) and the alert rules in observability/cloud/prometheus/alerts.yml, observability/cloud/alertmanager/alertmanager.yml, and k8s/74-token-ledger-prometheus-rules.yaml.

Scope. This runbook covers the money / provisioning / data path: customer signup → Stripe payment → tenant provisioning → token-economy ledger → on-prem operator dashboard, plus the platform that observes them. It does not replace component-specific chaos drills (chaos-*.md) or the DR drill (dr-drill.md); it tells you how to classify, triage, mitigate, communicate and learn from an incident across all of them.

Golden rule. Mitigate the customer-facing impact first, find root cause second. A paid customer with no tenant, a corrupted ledger, and a database the app cannot reach are each SEV-1 until proven otherwise.


1. Severity classification

Severity is set by customer/financial impact, not by which component broke. Set it within the first 5 minutes; it can be raised or lowered as you learn more.

| Severity | Definition | Examples | Page? | Comms cadence |
|---|---|---|---|---|
| SEV-1 | Money lost/at risk, data corruption, or a paid customer materially broken | ProvisioningJobsStuck (paid, no tenant), TokenLedgerOverspentUtxo (ledger corruption), AppDBDown (DB unreachable), RDSStorageCritical, R2BucketPublic | Yes (critical pager) | Every 30 min |
| SEV-2 | Degraded but no money lost yet; a leading indicator of a SEV-1 | RDSConnectionsHigh, RDSStorageLow, TokenLedgerStaleBillingWebhooks, StripeFailedPaymentsSpike, DashboardDBDown | Yes if overnight | Every 2 h |
| SEV-3 | Localized / cosmetic / single-vantage; no customer impact | JourneyVantageChallenged, CloudflareThreatSpike (info), RUMPoorLCP, a single synthetic vantage flaky | No | At resolution |

Severity is a label on the alert too. severity="critical" in Prometheus routes to the pager receiver (hourly re-page + Slack @channel); warning/info land in the #tlsstress-observability firehose. See alertmanager.yml.
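
The default classifications in the table can be encoded as a small lookup, for example in a triage script or a chat-ops helper. The following is a minimal sketch under stated assumptions: the function names `sev_for_alert` and `cadence_minutes` are hypothetical, and the mapping covers only the alerts named in the table. Anything else is left for a human to classify.

```sh
# Hypothetical helper: map an alert name to its default severity from the
# table above. Unknown alerts return "unclassified" so a human decides.
sev_for_alert() {
  case "$1" in
    ProvisioningJobsStuck|TokenLedgerOverspentUtxo|AppDBDown|RDSStorageCritical|R2BucketPublic)
      echo "SEV-1" ;;
    RDSConnectionsHigh|RDSStorageLow|TokenLedgerStaleBillingWebhooks|StripeFailedPaymentsSpike|DashboardDBDown)
      echo "SEV-2" ;;
    JourneyVantageChallenged|CloudflareThreatSpike|RUMPoorLCP)
      echo "SEV-3" ;;
    *)
      echo "unclassified" ;;
  esac
}

# Comms cadence in minutes for a severity; 0 means "update at resolution".
cadence_minutes() {
  case "$1" in
    SEV-1) echo 30 ;;
    SEV-2) echo 120 ;;
    *)     echo 0 ;;
  esac
}
```

The lookup only gives a starting point: severity is still set by customer and financial impact, and can be raised or lowered during the incident.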


2. The on-call flow (triage → mitigate → resolve → postmortem)

```mermaid
flowchart TD
    A([Alert fires / report received]) --> B{Acknowledge<br/>within 5 min}
    B -->|critical pager| C[Open the incident:<br/>set SEV, start a thread]
    B -->|Slack firehose| C
    C --> D{Classify severity<br/>by money/customer impact}
    D -->|SEV-1| E[Declare incident:<br/>IC + scribe + comms]
    D -->|SEV-2/3| F[Single responder]
    E --> G[TRIAGE: read the alert's<br/>runbook_url + golden dashboards]
    F --> G
    G --> H{Root cause known<br/>or mitigation obvious?}
    H -->|No| I[Escalate to domain owner<br/>see §6 escalation]
    I --> G
    H -->|Yes| J[MITIGATE: stop the bleeding<br/>self-heal / rollback / scale]
    J --> K{Customer impact<br/>stopped?}
    K -->|No| L[Raise severity,<br/>widen escalation]
    L --> J
    K -->|Yes| M[RESOLVE: verify alert cleared,<br/>data reconciled, no backlog]
    M --> N{SEV-1 or SEV-2?}
    N -->|Yes| O[POSTMORTEM<br/>within 5 business days]
    N -->|No| P([Close incident])
    O --> Q[Action items tracked<br/>to completion]
    Q --> P
```

2.1 Triage (first 10 minutes)

  1. Acknowledge the page so the team knows it is owned (silence the pager re-page).
  2. Read the alert. Every critical money/DB alert carries a runbook_url — follow it. The summary and description annotations are written to tell you what broke and the first action.
  3. Open the golden dashboards (…/grafana): cloud-overview, business-revenue, aws-infra, and the operator cockpit for DUT-side incidents. Confirm the alert is real (not a single flaky vantage) and find the blast radius.
  4. Check for a louder signal. A critical often comes with correlated warnings (e.g. RDSConnectionsHigh alongside AppDBDown). The inhibit_rules in Alertmanager suppress the derived warnings, so trust the critical.
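
Steps 2 and 4 can be done from a terminal by asking Alertmanager which critical alerts are active and which runbook each one links to. This is a sketch, not the team's tooling: it assumes Alertmanager is reachable at its default port 9093 on localhost (for example from the obs box), that `jq` is installed, and the function names are hypothetical. It uses the Alertmanager v2 API endpoint `GET /api/v2/alerts`.

```sh
# Assumption: Alertmanager listens on localhost:9093 (its default port).
AM_URL="${AM_URL:-http://localhost:9093}"

# Reads the JSON array returned by GET /api/v2/alerts on stdin and prints
# "alertname<TAB>runbook_url" per alert ("-" when the annotation is missing).
alerts_with_runbooks() {
  jq -r '.[] | [.labels.alertname, (.annotations.runbook_url // "-")] | @tsv'
}

# Fetches active, non-inhibited, non-silenced critical alerts and lists them
# with their runbook links.
list_critical_alerts() {
  curl -fsS --get "$AM_URL/api/v2/alerts" \
    --data-urlencode 'filter=severity="critical"' \
    --data-urlencode 'active=true' \
    --data-urlencode 'inhibited=false' \
    --data-urlencode 'silenced=false' | alerts_with_runbooks
}
```

Run `list_critical_alerts` to get one line per firing critical. Because inhibited alerts are filtered out, the list should already reflect the "trust the critical" rule.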

2.2 Mitigate

Stop customer impact with the fastest safe lever, even if it is temporary: self-heal (let obs-db-selfheal redeploy), roll back the last deploy, scale the resource, or feature-flag the broken path. Document every action in the thread — the postmortem needs the timeline.

2.3 Resolve

An incident is resolved only when all of these hold:

  • The firing alert has cleared (and stays clear through one repeat_interval).
  • Customer-facing impact is gone (re-run the relevant synthetic / journey).
  • Data is reconciled: no stuck provisioning job, no over-spent UTXO, no unprocessed billing webhook backlog.
  • The mitigation is durable, or a follow-up to make it durable is filed.

2.4 Postmortem

Every SEV-1 and SEV-2 gets a blameless postmortem within 5 business days (template in §8). SEV-3s get a one-line note in the incident thread.


3. Key money / provisioning / data alerts

These are the alerts that, when they fire, mean a customer is or may be paying for nothing — or that the financial ledger is wrong. Treat them as SEV-1 unless triage proves otherwise. Source of truth for the ledger alerts: k8s/74-token-ledger-prometheus-rules.yaml.

3.1 ProvisioningJobsStuck — paid customer, no tenant (SEV-1)

  • Expr: tlsstress_provisioning_stuck > 0 for 30m · severity: critical.
  • Means: one or more provisioning jobs are stuck (>15 min without progress, or ≥8 attempts). A customer who paid did not receive their tenant. This is the worst class of incident: silent money loss with an unhappy customer.
  • First actions:
      • Read the alert's runbook_url (RUNBOOK #provisioning-stuck).
      • Check the Temporal orchestrator for the failed workflow(s) and the reconcile-provisioning cron: is it running and making progress?
      • Identify the affected customer(s) and the failing saga step. Do not blindly re-drive: a naive re-drive can mint a second dpl_ deployment or zero out a paid quota. Re-drive only the failed step.
      • If you cannot provision within the SLA, notify the customer (§7) and prepare a refund/credit path with finance.
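
The "stuck" definition above (more than 15 minutes without progress, or at least 8 attempts) is easy to misread under pressure, so it can help to spell it out as a predicate. This is an illustrative sketch only: `job_is_stuck` is a hypothetical name, and how the idle time and attempt count are obtained (Temporal, a DB query) is outside its scope.

```sh
# Succeeds (exit 0) when a job meets the "stuck" definition:
# more than 15 minutes without progress, or at least 8 attempts.
# Both arguments are plain integers.
job_is_stuck() {
  idle_minutes=$1
  attempts=$2
  [ "$idle_minutes" -gt 15 ] || [ "$attempts" -ge 8 ]
}
```

For example, `job_is_stuck 16 1` and `job_is_stuck 5 8` both succeed, while `job_is_stuck 15 7` does not.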

3.2 AppDBDown — the app cannot reach its database (SEV-1)

  • Expr: probe_success{job="blackbox-ready"} == 0 for 2m · severity: critical.
  • Means: the deep /api/ready probe (which runs SELECT 1 on every pool) is failing → the app is up but its DB is unreachable, or the app is down. This is the alert the 2026-06-16 admin outage lacked: the app served a graceful HTTP 200 from the shallow /api/health while its DB was unreachable, so nothing fired for ~33 hours.
  • First actions (full RCA in RUNBOOK #61-db-unreachable):
      • Run curl -s https://<app>.tlsstress.art/api/ready | jq to see which DB check is ok:false.
      • obs-db-selfheal already redeploys the App Runner service on a 503 (re-fetches the secret + rebuilds the pool). Watch the deployment; if it recovers, you are done.
      • Persists? Verify RDS is available (aws rds describe-db-instances) and that the app's DATABASE_URL authenticates as its dedicated non-master role (octopus_admin_app / tlsstress_app), never the rotated master tlsstress_admin. RDS rotates the managed master secret every ~7 days; an app pointed at the master will break on every rotation.
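
The role check in the last step can be done without printing the password by extracting only the user part of the connection string. A minimal sketch, assuming DATABASE_URL has the usual `postgres://user:password@host:port/db` shape with credentials present; the function names are hypothetical.

```sh
# Prints the user part of a postgres://user:password@host:port/db URL.
# Assumes the URL carries credentials; without them the result is meaningless.
db_url_user() {
  rest=${1#*://}                  # drop the scheme
  creds=${rest%%@*}               # keep everything before the first "@"
  printf '%s\n' "${creds%%:*}"    # keep everything before the first ":"
}

# Fails (non-zero) if the URL authenticates as the rotated master role.
assert_not_master() {
  user=$(db_url_user "$1")
  if [ "$user" = "tlsstress_admin" ]; then
    echo "DATABASE_URL uses the rotated master role: $user" >&2
    return 1
  fi
  echo "ok: connects as $user"
}
```

Call it as `assert_not_master "$DATABASE_URL"` wherever the app's environment is readable. Only the user name is echoed, so the output is safe to paste into the incident thread.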

The 2026-06-16 lesson, encoded. Apps now connect as dedicated non-master roles, /api/ready is a deep probe, AppDBDown fires in ~2 min, and obs-db-selfheal redeploys automatically. If you ever see "DB indisponível" in an app but blackbox is green, you are looking at a shallow health check — probe /api/ready, not /api/health.
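
The shallow-versus-deep distinction can be checked directly by comparing the two endpoints' status codes. The sketch below assumes only that a healthy probe returns HTTP 200 and that anything else is unhealthy; the function names and the verdict strings are illustrative.

```sh
# Prints the HTTP status code for a URL (curl prints 000 if the request fails).
http_code() {
  curl -s -o /dev/null -m 10 -w '%{http_code}' "$1" || true
}

# Compares the shallow (/api/health) and deep (/api/ready) probes for one app.
probe_verdict() {
  base=$1
  health=$(http_code "$base/api/health")
  ready=$(http_code "$base/api/ready")
  if [ "$ready" = "200" ]; then
    echo "ok: deep probe green (ready=$ready)"
  elif [ "$health" = "200" ]; then
    echo "shallow-green: health=$health but ready=$ready, suspect the DB"
  else
    echo "down: health=$health ready=$ready"
  fi
}
```

For example, `probe_verdict https://app.tlsstress.art`. A "shallow-green" verdict is the 2026-06-16 failure shape: the app answers, but its database does not.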

3.3 Ledger integrity — the token-economy alerts (mixed SEV)

The token-economy ledger is the financial source of truth. Its alerts live in k8s/74-token-ledger-prometheus-rules.yaml and scrape app.tlsstress.art/api/metrics (bearer from AWS Secrets Manager tlsstress-phase0/metrics-token).

| Alert | Expr | Sev | Means + first action |
|---|---|---|---|
| TokenLedgerOverspentUtxo | tlsstress_utxo_overspent > 0 (1m) | critical | Ledger corruption: a note with spent_amount > amount. The DB CHECK makes this impossible, so a non-zero value is a deep bug. Freeze redemptions, investigate immediately, do not let it propagate. |
| ProvisioningJobsStuck | tlsstress_provisioning_stuck > 0 (30m) | critical | See §3.1: paid customer with no tenant. |
| TokenLedgerScrapeDown | absent(tlsstress_tsu_circulating) (10m) | warning | No ledger metrics for 10m: METRICS_TOKEN rotated without updating the mounted secret, or /api/metrics is down. You are blind to all ledger alerts until fixed. |
| TokenLedgerStaleBillingWebhooks | tlsstress_webhooks_stale_billing > 0 (30m) | warning | Stripe billing webhooks stuck non-terminal >1h; a credit may not have applied (customer paid, balance not topped up). Check the webhook processor + Stripe dashboard. |
| TokenLedgerStaleActiveTickets | tlsstress_tickets_stale_active > 0 (30m) | warning | Tickets ACTIVE past expiry+24h; the tlsstress-expire-tickets-hourly sweep is sick. |
| TokenLedgerOutboxBacklog | tlsstress_outbox_pending > 100 (30m) | warning | Webhook outbox piling up (>100 undelivered). Check deliver-webhooks + receiver health. |
| TokenLedgerUsageWithoutLiveness | tlsstress_usage_without_liveness > 0 (1h) | warning | A license reported usage in 24h without a heartbeat: possible token tamper / clock-skew / copied-token replay. Billing is covered by L2; this is an integrity signal. |
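
When Prometheus itself is in doubt, the thresholds in the table can be checked by hand against a raw dump of the metrics endpoint. The sketch below reads Prometheus text exposition format on stdin and prints the name of every ledger alert whose condition currently holds. It is a simplification: it ignores the `for` durations, assumes label values contain no spaces, and `ledger_conditions` is a hypothetical name.

```sh
# Reads Prometheus text-format metrics on stdin and prints the ledger alerts
# whose threshold is currently met (instantaneous check, no "for" duration).
ledger_conditions() {
  awk '
    /^#/ || NF < 2 { next }            # skip comments and blank lines
    {
      name = $1
      sub(/[{].*/, "", name)           # drop the label set, if any
      # keep the highest value seen per metric name
      if (!(name in max) || $2 + 0 > max[name]) max[name] = $2 + 0
    }
    END {
      if (!("tlsstress_tsu_circulating" in max))       print "TokenLedgerScrapeDown"
      if (max["tlsstress_utxo_overspent"] > 0)         print "TokenLedgerOverspentUtxo"
      if (max["tlsstress_provisioning_stuck"] > 0)     print "ProvisioningJobsStuck"
      if (max["tlsstress_webhooks_stale_billing"] > 0) print "TokenLedgerStaleBillingWebhooks"
      if (max["tlsstress_tickets_stale_active"] > 0)   print "TokenLedgerStaleActiveTickets"
      if (max["tlsstress_outbox_pending"] > 100)       print "TokenLedgerOutboxBacklog"
      if (max["tlsstress_usage_without_liveness"] > 0) print "TokenLedgerUsageWithoutLiveness"
    }
  '
}
```

Pipe a saved copy of the /api/metrics output into `ledger_conditions`. Empty output means none of the listed conditions hold at that instant, which is also a quick way to confirm the "data is reconciled" part of resolution.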

3.4 RDS saturation — page before the 503 (SEV-2 → SEV-1)

AppDBDown only fires once the DB is fully unreachable. The aws-rds-metrics exporter pages on saturation first (thresholds tuned for db.t4g.micro, ~112 max connections, 1 GB RAM — re-tune on resize):

  • RDSConnectionsHigh (> 90, warning): pool exhaustion approaching; new connections are refused before /api/ready flips. Check PgBouncer or look for a connection leak.
  • RDSStorageLow (< 2 GB, warning) / RDSStorageCritical (< 1 GB, critical): RDS hard-stops writes when storage runs out. Grow allocated storage now: aws rds modify-db-instance --allocated-storage.
  • RDSCPUHigh (> 85%, warning) / RDSMemoryLow (< 100 MB, warning): load pressure / OOM risk.
  • RDSMetricsBlind (rds_metrics_scrape_ok == 0, warning): the exporter cannot read CloudWatch (expired RO key / IAM / throttling). All RDS alerts above are silent until this recovers.
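
The storage fix above needs a full command line, and it is safer to print it for review than to type it from memory during an incident. The sketch below only builds the command; it does not run it. The helper name, the 20 percent default growth, and the instance identifier you pass in are assumptions. RDS documents a minimum increase (about 10 percent at the time of writing), so check the current limits before applying.

```sh
# Prints (does not run) the AWS CLI command that grows allocated storage.
# Usage: rds_grow_cmd <db-instance-identifier> <current-GiB> [growth-percent]
rds_grow_cmd() {
  id=$1
  current=$2
  pct=${3:-20}
  # Integer ceiling of current * (100 + pct) / 100.
  target=$(( (current * (100 + pct) + 99) / 100 ))
  echo "aws rds modify-db-instance --db-instance-identifier $id --allocated-storage $target --apply-immediately"
}
```

For a 20 GiB instance, `rds_grow_cmd <your-instance-id> 20` prints a command that targets 24 GiB. Paste it into the incident thread before running it so the action lands in the timeline.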

3.5 DUT-side operator cockpit — DashboardDown / DashboardDBDown (SEV-2)

These fire on the on-prem Prometheus, not the cloud box (observability/prometheus/alerts/web-agent-alerts.yml):

  • DashboardDown (up{job="dashboard"} == 0, critical) — the operator cockpit process/scrape is unreachable. Per CLAUDE.md, the dashboard is the only operator interface, so this blinds the operator.
  • DashboardDBDown (dashboard_db_up == 0, critical) — the cockpit is up but its Postgres is unreachable (the 2026-06-16 silent-DB failure class, mirrored DUT-side). The cockpit shows frozen data and a red "sem dados (DB)" live badge. Check the on-prem PgBouncer / Postgres; the dashboard /api/ready is the deep probe.

3.6 Business / payment signals (SEV-2/3)

From the cloud business group (Stripe RO poller):

  • StripeFailedPaymentsSpike (> 5 today, warning): leading indicator of a card/processor issue or fraud.
  • StripeOpenDisputes (> 0, warning): chargebacks need a human response inside Stripe's window.


4. The dead-man's-switch (who watches the watcher)

Two meta-alerts keep the platform honest:

  • Watchdog (vector(1), always firing) routes to the deadman receiver → an external healthchecks.io heartbeat. If the heartbeat stops arriving, which is what happens when the whole obs stack is down and can no longer alert, the external service pages you. This is the only alert you want to keep firing.
  • ObsComponentDown (up{job=~"grafana|loki|tempo|vector|prometheus|…"} == 0, critical) — an obs component itself is down; the platform may be partially blind.

If you stop getting any alerts and the box looks fine, suspect the alerting path (Slack webhook rotated, Alertmanager wedged) before assuming all is well.
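
One way to test the alerting path end to end is to push a harmless synthetic alert into Alertmanager and confirm it shows up in Slack. This is a sketch under assumptions: Alertmanager at its default localhost:9093, the v2 API endpoint `POST /api/v2/alerts`, and a hypothetical alert name. It uses severity info so that, per the routing described in §1, it lands in the firehose rather than on the critical pager.

```sh
# Assumption: Alertmanager listens on localhost:9093 (its default port).
AM_URL="${AM_URL:-http://localhost:9093}"

# Prints a one-alert payload for POST /api/v2/alerts.
# severity=info keeps it off the critical pager route.
synthetic_alert_json() {
  printf '[{"labels":{"alertname":"%s","severity":"info"},"annotations":{"summary":"%s"}}]\n' \
    "${1:-AlertPathSelfTest}" "manual alerting-path test, safe to ignore"
}

# Sends the synthetic alert. With no endsAt set, Alertmanager resolves it on
# its own after resolve_timeout (5 minutes by default).
send_synthetic_alert() {
  synthetic_alert_json "$1" | curl -fsS -X POST \
    -H 'Content-Type: application/json' \
    --data-binary @- "$AM_URL/api/v2/alerts"
}
```

If `send_synthetic_alert` succeeds but nothing reaches Slack, the break is between Alertmanager and the receiver (for example a rotated webhook). If the request itself fails, Alertmanager is the problem.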


5. First-response cheat sheet

```bash
# Cloud obs box (Hetzner)
ssh -i ~/.ssh/tlsstress_f2_hetzner root@89.167.3.1
cd /opt/obs && docker compose ps                 # is the stack healthy?
docker compose logs -f alertmanager              # is alerting flowing?

# Is an app's DB actually reachable? (deep probe, the 2026-06-16 lesson)
curl -s https://app.tlsstress.art/api/ready   | jq
curl -s https://admin.tlsstress.art/api/ready | jq

# Ledger metrics alive? (bearer in AWS SM tlsstress-phase0/metrics-token)
#   tlsstress_tsu_circulating absent  → TokenLedgerScrapeDown
#   tlsstress_provisioning_stuck > 0  → ProvisioningJobsStuck
#   tlsstress_utxo_overspent     > 0  → TokenLedgerOverspentUtxo (corruption)

# RDS state when AppDBDown / RDS* fires
aws rds describe-db-instances --query 'DBInstances[].DBInstanceStatus'
```
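
The ledger check in the cheat sheet is described only in comments. A possible way to run it is sketched below. It assumes the secret's SecretString holds the raw token (not a JSON document) and that the endpoint expects a standard `Authorization: Bearer` header; neither is confirmed here, and the function names are hypothetical.

```sh
# Assumption: SecretString is the raw bearer token, not JSON.
metrics_token() {
  aws secretsmanager get-secret-value \
    --secret-id tlsstress-phase0/metrics-token \
    --query SecretString --output text
}

# Prints only the tlsstress_* series from the app's metrics endpoint.
# Exits non-zero if the token cannot be read, the request fails, or no
# tlsstress_* series are present.
ledger_metrics() {
  token=$(metrics_token) || return 1
  curl -fsS -H "Authorization: Bearer $token" \
    https://app.tlsstress.art/api/metrics | grep '^tlsstress_'
}
```

If `ledger_metrics` prints nothing, check token rotation first: a rotated METRICS_TOKEN that was not updated in the mounted secret is the cause named for TokenLedgerScrapeDown.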

6. Escalation

Escalate when: you cannot mitigate within the first 30 minutes of a SEV-1, the blast radius is growing, the fix needs a privileged action you do not own, or money/data integrity is in doubt.

| Domain | Signal | Owner / path |
|---|---|---|
| Provisioning / saga | ProvisioningJobsStuck | Temporal orchestrator owner; reconcile-provisioning cron; do not blindly re-drive |
| Database / RDS | AppDBDown, RDS*, DashboardDBDown | DB on-call; in-VPC DDL via the NAT instance (SSM): open the temp SG ingress to sg-02bf33572b96f2855 + temp secret read, remove both after |
| Money / ledger | TokenLedger*, Stripe* | Finance + ledger owner; freeze redemptions on corruption |
| Edge / Cloudflare | EdgeVantageDown, CloudflareThreatSpike | Edge owner; WAF / DNS |
| Backups / DR | R2Backup*, R2BucketPublic, R2BucketLockDisabled | DR owner; R2BucketPublic is a security SEV-1, close public access now |

The pager. critical alerts re-page hourly and @channel Slack. For guaranteed overnight phone paging, wire the pre-built slot in alertmanager.yml (ntfy.sh $0, PagerDuty free tier, or SMTP) — see RUNBOOK §6.3. Until then a 3 a.m. critical relies on someone watching Slack.


7. Communications

| Audience | When | Channel | Content |
|---|---|---|---|
| Internal (incident thread) | At declaration, then every cadence (§1) | #tlsstress-observability | SEV, impact, current hypothesis, next update time |
| Affected customer(s) | SEV-1 with customer impact (e.g. failed provisioning), as soon as scoped | Email from @tlsstress.art | What is affected, that you are on it, ETA or next update; no internal jargon, no blame |
| Status page | Customer-visible outage | https://status.tlsstress.art | Plain-language component status; update on mitigate + resolve |

Rules of thumb: one incident channel (no parallel threads); the Incident Commander owns the comms cadence; never promise a root cause before you have one; under-promise on ETAs. For provisioning/refund cases, loop in finance early.


8. Postmortem template

Copy this into docs/postmortems/YYYY-MM-DD-<slug>.md. Blameless — focus on systems and signals, never individuals.

```markdown
# Postmortem: <short title> (<YYYY-MM-DD>)

- **Severity:** SEV-_  · **Duration:** <detect→resolve>  · **Customer impact:** <who/what/$>
- **Detected by:** <alert name | customer report | manual>  · **Detection lag:** <impact-start → first alert>
- **Incident Commander:** <name>  · **Scribe:** <name>

## Summary
<2–3 sentences: what broke, who it affected, how it was resolved.>

## Timeline (UTC)
| Time | Event |
|---|---|
| 03:28 | <e.g. RDS master secret rotated> |
| ...   | <first alert / page> |
| ...   | <mitigation applied> |
| ...   | <resolved> |

## Root cause
<The actual chain of causation. Use "5 whys". Distinguish the trigger from the
underlying condition that made it possible.>

## Detection
<Did the right alert fire? How fast? If detection lagged (the 2026-06-16
~33h-blind case: shallow /api/health, graceful 200, no DB alert), that gap is
itself an action item.>

## Resolution & recovery
<What stopped the bleeding. Was data reconciled (stuck jobs cleared, ledger
consistent, webhook backlog drained)?>

## What went well / what went poorly
- Went well: <e.g. self-heal redeployed automatically>
- Went poorly: <e.g. no phone pager → 4h to acknowledge>

## Action items (owner · due · tracking)
| Action | Owner | Due | Link |
|---|---|---|---|
| <e.g. add deep /api/ready alert> | | | |

## Lessons / preventions
<What systemic change prevents this *class* of incident, not just this instance.>
```
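
The file name convention docs/postmortems/YYYY-MM-DD-<slug>.md can be generated rather than typed by hand. The helper below is a sketch: the function names are hypothetical, and the slug rule (lowercase, runs of non-alphanumerics collapsed to a single hyphen) is an assumption, since the runbook does not define one.

```sh
# Lowercases the title, turns runs of non-alphanumerics into single hyphens,
# and trims a hyphen at either end.
slugify() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[^a-z0-9][^a-z0-9]*/-/g' -e 's/^-//' -e 's/-$//'
}

# Usage: postmortem_path <YYYY-MM-DD> <short title>
postmortem_path() {
  printf 'docs/postmortems/%s-%s.md\n' "$1" "$(slugify "$2")"
}
```

For example, `postmortem_path 2026-06-16 "Admin DB unreachable"` prints `docs/postmortems/2026-06-16-admin-db-unreachable.md`.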