Runbook: Provisioning Stuck (`ProvisioningJobsStuck`)

Referenced by the ProvisioningJobsStuck Prometheus alert (k8s/74-token-ledger-prometheus-rules.yaml). Severity: critical.

TL;DR

A paying customer has not received their tenant. A settled Stripe checkout was recorded in provisioning_jobs, but the Go orchestrator's HMAC write-back callback never flipped that row to provisioned. The reconcile-provisioning cron normally re-drives such rows automatically; this alert means at least one row has stalled for more than 15 minutes or has exhausted its retry budget, and it now needs an operator.


| Alert | `ProvisioningJobsStuck` |
|---|---|
| Expr | `tlsstress_provisioning_stuck > 0` for `30m` |
| Gauge source | `countStuckProvisioningJobs()` via `GET /api/metrics` |
| Owner | Customer-app / billing on-call |
| Customer impact | Paid customer is stranded: no `dpl_` deployment, account not `active` |
| Self-healing | `reconcile-provisioning` cron (every 10 min), only up to `MAX_ATTEMPTS` |

1. What the alert means

The provisioning hand-off is a durable flow with four hand-off actors plus a watchdog reconciler. Each settled checkout produces exactly one provisioning_jobs row (unique on stripe_session_id).

| Actor | File | Responsibility |
|---|---|---|
| Stripe webhook | (billing webhook handler) | Persists the hand-off via `recordProvisioningJob()` and calls `enqueueOnboarding()` |
| Enqueue hand-off | `src/lib/provisioning/enqueue.ts` | HMAC-POSTs `OnboardingInput` to the orchestrator's `/trigger/onboarding` |
| Go orchestrator | Temporal `OnboardingV1` saga | Mints the real `dpl_` deployment, provisions the tenant |
| Write-back callback | `src/app/api/internal/provisioning/callback/route.ts` | `linkProvisionedAccount()` → flips job `provisioned` + account `active` |
| Reconciler (watchdog) | `src/app/api/cron/reconcile-provisioning/route.ts` | Re-drives stuck rows on a 10-min cron |

A job is STUCK (counted by the gauge) when it is non-terminal (enqueued / provisioning / failed) AND one of:

no progress for > 15 minutes (updated_at < now() - interval '15 minutes'), or
already at/over the retry cap (attempts >= 8).

See countStuckProvisioningJobs() in src/lib/db/queries.ts. The 15-min / 8-attempt floor exists so a normal in-flight job (a few minutes old) does not trip the gauge.
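The stuck rule above can be restated as a small predicate. This is a sketch for reasoning about individual rows, not the code in `src/lib/db/queries.ts`; the `JobRow` shape and the constant names are assumptions made for illustration.

```typescript
// Hypothetical row shape: only the fields the stuck rule needs.
type JobStatus = "enqueued" | "provisioning" | "failed" | "provisioned";

interface JobRow {
  status: JobStatus;
  attempts: number;
  updatedAt: Date;
}

const STUCK_AFTER_MS = 15 * 60 * 1000; // "no progress for > 15 minutes"
const MAX_ATTEMPTS = 8; // retry cap quoted in this runbook

// True when the row would be counted as stuck under the rule described above.
function isStuck(job: JobRow, now: Date): boolean {
  if (job.status === "provisioned") return false; // done, never counted
  const stalled = now.getTime() - job.updatedAt.getTime() > STUCK_AFTER_MS;
  const capped = job.attempts >= MAX_ATTEMPTS;
  return stalled || capped;
}
```

A job updated five minutes ago with one attempt is not counted; the same job with `attempts = 8` is counted regardless of how recently it was updated.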

Why a hand-off gets stuck

Trigger unreachable — PROVISIONING_TRIGGER_URL / PROVISIONING_TRIGGER_SECRET unset or the orchestrator endpoint down → enqueueOnboarding() returns enqueued:false reason:"trigger_not_configured" (or fetch_error / http_5xx).
Orchestrator failed mid-saga — Temporal exhausted its own retries; callback reported status:"failed" (markProvisioningJobFailed()).
Callback never landed — orchestrator provisioned but its write-back POST never reached /api/internal/provisioning/callback (network, bad signature → 401, or the callback secret unset → 503).
NULL payload — a pre-migration-0039 row has trigger_payload = NULL; the reconciler refuses to guess the package/quota and skips it for manual handling.
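As a first-pass triage aid, the four causes above can be mapped from the columns of a stalled row. The sketch below is a heuristic only: the `StalledRow` shape and the cause labels are hypothetical, and it assumes `last_error` stores the literal strings quoted in this runbook. Confirm the guess against the orchestrator before acting on it.

```typescript
// Hypothetical subset of a provisioning_jobs row used for triage.
interface StalledRow {
  status: "enqueued" | "provisioning" | "failed";
  lastError: string | null;
  payloadMissing: boolean; // trigger_payload IS NULL
}

type StallCause =
  | "null_payload"
  | "trigger_unreachable"
  | "orchestrator_failed"
  | "callback_never_landed";

// last_error values this runbook attributes to the trigger transport.
const TRANSPORT_ERRORS: string[] = ["trigger_not_configured", "fetch_error", "http_5xx"];

// Heuristic first guess, checked in the same order an operator would triage.
function guessStallCause(row: StalledRow): StallCause {
  if (row.payloadMissing) return "null_payload";
  if (row.lastError !== null && TRANSPORT_ERRORS.indexOf(row.lastError) !== -1) {
    return "trigger_unreachable";
  }
  if (row.status === "failed") return "orchestrator_failed";
  // Nothing reported an error, yet the row never flipped: suspect the callback.
  return "callback_never_landed";
}
```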

2. Operator decision flow

flowchart TD
    A[ProvisioningJobsStuck fired] --> B[Query provisioning_jobs<br/>WHERE status != 'provisioned']
    B --> C{Any rows with<br/>trigger_payload IS NULL?}
    C -- Yes --> D[NULL-payload path<br/>see Section 6 — hand-link manually]
    C -- No --> E{attempts >= 8?}
    E -- No, &lt;8 --> F{last_attempt_at older<br/>than 5 min?}
    F -- No --> G[Reconciler will pick it up<br/>next 10-min tick — WAIT]
    F -- Yes --> H[Force a reconcile run<br/>see Section 4]
    E -- Yes, capped --> I{Is the orchestrator<br/>actually reachable now?}
    I -- No --> J[Fix transport first:<br/>TRIGGER_URL/SECRET, orchestrator health]
    J --> K[Reset attempts, then<br/>force reconcile — Section 5]
    I -- Yes --> L{Did the tenant<br/>actually provision?<br/>check orchestrator/Temporal}
    L -- Yes, dpl_ exists --> M[Hand-link the job<br/>replay the callback — Section 6]
    L -- No --> K
    M --> N[Verify account active + dpl_ set]
    H --> N
    K --> N
    N --> O{Resolved?}
    O -- Yes --> P[Confirm gauge → 0, close alert]
    O -- No --> Q[Escalate — Section 7]
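The same decision flow can be written as a pure function, which is handy for checking a row against the flowchart by hand. This is a sketch: the `TriageInput` shape and the action names are hypothetical, the two boolean inputs are facts the operator establishes manually, and treating a NULL `last_attempt_at` as stale is borrowed from the reconciler's own selection rule rather than stated in the flowchart.

```typescript
// Hypothetical inputs; the last two are established by the operator.
interface TriageInput {
  payloadMissing: boolean; // trigger_payload IS NULL
  attempts: number;
  lastAttemptAt: Date | null;
  orchestratorReachable: boolean;
  tenantProvisioned: boolean; // a real dpl_ exists in the orchestrator/Temporal
}

type TriageAction =
  | "hand_link_null_payload" // Section 6
  | "wait_for_next_tick"
  | "force_reconcile" // Section 4
  | "fix_transport_then_rearm" // Section 5
  | "replay_callback" // Section 6
  | "rearm_and_force_reconcile"; // Section 5

const TRIAGE_RETRY_CAP = 8;
const TRIAGE_STALE_MS = 5 * 60 * 1000;

function nextAction(job: TriageInput, now: Date): TriageAction {
  if (job.payloadMissing) return "hand_link_null_payload";
  if (job.attempts < TRIAGE_RETRY_CAP) {
    const stale =
      job.lastAttemptAt === null ||
      now.getTime() - job.lastAttemptAt.getTime() > TRIAGE_STALE_MS;
    return stale ? "force_reconcile" : "wait_for_next_tick";
  }
  // Capped: transport health decides everything that follows.
  if (!job.orchestratorReachable) return "fix_transport_then_rearm";
  return job.tenantProvisioned ? "replay_callback" : "rearm_and_force_reconcile";
}
```

Every branch still ends at the verification step in Section 8, and at escalation (Section 7) if verification fails.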

3. Inspect `provisioning_jobs`

Read-only first. Connect via the in-VPC admin path (App Runner egress is behind Cloudflare). All queries hit the octopus schema on the customer RDS.

3.1 List every non-provisioned job, newest stall first

SELECT
  id,
  account_id,
  stripe_session_id,
  status,
  attempts,
  last_error,
  deployment_id,
  cell_id,
  (trigger_payload IS NULL) AS payload_missing,
  last_attempt_at,
  updated_at,
  created_at
FROM provisioning_jobs
WHERE status <> 'provisioned'
ORDER BY updated_at ASC;

3.2 Reproduce the gauge (exactly what the alert counts)

SELECT count(*) AS stuck
FROM provisioning_jobs
WHERE status IN ('enqueued', 'provisioning', 'failed')
  AND (updated_at < now() - interval '15 minutes' OR attempts >= 8);

This must match tlsstress_provisioning_stuck from GET /api/metrics.
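To compare the SQL count with the exported gauge, the metrics response can be parsed with a few lines of code. The sketch below assumes `GET /api/metrics` returns the standard Prometheus text exposition format, which the runbook does not state explicitly; `readGauge` is a hypothetical helper, and the regular expression is a simplification that does not handle `}` inside quoted label values.

```typescript
// Returns the first sample value for `name`, or null if the metric is absent.
function readGauge(metricsText: string, name: string): number | null {
  for (const line of metricsText.split("\n")) {
    const trimmed = line.trim();
    // Skip blank lines and "# HELP" / "# TYPE" comments.
    if (trimmed === "" || trimmed.charAt(0) === "#") continue;
    // Sample line: metric_name{optional="labels"} value [timestamp]
    const match = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)/.exec(trimmed);
    if (match !== null && match[1] === name) return Number(match[3]);
  }
  return null;
}
```

For example, `readGauge(body, "tlsstress_provisioning_stuck")` should return the same number as the `stuck` column of the query above; a `null` result means the gauge is not being exported at all, which is worth investigating separately.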

3.3 Triage a single session

SELECT * FROM provisioning_jobs
WHERE stripe_session_id = 'cs_live_...';

Read the columns:

| Column | What it tells you |
|---|---|
| `status` | `enqueued` (handed off, awaiting callback) · `provisioning` (orchestrator working) · `failed` (error reported; still eligible for re-drive) · `provisioned` (DONE) |
| `attempts` | Re-drive count. `>= 8` = at or past `MAX_ATTEMPTS`, reconciler has given up |
| `last_error` | `trigger_not_configured` · `http_5xx` · `fetch_error` · or the orchestrator's error string |
| `deployment_id` | `NULL` until callback writes the real `dpl_` |
| `trigger_payload` | `NULL` ⇒ reconciler cannot replay (Section 6) |
| `last_attempt_at` | Drives the 5-min staleness check the reconciler uses |

4. How the reconciler re-drives (and how to force a run)

The reconcile-provisioning cron runs every 10 minutes (K8s CronJob infra/cron/reconcile-provisioning-cronjob.yaml; prod = EventBridge → App Runner). Each tick POSTs /api/cron/reconcile-provisioning with x-cron-secret: $CRON_SECRET.

Per run it:

Selects stale jobs via getStaleProvisioningJobs(): status IN (enqueued, failed, provisioning) AND attempts < 8 AND (last_attempt_at IS NULL OR last_attempt_at < now-5min), oldest first, batch ≤ 25.
For each row with a non-NULL trigger_payload, replays the exact original payload through enqueueOnboarding() (so package_slug → correct token quota — never a guess).
Bumps attempts and records the outcome via recordProvisioningJob() (which never downgrades an already-provisioned row).
A job that later succeeds is flipped to provisioned by the callback and drops out of the stale scan. A job past MAX_ATTEMPTS (8) stops being retried and stays visible on the gauge.
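The selection rule can be mirrored in memory to predict which rows the next tick will pick up. This is a sketch rather than the body of `getStaleProvisioningJobs()`: the `StaleCandidate` shape is hypothetical, and "oldest first" is assumed here to mean ordering by `created_at`, which the runbook does not specify.

```typescript
// Hypothetical in-memory view of a provisioning_jobs row.
interface StaleCandidate {
  id: string;
  status: string;
  attempts: number;
  lastAttemptAt: Date | null;
  createdAt: Date;
}

const RETRYABLE_STATUSES: string[] = ["enqueued", "failed", "provisioning"];
const RECONCILE_RETRY_CAP = 8;
const RECONCILE_STALE_MS = 5 * 60 * 1000;

// Rows the reconciler would re-drive on this tick, oldest first, capped at `batch`.
function selectStale(rows: StaleCandidate[], now: Date, batch: number = 25): StaleCandidate[] {
  const cutoff = now.getTime() - RECONCILE_STALE_MS;
  return rows
    .filter(
      (r) =>
        RETRYABLE_STATUSES.indexOf(r.status) !== -1 &&
        r.attempts < RECONCILE_RETRY_CAP &&
        (r.lastAttemptAt === null || r.lastAttemptAt.getTime() < cutoff),
    )
    .sort((a, b) => a.createdAt.getTime() - b.createdAt.getTime()) // assumed ordering
    .slice(0, batch);
}
```

A row missing from this selection but present on the gauge is either capped or was attempted within the last five minutes. NULL-payload rows are not filtered out here; per the steps above they are selected and then skipped at replay time.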

Force a reconcile run now (don't wait for the 10-min tick)

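# In-cluster (staging/K8s): create a one-off Job from the CronJob.
JOB="reconcile-provisioning-manual-$(date +%s)"
kubectl create job -n customer-app --from=cronjob/reconcile-provisioning "$JOB"
# Wait for it to finish, then read the summary from this Job's own pod
# (a label selector could also match pods from earlier scheduled runs).
kubectl wait -n customer-app --for=condition=complete "job/$JOB" --timeout=120s
kubectl logs -n customer-app "job/$JOB" --tail=50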

The endpoint always returns 200 with a JSON summary; read it:

{ "event": "provisioning.reconcile.run", "scanned": 3, "redriven": 2,
  "still_failing": 0, "skipped_no_payload": 1, "stuck_total": 1 }

redriven > 0 → hand-offs replayed; wait for the callback to flip them.
skipped_no_payload > 0 → NULL-payload rows exist → Section 6.
still_failing > 0 → the orchestrator/transport is still down → Section 5/7.
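The three rules above translate directly into a small helper for scripted checks. This is a sketch: the `ReconcileSummary` field names are taken from the sample JSON, while the follow-up labels are hypothetical.

```typescript
// Field names match the sample summary shown above.
interface ReconcileSummary {
  scanned: number;
  redriven: number;
  still_failing: number;
  skipped_no_payload: number;
  stuck_total: number;
}

// Maps a run summary to the follow-ups this runbook prescribes.
function followUps(summary: ReconcileSummary): string[] {
  const steps: string[] = [];
  if (summary.redriven > 0) steps.push("wait_for_callback");
  if (summary.skipped_no_payload > 0) steps.push("null_payload_section_6");
  if (summary.still_failing > 0) steps.push("transport_section_5_or_7");
  return steps;
}
```

For the sample summary above this yields `wait_for_callback` and `null_payload_section_6`, since `still_failing` is 0.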

5. Job capped at `attempts >= 8` (transport was down)

When the transport (trigger URL/secret or orchestrator) was down through 8 attempts, the reconciler stops re-driving. Fix the transport first, then re-arm the job.

Confirm transport health: verify PROVISIONING_TRIGGER_URL / PROVISIONING_TRIGGER_SECRET are set in the customer-app environment and the orchestrator's /trigger/onboarding answers. If they were unset, the row's last_error will read trigger_not_configured.
Re-arm the capped job so the next reconcile picks it up (reset attempts; keep the payload). Verify the row first, then update by id:

-- Only do this AFTER confirming the transport is healthy again.
UPDATE provisioning_jobs
SET attempts = 0,
    status = 'enqueued',
    last_attempt_at = NULL,
    updated_at = now()
WHERE id = '<job-uuid>'
  AND status <> 'provisioned';   -- never touch a provisioned row

Force a reconcile (Section 4) and watch for the callback to flip it provisioned.

Do not reset attempts before the transport is fixed — you'll just burn the retry budget again and re-trip the alert.

6. Hand-link a stuck job (callback never landed) & NULL-payload rows

Use this when the orchestrator did provision a real dpl_ tenant but the write-back callback never reached the app, or for a NULL-payload row the reconciler refuses to replay.

6a. Preferred: replay the orchestrator callback (idempotent)

linkProvisionedAccount() is idempotent and is the only code path that correctly links deployment_id + cell_id + sets tokens_quota (positive only) + flips the account to active. Re-POST the same callback the orchestrator would have sent:

# Keep BODY on one line: the signature must cover the exact bytes that are sent.
BODY='{"stripe_session_id":"cs_live_...","status":"provisioned","deployment_id":"dpl_...","cell_id":"cell-...","tokens_quota":1100000}'
SIG=$(printf '%s' "$BODY" | openssl dgst -sha256 -hmac "$PROVISIONING_CALLBACK_SECRET" | awk '{print $2}')
curl -sS -X POST https://app.tlsstress.art/api/internal/provisioning/callback \
  -H "Content-Type: application/json" \
  -H "X-Provisioning-Signature: $SIG" \
  --data-binary "$BODY"

200 {"ok":true,"status":"provisioned"} → linked. The job is provisioned, the account is active, dpl_/cell_id are written. Done.
404 {"error":"no provisioning job for stripe_session_id"} → the webhook never recorded a row → escalate (Section 7), do not invent one.
tokens_quota must be the positive quota for the package. linkProvisionedAccount treats 0 as "leave existing quota" (set at checkout by applyTierFromStripe), so passing 0 will not zero a paid account — but pass the real value when known.

Use the dedicated PROVISIONING_CALLBACK_SECRET if set; otherwise the callback also accepts the shared PROVISIONING_TRIGGER_SECRET.
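When the replay is scripted rather than typed into a shell, the same signature can be computed with Node's built-in `crypto` module. This is a sketch that mirrors the shell snippet above (hex-encoded HMAC-SHA256 over the raw body); `signCallbackBody` and `pickCallbackSecret` are hypothetical helper names, not functions from the app.

```typescript
import { createHmac } from "node:crypto";

// Hex HMAC-SHA256 of the exact body string, matching the openssl pipeline above.
function signCallbackBody(body: string, secret: string): string {
  return createHmac("sha256", secret).update(body, "utf8").digest("hex");
}

// Prefer the dedicated callback secret; fall back to the shared trigger secret.
function pickCallbackSecret(env: Record<string, string | undefined>): string {
  const secret = env["PROVISIONING_CALLBACK_SECRET"] || env["PROVISIONING_TRIGGER_SECRET"];
  if (!secret) {
    throw new Error("neither PROVISIONING_CALLBACK_SECRET nor PROVISIONING_TRIGGER_SECRET is set");
  }
  return secret;
}
```

Sign the exact string that goes on the wire: calling `signCallbackBody(body, pickCallbackSecret(process.env))` and then re-serialising the JSON before sending would change the bytes and produce a 401.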

6b. NULL-payload rows (pre-migration 0039)

A row with trigger_payload IS NULL cannot be faithfully replayed by the reconciler (it skips it and logs provisioning.reconcile.skip_no_payload). To resolve:

Recover the package/quota from the Stripe session (stripe checkout sessions retrieve cs_live_...) or from customer_accounts (the tier set at checkout).
If the tenant was already provisioned, hand-link via 6a with the recovered deployment_id / cell_id / tokens_quota.
If it was not provisioned, trigger the orchestrator directly with the recovered OnboardingInput, then let the callback land (or hand-link).

Do not backfill trigger_payload with a guess and let the reconciler replay — a wrong package_slug would mint the wrong token quota. This is exactly why the reconciler skips NULL rows.

7. Escalation

Escalate if any of the following hold:

A callback replay (6a) returns 404 instead of 200 → the webhook never recorded the job (signup/billing path bug, not a transport flake).
The orchestrator cannot confirm whether a dpl_ was minted (ambiguous saga state).
still_failing stays > 0 after the transport is confirmed healthy.
More than a handful of jobs are stuck simultaneously (systemic — trigger misconfig or orchestrator outage, not a one-off).

Steps:

Page the billing / customer-app on-call (severity: critical — paid customer is stranded; treat as revenue-impacting).
Capture evidence: the row from §3.3, the reconcile summary JSON (§4), and the orchestrator/Temporal OnboardingV1 workflow state for that stripe_session_id.
If the orchestrator is down, open a P1 against the provisioning orchestrator and hold the rows (they stay visible on the gauge — that is by design).

8. Verify resolution

-- The job should be provisioned with real coordinates:
SELECT status, deployment_id, cell_id, tokens_quota
FROM provisioning_jobs WHERE stripe_session_id = 'cs_live_...';

-- The account should be active and linked to a real dpl_:
SELECT status, deployment_id, tokens_quota
FROM customer_accounts WHERE id = '<account_id>';

Then confirm tlsstress_provisioning_stuck returns to 0 via GET /api/metrics (or re-run the §3.2 query). The alert clears on its own once the gauge hits 0.
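The two verification queries can be folded into a single check that lists whatever is still wrong. This is a sketch under assumptions: the `JobCheck` and `AccountCheck` shapes are hypothetical projections of the query results, and the expectation that the account's `deployment_id` equals the job's is inferred from the linking behaviour described in Section 6, not confirmed by the schema.

```typescript
// Hypothetical projections of the two verification queries above.
interface JobCheck {
  status: string;
  deployment_id: string | null;
  cell_id: string | null;
}

interface AccountCheck {
  status: string;
  deployment_id: string | null;
}

// Returns a list of problems; an empty list means the job looks resolved.
function resolutionProblems(job: JobCheck, account: AccountCheck): string[] {
  const problems: string[] = [];
  if (job.status !== "provisioned") problems.push("job is not provisioned");
  if (job.deployment_id === null || job.deployment_id.indexOf("dpl_") !== 0) {
    problems.push("job has no real dpl_ deployment_id");
  }
  if (job.cell_id === null) problems.push("job has no cell_id");
  if (account.status !== "active") problems.push("account is not active");
  // Assumption: linking copies the job's deployment_id onto the account.
  if (account.deployment_id !== job.deployment_id) {
    problems.push("account deployment_id does not match the job");
  }
  return problems;
}
```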

Reconciler: src/app/api/cron/reconcile-provisioning/route.ts
Queries: src/lib/db/queries.ts (getStaleProvisioningJobs, countStuckProvisioningJobs, linkProvisionedAccount, recordProvisioningJob, markProvisioningJobFailed)
Write-back callback: src/app/api/internal/provisioning/callback/route.ts
Enqueue hand-off: src/lib/provisioning/enqueue.ts
Schema: src/lib/db/schema.ts (provisioning_jobs)
Alert rule: k8s/74-token-ledger-prometheus-rules.yaml (ProvisioningJobsStuck)
CronJob: pkg/octopus/customer-app/infra/cron/reconcile-provisioning-cronjob.yaml
