Runbook: Provisioning Stuck (ProvisioningJobsStuck)
Referenced by the ProvisioningJobsStuck Prometheus alert (k8s/74-token-ledger-prometheus-rules.yaml). Severity: critical.
TL;DR
A paying customer has not received their tenant. A settled Stripe checkout was
recorded in provisioning_jobs, but the Go orchestrator's HMAC write-back callback
never flipped that row to provisioned. The reconcile-provisioning cron normally
re-drives such rows automatically; this alert means at least one row is stuck past
its retry budget and needs an operator.
| Field | Value |
|---|---|
| Alert | ProvisioningJobsStuck |
| Expr | tlsstress_provisioning_stuck > 0 for 30m |
| Gauge source | countStuckProvisioningJobs() via GET /api/metrics |
| Owner | Customer-app / billing on-call |
| Customer impact | Paid customer is stranded: no dpl_ deployment, account not active |
| Self-healing | reconcile-provisioning cron (every 10 min), only up to MAX_ATTEMPTS |
1. What the alert means
The provisioning hand-off is a durable, four-actor flow backed by a reconciler watchdog. Each settled checkout
produces exactly one provisioning_jobs row (unique on stripe_session_id).
| Actor | File | Responsibility |
|---|---|---|
| Stripe webhook | (billing webhook handler) | Persists the hand-off via recordProvisioningJob() and calls enqueueOnboarding() |
| Enqueue hand-off | src/lib/provisioning/enqueue.ts | HMAC-POSTs OnboardingInput to the orchestrator's /trigger/onboarding |
| Go orchestrator | Temporal OnboardingV1 saga | Mints the real dpl_ deployment, provisions the tenant |
| Write-back callback | src/app/api/internal/provisioning/callback/route.ts | linkProvisionedAccount() → flips job provisioned + account active |
| Reconciler (watchdog) | src/app/api/cron/reconcile-provisioning/route.ts | Re-drives stuck rows on a 10-min cron |
A job is STUCK (counted by the gauge) when it is non-terminal
(enqueued / provisioning / failed) AND one of:
- no progress for > 15 minutes (updated_at < now() - interval '15 minutes'), or
- already at/over the retry cap (attempts >= 8).
See countStuckProvisioningJobs() in src/lib/db/queries.ts. The 15-min / 8-attempt
floor exists so a normal in-flight job (a few minutes old) does not trip the gauge.
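The stuck rule above can be sketched as a small predicate. This is only an illustration of the rule as described in this runbook, not the body of countStuckProvisioningJobs(); the ProvisioningJob shape and the isStuck name are hypothetical.

```typescript
// Hypothetical row shape: only the columns the stuck rule reads.
type JobStatus = "enqueued" | "provisioning" | "failed" | "provisioned";

interface ProvisioningJob {
  status: JobStatus;
  attempts: number;
  updatedAt: Date;
}

const STUCK_AFTER_MS = 15 * 60 * 1000; // 15-minute no-progress floor
const MAX_ATTEMPTS = 8; // retry cap stated in this runbook

// True when the job would be counted by the gauge under the rule above.
function isStuck(job: ProvisioningJob, now: Date): boolean {
  if (job.status === "provisioned") return false; // finished jobs never count
  const noProgress = now.getTime() - job.updatedAt.getTime() > STUCK_AFTER_MS;
  return noProgress || job.attempts >= MAX_ATTEMPTS;
}

const now = new Date("2024-01-01T12:00:00Z");
const minutesAgo = (m: number): Date => new Date(now.getTime() - m * 60000);

// A normal in-flight job, a stalled job, and a capped job.
console.log(isStuck({ status: "enqueued", attempts: 1, updatedAt: minutesAgo(3) }, now));
console.log(isStuck({ status: "provisioning", attempts: 2, updatedAt: minutesAgo(20) }, now));
console.log(isStuck({ status: "failed", attempts: 8, updatedAt: minutesAgo(1) }, now));
```

The three calls cover the cases the floor is meant to separate: a job a few minutes old is not counted, while a job with no progress for 20 minutes or one at the retry cap is.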
Why a hand-off gets stuck
- Trigger unreachable: PROVISIONING_TRIGGER_URL / PROVISIONING_TRIGGER_SECRET unset or the orchestrator endpoint down → enqueueOnboarding() returns enqueued:false reason:"trigger_not_configured" (or fetch_error / http_5xx).
- Orchestrator failed mid-saga: Temporal exhausted its own retries; callback reported status:"failed" (markProvisioningJobFailed()).
- Callback never landed: orchestrator provisioned but its write-back POST never reached /api/internal/provisioning/callback (network, bad signature → 401, or the callback secret unset → 503).
- NULL payload: a pre-migration-0039 row has trigger_payload = NULL; the reconciler refuses to guess the package/quota and skips it for manual handling.
2. Operator decision flow
flowchart TD
A[ProvisioningJobsStuck fired] --> B[Query provisioning_jobs<br/>WHERE status != 'provisioned']
B --> C{Any rows with<br/>trigger_payload IS NULL?}
C -- Yes --> D[NULL-payload path<br/>see Section 6 — hand-link manually]
C -- No --> E{attempts >= 8?}
E -- No, <8 --> F{last_attempt_at older<br/>than 5 min?}
F -- No --> G[Reconciler will pick it up<br/>next 10-min tick — WAIT]
F -- Yes --> H[Force a reconcile run<br/>see Section 4]
E -- Yes, capped --> I{Is the orchestrator<br/>actually reachable now?}
I -- No --> J[Fix transport first:<br/>TRIGGER_URL/SECRET, orchestrator health]
J --> K[Reset attempts, then<br/>force reconcile — Section 5]
I -- Yes --> L{Did the tenant<br/>actually provision?<br/>check orchestrator/Temporal}
L -- Yes, dpl_ exists --> M[Hand-link the job<br/>replay the callback — Section 6]
L -- No --> K
M --> N[Verify account active + dpl_ set]
H --> N
K --> N
N --> O{Resolved?}
O -- Yes --> P[Confirm gauge → 0, close alert]
O -- No --> Q[Escalate — Section 7]
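The flowchart above can also be read as a small decision function. The sketch below is an illustration only: the TriageFacts shape and the nextAction name are hypothetical, and treating a NULL last_attempt_at as "older than 5 min" is an assumption borrowed from the reconciler's own selection rule.

```typescript
// Facts an operator gathers for one stuck row. Field names are illustrative.
interface TriageFacts {
  payloadMissing: boolean; // trigger_payload IS NULL
  attempts: number;
  lastAttemptAt: Date | null;
  orchestratorReachable: boolean; // checked by the operator
  tenantProvisioned: boolean; // a dpl_ exists per orchestrator/Temporal
}

const MAX_ATTEMPTS = 8;
const STALE_AFTER_MS = 5 * 60 * 1000;

// Walks the same branches as the flowchart, top to bottom.
function nextAction(f: TriageFacts, now: Date): string {
  if (f.payloadMissing) return "Section 6: NULL-payload path, hand-link manually";
  if (f.attempts < MAX_ATTEMPTS) {
    // Assumption: a row that was never attempted counts as stale.
    const stale =
      f.lastAttemptAt === null ||
      now.getTime() - f.lastAttemptAt.getTime() > STALE_AFTER_MS;
    return stale
      ? "Section 4: force a reconcile run"
      : "Wait: the reconciler picks it up on the next 10-min tick";
  }
  if (!f.orchestratorReachable) {
    return "Section 5: fix transport, then reset attempts and force reconcile";
  }
  return f.tenantProvisioned
    ? "Section 6: hand-link by replaying the callback"
    : "Section 5: reset attempts and force reconcile";
}

const checkedAt = new Date("2024-01-01T12:00:00Z");
console.log(
  nextAction(
    {
      payloadMissing: false,
      attempts: 8,
      lastAttemptAt: new Date("2024-01-01T11:00:00Z"),
      orchestratorReachable: true,
      tenantProvisioned: true,
    },
    checkedAt,
  ),
);
```

Every branch still ends in the verification step of Section 8, as in the flowchart.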
3. Inspect provisioning_jobs
Read-only first. Connect via the in-VPC admin path (App Runner egress is behind Cloudflare). All queries hit the octopus schema on the customer RDS.
3.1 List every non-provisioned job, oldest stall first
SELECT
id,
account_id,
stripe_session_id,
status,
attempts,
last_error,
deployment_id,
cell_id,
(trigger_payload IS NULL) AS payload_missing,
last_attempt_at,
updated_at,
created_at
FROM provisioning_jobs
WHERE status <> 'provisioned'
ORDER BY updated_at ASC;
3.2 Reproduce the gauge (exactly what the alert counts)
SELECT count(*) AS stuck
FROM provisioning_jobs
WHERE status IN ('enqueued', 'provisioning', 'failed')
AND (updated_at < now() - interval '15 minutes' OR attempts >= 8);
This must match tlsstress_provisioning_stuck from GET /api/metrics.
3.3 Triage a single session
SELECT * FROM provisioning_jobs
WHERE stripe_session_id = 'cs_live_...';
Read the columns:
| Column | What it tells you |
|---|---|
| status | enqueued (handed off, awaiting callback) · provisioning (orchestrator working) · failed (terminal-reported error) · provisioned (DONE) |
| attempts | Re-drive count. >= 8 = past MAX_ATTEMPTS, reconciler has given up |
| last_error | trigger_not_configured · http_5xx · fetch_error · or the orchestrator's error string |
| deployment_id | NULL until callback writes the real dpl_ |
| trigger_payload | NULL ⇒ reconciler cannot replay (Section 6) |
| last_attempt_at | Drives the 5-min staleness check the reconciler uses |
4. How the reconciler re-drives (and how to force a run)
The reconcile-provisioning cron runs every 10 minutes (K8s CronJob
infra/cron/reconcile-provisioning-cronjob.yaml; prod = EventBridge → App Runner).
Each tick POSTs /api/cron/reconcile-provisioning with x-cron-secret: $CRON_SECRET.
Per run it:
- Selects stale jobs via getStaleProvisioningJobs(): status IN (enqueued, failed, provisioning) AND attempts < 8 AND (last_attempt_at IS NULL OR last_attempt_at < now-5min), oldest first, batch ≤ 25.
- For each row with a non-NULL trigger_payload, replays the exact original payload through enqueueOnboarding() (so package_slug → correct token quota, never a guess).
- Bumps attempts and records the outcome via recordProvisioningJob() (which never downgrades an already-provisioned row).
- A job that later succeeds is flipped to provisioned by the callback and drops out of the stale scan. A job past MAX_ATTEMPTS (8) stops being retried and stays visible on the gauge.
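The selection and skip rules above can be sketched in memory as follows. This is not the actual getStaleProvisioningJobs() query: the JobRow shape and the selectStale / planRun names are hypothetical, and ordering "oldest first" by created_at is an assumption, since the runbook does not name the sort column.

```typescript
// Hypothetical in-memory stand-in for a provisioning_jobs row.
interface JobRow {
  id: string;
  status: "enqueued" | "provisioning" | "failed" | "provisioned";
  attempts: number;
  lastAttemptAt: Date | null;
  createdAt: Date;
  hasPayload: boolean; // trigger_payload IS NOT NULL
}

const MAX_ATTEMPTS = 8;
const STALE_AFTER_MS = 5 * 60 * 1000;
const BATCH_LIMIT = 25;

// Non-terminal, under the cap, not attempted in the last 5 minutes.
function selectStale(rows: JobRow[], now: Date): JobRow[] {
  const cutoff = now.getTime() - STALE_AFTER_MS;
  return rows
    .filter((r) => r.status !== "provisioned")
    .filter((r) => r.attempts < MAX_ATTEMPTS)
    .filter((r) => r.lastAttemptAt === null || r.lastAttemptAt.getTime() < cutoff)
    .sort((a, b) => a.createdAt.getTime() - b.createdAt.getTime()) // assumed sort key
    .slice(0, BATCH_LIMIT);
}

// Splits one batch into rows to replay and rows skipped for a NULL payload.
function planRun(rows: JobRow[], now: Date): { replay: string[]; skippedNoPayload: string[] } {
  const stale = selectStale(rows, now);
  return {
    replay: stale.filter((r) => r.hasPayload).map((r) => r.id),
    skippedNoPayload: stale.filter((r) => !r.hasPayload).map((r) => r.id),
  };
}

const now = new Date("2024-01-01T12:00:00Z");
const minutesAgo = (m: number): Date => new Date(now.getTime() - m * 60000);

const rows: JobRow[] = [
  { id: "a", status: "enqueued", attempts: 2, lastAttemptAt: minutesAgo(10), createdAt: minutesAgo(60), hasPayload: true },
  { id: "b", status: "failed", attempts: 8, lastAttemptAt: minutesAgo(30), createdAt: minutesAgo(120), hasPayload: true },
  { id: "c", status: "enqueued", attempts: 0, lastAttemptAt: null, createdAt: minutesAgo(90), hasPayload: false },
  { id: "d", status: "provisioning", attempts: 1, lastAttemptAt: minutesAgo(2), createdAt: minutesAgo(5), hasPayload: true },
];

console.log(planRun(rows, now)); // { replay: [ 'a' ], skippedNoPayload: [ 'c' ] }
```

Row b is excluded because it is at the cap and row d because it was attempted two minutes ago, which is why capped rows need the manual re-arm in Section 5.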
Force a reconcile run now (don't wait for the 10-min tick)
# In-cluster (staging/K8s) — create a one-off Job from the CronJob:
kubectl create job -n customer-app --from=cronjob/reconcile-provisioning \
reconcile-provisioning-manual-$(date +%s)
kubectl logs -n customer-app -l app.kubernetes.io/name=reconcile-provisioning --tail=50
The endpoint always returns 200 with a JSON summary; read it:
{ "event": "provisioning.reconcile.run", "scanned": 3, "redriven": 2,
"still_failing": 0, "skipped_no_payload": 1, "stuck_total": 1 }
- redriven > 0 → hand-offs replayed; wait for the callback to flip them.
- skipped_no_payload > 0 → NULL-payload rows exist → Section 6.
- still_failing > 0 → the orchestrator/transport is still down → Section 5/7.
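For scripted follow-up, the three rules above can be applied to the summary JSON mechanically. The sketch below assumes the summary has exactly the fields shown in the example response; the ReconcileSummary type and the nextSteps helper are hypothetical names, not part of the app.

```typescript
// Field names copied from the example summary in this section.
interface ReconcileSummary {
  event: string;
  scanned: number;
  redriven: number;
  still_failing: number;
  skipped_no_payload: number;
  stuck_total: number;
}

// Maps a summary to the follow-up hints listed above.
function nextSteps(s: ReconcileSummary): string[] {
  const steps: string[] = [];
  if (s.redriven > 0) steps.push("hand-offs replayed: wait for the callback to flip them");
  if (s.skipped_no_payload > 0) steps.push("NULL-payload rows exist: see Section 6");
  if (s.still_failing > 0) steps.push("orchestrator/transport still down: see Section 5/7");
  return steps;
}

const example: ReconcileSummary = JSON.parse(
  '{ "event": "provisioning.reconcile.run", "scanned": 3, "redriven": 2, ' +
    '"still_failing": 0, "skipped_no_payload": 1, "stuck_total": 1 }',
);

for (const step of nextSteps(example)) console.log(step);
```

For the example response this prints the first two hints, since still_failing is 0.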
5. Job capped at attempts >= 8 (transport was down)
When the transport (trigger URL/secret or orchestrator) was down through 8 attempts, the reconciler stops re-driving. Fix the transport first, then re-arm the job.
- Confirm transport health: verify PROVISIONING_TRIGGER_URL / PROVISIONING_TRIGGER_SECRET are set in the customer-app environment and the orchestrator's /trigger/onboarding answers. If they were unset, the row's last_error will read trigger_not_configured.
- Re-arm the capped job so the next reconcile picks it up (reset attempts; keep the payload). Verify the row first, then update by id:
-- Only do this AFTER confirming the transport is healthy again.
UPDATE provisioning_jobs
SET attempts = 0,
status = 'enqueued',
last_attempt_at = NULL,
updated_at = now()
WHERE id = '<job-uuid>'
AND status <> 'provisioned'; -- never touch a provisioned row
- Force a reconcile (Section 4) and watch for the callback to flip it to provisioned.
Do not reset attempts before the transport is fixed — you'll just burn the retry budget again and re-trip the alert.
6. Hand-link a stuck job (callback never landed) & NULL-payload rows
Use this when the orchestrator did provision a real dpl_ tenant but the
write-back callback never reached the app, or for a NULL-payload row the reconciler
refuses to replay.
6a. Preferred: replay the orchestrator callback (idempotent)
linkProvisionedAccount() is idempotent and is the only code path that correctly
links deployment_id + cell_id + sets tokens_quota (positive only) + flips the
account to active. Re-POST the same callback the orchestrator would have sent:
BODY='{"stripe_session_id":"cs_live_...","status":"provisioned",
"deployment_id":"dpl_...","cell_id":"cell-...","tokens_quota":1100000}'
SIG=$(printf '%s' "$BODY" | openssl dgst -sha256 -hmac "$PROVISIONING_CALLBACK_SECRET" | awk '{print $2}')
curl -sS -X POST https://app.tlsstress.art/api/internal/provisioning/callback \
-H "Content-Type: application/json" \
-H "X-Provisioning-Signature: $SIG" \
--data "$BODY"
- 200 {"ok":true,"status":"provisioned"} → linked. The job is provisioned, the account is active, dpl_ / cell_id are written. Done.
- 404 {"error":"no provisioning job for stripe_session_id"} → the webhook never recorded a row → escalate (Section 7), do not invent one.
- tokens_quota must be the positive quota for the package. linkProvisionedAccount treats 0 as "leave existing quota" (set at checkout by applyTierFromStripe), so passing 0 will not zero a paid account, but pass the real value when known.

Use the dedicated PROVISIONING_CALLBACK_SECRET if set; otherwise the callback also accepts the shared PROVISIONING_TRIGGER_SECRET.
6b. NULL-payload rows (pre-migration 0039)
A row with trigger_payload IS NULL cannot be faithfully replayed by the reconciler
(it skips it and logs provisioning.reconcile.skip_no_payload). To resolve:
- Recover the package/quota from the Stripe session (stripe checkout sessions retrieve cs_live_...) or from customer_accounts (the tier set at checkout).
- If the tenant was already provisioned, hand-link via 6a with the recovered deployment_id / cell_id / tokens_quota.
- If it was not provisioned, trigger the orchestrator directly with the recovered OnboardingInput, then let the callback land (or hand-link).

Do not backfill trigger_payload with a guess and let the reconciler replay: a wrong package_slug would mint the wrong token quota. This is exactly why the reconciler skips NULL rows.
7. Escalation
Escalate if any of the following hold:
- A callback replay (6a) returns 404 instead of the expected 200 → the webhook never recorded the job (signup/billing path bug, not a transport flake).
- The orchestrator cannot confirm whether a dpl_ was minted (ambiguous saga state).
- still_failing stays > 0 after the transport is confirmed healthy.
- More than a handful of jobs are stuck simultaneously (systemic: trigger misconfig or orchestrator outage, not a one-off).
Steps:
- Page the billing / customer-app on-call (severity: critical; a paid customer is stranded, so treat it as revenue-impacting).
- Capture evidence: the row from §3.3, the reconcile summary JSON (§4), and the orchestrator/Temporal OnboardingV1 workflow state for that stripe_session_id.
- If the orchestrator is down, open a P1 against the provisioning orchestrator and hold the rows (they stay visible on the gauge; that is by design).
8. Verify resolution
-- The job should be provisioned with real coordinates:
SELECT status, deployment_id, cell_id, tokens_quota
FROM provisioning_jobs WHERE stripe_session_id = 'cs_live_...';
-- The account should be active and linked to a real dpl_:
SELECT status, deployment_id, tokens_quota
FROM customer_accounts WHERE id = '<account_id>';
Then confirm tlsstress_provisioning_stuck returns to 0 via GET /api/metrics
(or re-run the §3.2 query). The alert clears on its own once the gauge hits 0.
Related
- Reconciler: src/app/api/cron/reconcile-provisioning/route.ts
- Queries: src/lib/db/queries.ts (getStaleProvisioningJobs, countStuckProvisioningJobs, linkProvisionedAccount, recordProvisioningJob, markProvisioningJobFailed)
- Write-back callback: src/app/api/internal/provisioning/callback/route.ts
- Enqueue hand-off: src/lib/provisioning/enqueue.ts
- Schema: src/lib/db/schema.ts (provisioning_jobs)
- Alert rule: k8s/74-token-ledger-prometheus-rules.yaml (ProvisioningJobsStuck)
- CronJob: pkg/octopus/customer-app/infra/cron/reconcile-provisioning-cronjob.yaml