Runbook — Provisioning Stuck (ProvisioningJobsStuck)

Referenced by the ProvisioningJobsStuck Prometheus alert (k8s/74-token-ledger-prometheus-rules.yaml). Severity: critical.

TL;DR

A paying customer has not received their tenant. A settled Stripe checkout was recorded in provisioning_jobs, but the Go orchestrator's HMAC write-back callback never flipped that row to provisioned. The reconcile-provisioning cron normally re-drives such rows automatically; this alert means at least one row is stuck past its retry budget and needs an operator.

Alert: ProvisioningJobsStuck
Expr: tlsstress_provisioning_stuck > 0 for 30m
Gauge source: countStuckProvisioningJobs() via GET /api/metrics
Owner: Customer-app / billing on-call
Customer impact: Paid customer is stranded (no dpl_ deployment, account not active)
Self-healing: reconcile-provisioning cron (every 10 min), only up to MAX_ATTEMPTS

1. What the alert means

The provisioning hand-off is a durable flow with four actors plus a watchdog (the reconciler). Each settled checkout produces exactly one provisioning_jobs row (unique on stripe_session_id).

| Actor | File | Responsibility |
| --- | --- | --- |
| Stripe webhook | (billing webhook handler) | Persists the hand-off via recordProvisioningJob() and calls enqueueOnboarding() |
| Enqueue hand-off | src/lib/provisioning/enqueue.ts | HMAC-POSTs OnboardingInput to the orchestrator's /trigger/onboarding |
| Go orchestrator | Temporal OnboardingV1 saga | Mints the real dpl_ deployment, provisions the tenant |
| Write-back callback | src/app/api/internal/provisioning/callback/route.ts | linkProvisionedAccount() → flips job provisioned + account active |
| Reconciler (watchdog) | src/app/api/cron/reconcile-provisioning/route.ts | Re-drives stuck rows on a 10-min cron |

A job is STUCK (counted by the gauge) when it is non-terminal (enqueued / provisioning / failed) AND one of:

  • no progress for > 15 minutes (updated_at < now() - interval '15 minutes'), or
  • already at/over the retry cap (attempts >= 8).

See countStuckProvisioningJobs() in src/lib/db/queries.ts. The 15-min / 8-attempt floor exists so a normal in-flight job (a few minutes old) does not trip the gauge.
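
The stuck rule above can be written as a small predicate, which helps when checking a single row by hand. This is a minimal sketch under stated assumptions, not the actual countStuckProvisioningJobs() code: the ProvisioningJob shape and the isStuck name are hypothetical, and only the thresholds (15 minutes, 8 attempts) come from this runbook.

```typescript
// Hypothetical row shape: only the columns the stuck rule reads.
type JobStatus = "enqueued" | "provisioning" | "failed" | "provisioned";

interface ProvisioningJob {
  status: JobStatus;
  attempts: number;
  updatedAt: Date;
}

const STUCK_AFTER_MS = 15 * 60 * 1000; // 15-minute no-progress window
const MAX_ATTEMPTS = 8; // retry cap quoted in this runbook

// True when the row would be counted by the gauge, per the rule above.
function isStuck(job: ProvisioningJob, now: Date): boolean {
  if (job.status === "provisioned") return false; // terminal rows never count
  const noProgress = now.getTime() - job.updatedAt.getTime() > STUCK_AFTER_MS;
  return noProgress || job.attempts >= MAX_ATTEMPTS;
}
```

A job that is three minutes old with one attempt is not stuck; the same job becomes stuck either once it has sat untouched for more than 15 minutes or once attempts reaches 8.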

Why a hand-off gets stuck

  • Trigger unreachable: PROVISIONING_TRIGGER_URL / PROVISIONING_TRIGGER_SECRET unset or the orchestrator endpoint down → enqueueOnboarding() returns enqueued:false reason:"trigger_not_configured" (or fetch_error / http_5xx).
  • Orchestrator failed mid-saga — Temporal exhausted its own retries; callback reported status:"failed" (markProvisioningJobFailed()).
  • Callback never landed — orchestrator provisioned but its write-back POST never reached /api/internal/provisioning/callback (network, bad signature → 401, or the callback secret unset → 503).
  • NULL payload — a pre-migration-0039 row has trigger_payload = NULL; the reconciler refuses to guess the package/quota and skips it for manual handling.
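
As a triage aid, the last_error value on a row can be mapped back to the causes listed above. The sketch below is illustrative only: the StuckCause labels and the classifyLastError name are hypothetical, and it assumes http_5xx in this runbook stands for any last_error beginning with "http_5".

```typescript
// Hypothetical labels for the causes described in the list above.
type StuckCause =
  | "trigger_not_configured"
  | "orchestrator_unreachable"
  | "orchestrator_reported_failure"
  | "no_error_recorded";

// Maps a provisioning_jobs.last_error value to a likely cause.
function classifyLastError(lastError: string | null): StuckCause {
  if (lastError === null || lastError === "") {
    // Nothing recorded: consistent with a callback that never landed.
    return "no_error_recorded";
  }
  if (lastError === "trigger_not_configured") return "trigger_not_configured";
  if (lastError === "fetch_error" || lastError.startsWith("http_5")) {
    return "orchestrator_unreachable";
  }
  // Anything else is treated as the orchestrator's own error string.
  return "orchestrator_reported_failure";
}
```

The mapping is a hint, not a verdict: a NULL-payload row is identified by trigger_payload, not by last_error, so check that column separately.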

2. Operator decision flow

flowchart TD
    A[ProvisioningJobsStuck fired] --> B[Query provisioning_jobs<br/>WHERE status != 'provisioned']
    B --> C{Any rows with<br/>trigger_payload IS NULL?}
    C -- Yes --> D[NULL-payload path<br/>see Section 6 — hand-link manually]
    C -- No --> E{attempts >= 8?}
    E -- No, &lt;8 --> F{last_attempt_at older<br/>than 5 min?}
    F -- No --> G[Reconciler will pick it up<br/>next 10-min tick — WAIT]
    F -- Yes --> H[Force a reconcile run<br/>see Section 4]
    E -- Yes, capped --> I{Is the orchestrator<br/>actually reachable now?}
    I -- No --> J[Fix transport first:<br/>TRIGGER_URL/SECRET, orchestrator health]
    J --> K[Reset attempts, then<br/>force reconcile — Section 5]
    I -- Yes --> L{Did the tenant<br/>actually provision?<br/>check orchestrator/Temporal}
    L -- Yes, dpl_ exists --> M[Hand-link the job<br/>replay the callback — Section 6]
    L -- No --> K
    M --> N[Verify account active + dpl_ set]
    H --> N
    K --> N
    N --> O{Resolved?}
    O -- Yes --> P[Confirm gauge → 0, close alert]
    O -- No --> Q[Escalate — Section 7]
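
The same decision flow can be read as a pure function, which some operators find easier to follow than the chart. This is a sketch under assumptions: TriageInput, Action and nextAction are hypothetical names, the two boolean findings are supplied by the operator after checking the orchestrator, and a NULL last_attempt_at is treated as "older than 5 min" to match the reconciler's own stale rule.

```typescript
// Hypothetical inputs: row columns plus two findings the operator supplies.
interface TriageInput {
  payloadMissing: boolean; // trigger_payload IS NULL
  attempts: number;
  lastAttemptAt: Date | null;
  orchestratorReachable: boolean; // operator finding
  tenantProvisioned: boolean; // operator finding: a dpl_ exists
}

type Action =
  | "hand_link_null_payload" // Section 6
  | "wait_for_next_tick"
  | "force_reconcile" // Section 4
  | "fix_transport_then_rearm" // Section 5
  | "replay_callback" // Section 6
  | "rearm_and_reconcile"; // Section 5

const MAX_ATTEMPTS = 8;
const FIVE_MIN_MS = 5 * 60 * 1000;

function nextAction(job: TriageInput, now: Date): Action {
  if (job.payloadMissing) return "hand_link_null_payload";
  if (job.attempts < MAX_ATTEMPTS) {
    const recentlyTried =
      job.lastAttemptAt !== null &&
      now.getTime() - job.lastAttemptAt.getTime() <= FIVE_MIN_MS;
    return recentlyTried ? "wait_for_next_tick" : "force_reconcile";
  }
  // Capped: transport health decides the next move.
  if (!job.orchestratorReachable) return "fix_transport_then_rearm";
  return job.tenantProvisioned ? "replay_callback" : "rearm_and_reconcile";
}
```

Every branch ends at the verification step in Section 8; escalation (Section 7) applies when that verification fails.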

3. Inspect provisioning_jobs

Read-only first. Connect via the in-VPC admin path (App Runner egress is behind Cloudflare). All queries hit the octopus schema on the customer RDS.

3.1 List every non-provisioned job, longest-stalled first

SELECT
  id,
  account_id,
  stripe_session_id,
  status,
  attempts,
  last_error,
  deployment_id,
  cell_id,
  (trigger_payload IS NULL) AS payload_missing,
  last_attempt_at,
  updated_at,
  created_at
FROM provisioning_jobs
WHERE status <> 'provisioned'
ORDER BY updated_at ASC;

3.2 Reproduce the gauge (exactly what the alert counts)

SELECT count(*) AS stuck
FROM provisioning_jobs
WHERE status IN ('enqueued', 'provisioning', 'failed')
  AND (updated_at < now() - interval '15 minutes' OR attempts >= 8);

This must match tlsstress_provisioning_stuck from GET /api/metrics.
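
To compare the SQL count with the exported gauge without eyeballing the metrics page, the value can be pulled out of the response text. The sketch below assumes GET /api/metrics serves the standard Prometheus text exposition format; readGauge is a hypothetical helper, and it handles only simple "name value" or "name{labels} value" lines.

```typescript
// Returns the first sample value for `name`, or null if the metric is absent.
// Assumes Prometheus text exposition format (one sample per line).
function readGauge(metricsText: string, name: string): number | null {
  for (const raw of metricsText.split("\n")) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue; // skip HELP/TYPE/comments
    const m = line.match(/^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)/);
    if (m !== null && m[1] === name) return Number(m[2]);
  }
  return null;
}
```

Called with the body of the metrics response and "tlsstress_provisioning_stuck", the result should equal the stuck count from the query above; a null result means the gauge is not being exported at all, which is worth flagging on its own.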

3.3 Triage a single session

SELECT * FROM provisioning_jobs
WHERE stripe_session_id = 'cs_live_...';

Read the columns:

| Column | What it tells you |
| --- | --- |
| status | enqueued (handed off, awaiting callback) · provisioning (orchestrator working) · failed (error reported; still re-driven until the retry cap) · provisioned (DONE) |
| attempts | Re-drive count. >= 8 = past MAX_ATTEMPTS, reconciler has given up |
| last_error | trigger_not_configured · http_5xx · fetch_error · or the orchestrator's error string |
| deployment_id | NULL until callback writes the real dpl_ |
| trigger_payload | NULL ⇒ reconciler cannot replay (Section 6) |
| last_attempt_at | Drives the 5-min staleness check the reconciler uses |

4. How the reconciler re-drives (and how to force a run)

The reconcile-provisioning cron runs every 10 minutes (K8s CronJob infra/cron/reconcile-provisioning-cronjob.yaml; prod = EventBridge → App Runner). Each tick POSTs /api/cron/reconcile-provisioning with x-cron-secret: $CRON_SECRET.

Per run it:

  1. Selects stale jobs via getStaleProvisioningJobs(): status IN (enqueued, failed, provisioning) AND attempts < 8 AND (last_attempt_at IS NULL OR last_attempt_at < now-5min), oldest first, batch ≤ 25.
  2. For each row with a non-NULL trigger_payload, replays the exact original payload through enqueueOnboarding() (so package_slug → correct token quota — never a guess).
  3. Bumps attempts and records the outcome via recordProvisioningJob() (which never downgrades an already-provisioned row).
  4. A job that later succeeds is flipped to provisioned by the callback and drops out of the stale scan. A job past MAX_ATTEMPTS (8) stops being retried and stays visible on the gauge.
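
The selection and skip logic in the steps above can be sketched in a few lines. This is an illustration, not the code in route.ts or getStaleProvisioningJobs(): the StaleCandidate shape and function names are hypothetical, and "oldest first" is assumed to mean ordering by created_at.

```typescript
type JobStatus = "enqueued" | "provisioning" | "failed" | "provisioned";

// Hypothetical row shape: only what the stale scan needs.
interface StaleCandidate {
  id: string;
  status: JobStatus;
  attempts: number;
  lastAttemptAt: Date | null;
  createdAt: Date;
  hasPayload: boolean; // trigger_payload IS NOT NULL
}

const MAX_ATTEMPTS = 8;
const RETRY_GAP_MS = 5 * 60 * 1000;
const BATCH_SIZE = 25;

// Step 1: non-terminal, under the cap, not attempted in the last 5 minutes.
function selectStale(jobs: StaleCandidate[], now: Date): StaleCandidate[] {
  return jobs
    .filter((j) => j.status !== "provisioned" && j.attempts < MAX_ATTEMPTS)
    .filter(
      (j) =>
        j.lastAttemptAt === null ||
        now.getTime() - j.lastAttemptAt.getTime() > RETRY_GAP_MS,
    )
    .sort((a, b) => a.createdAt.getTime() - b.createdAt.getTime()) // assumed ordering
    .slice(0, BATCH_SIZE);
}

// Step 2: only rows with a stored payload are replayed; the rest are skipped.
function planRun(jobs: StaleCandidate[], now: Date) {
  const stale = selectStale(jobs, now);
  return {
    replay: stale.filter((j) => j.hasPayload).map((j) => j.id),
    skipNoPayload: stale.filter((j) => !j.hasPayload).map((j) => j.id),
  };
}
```

Note the asymmetry with the gauge: a capped row (attempts >= 8) is excluded from this scan but still counted as stuck, which is why it stays visible until an operator intervenes.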

Force a reconcile run now (don't wait for the 10-min tick)

# In-cluster (staging/K8s) — create a one-off Job from the CronJob:
kubectl create job -n customer-app --from=cronjob/reconcile-provisioning \
  reconcile-provisioning-manual-$(date +%s)
kubectl logs -n customer-app -l app.kubernetes.io/name=reconcile-provisioning --tail=50

The endpoint always returns 200 with a JSON summary; read it:

{ "event": "provisioning.reconcile.run", "scanned": 3, "redriven": 2,
  "still_failing": 0, "skipped_no_payload": 1, "stuck_total": 1 }

  • redriven > 0 → hand-offs replayed; wait for the callback to flip them.
  • skipped_no_payload > 0 → NULL-payload rows exist → Section 6.
  • still_failing > 0 → the orchestrator/transport is still down → Section 5/7.
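
If the summary is being read by a script rather than a person, the three rules above reduce to a small mapping. A minimal sketch: the field names are taken from the sample JSON, while ReconcileSummary, interpretSummary and the step labels are hypothetical.

```typescript
// Field names follow the sample summary shown above.
interface ReconcileSummary {
  scanned: number;
  redriven: number;
  still_failing: number;
  skipped_no_payload: number;
  stuck_total: number;
}

// Returns the follow-up steps implied by one reconcile run.
function interpretSummary(s: ReconcileSummary): string[] {
  const steps: string[] = [];
  if (s.redriven > 0) steps.push("wait_for_callback");
  if (s.skipped_no_payload > 0) steps.push("hand_link_null_payload"); // Section 6
  if (s.still_failing > 0) steps.push("check_transport_or_escalate"); // Section 5/7
  return steps;
}
```

For the sample summary this yields two steps: wait for the callback on the two re-driven rows, and hand-link the one NULL-payload row.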

5. Job capped at attempts >= 8 (transport was down)

When the transport (trigger URL/secret or orchestrator) was down through 8 attempts, the reconciler stops re-driving. Fix the transport first, then re-arm the job.

  1. Confirm transport health: verify PROVISIONING_TRIGGER_URL / PROVISIONING_TRIGGER_SECRET are set in the customer-app environment and the orchestrator's /trigger/onboarding answers. If they were unset, the row's last_error will read trigger_not_configured.
  2. Re-arm the capped job so the next reconcile picks it up (reset attempts; keep the payload). Verify the row first, then update by id:
-- Only do this AFTER confirming the transport is healthy again.
UPDATE provisioning_jobs
SET attempts = 0,
    status = 'enqueued',
    last_attempt_at = NULL,
    updated_at = now()
WHERE id = '<job-uuid>'
  AND status <> 'provisioned';   -- never touch a provisioned row
  3. Force a reconcile (Section 4) and watch for the callback to flip it provisioned.

Do not reset attempts before the transport is fixed — you'll just burn the retry budget again and re-trip the alert.


6. Hand-link the job manually

Use this when the orchestrator did provision a real dpl_ tenant but the write-back callback never reached the app, or for a NULL-payload row the reconciler refuses to replay.

6a. Preferred: replay the orchestrator callback (idempotent)

linkProvisionedAccount() is idempotent and is the only code path that correctly links deployment_id + cell_id + sets tokens_quota (positive only) + flips the account to active. Re-POST the same callback the orchestrator would have sent:

# Keep the body on one line so the signed bytes and the sent bytes are identical.
BODY='{"stripe_session_id":"cs_live_...","status":"provisioned","deployment_id":"dpl_...","cell_id":"cell-...","tokens_quota":1100000}'
# Hex HMAC-SHA256 of the body; awk keeps only the digest from openssl's output.
SIG=$(printf '%s' "$BODY" | openssl dgst -sha256 -hmac "$PROVISIONING_CALLBACK_SECRET" | awk '{print $2}')
curl -sS -X POST https://app.tlsstress.art/api/internal/provisioning/callback \
  -H "Content-Type: application/json" \
  -H "X-Provisioning-Signature: $SIG" \
  --data-binary "$BODY"

  • 200 {"ok":true,"status":"provisioned"} → linked. The job is provisioned, the account is active, dpl_/cell_id are written. Done.
  • 404 {"error":"no provisioning job for stripe_session_id"} → the webhook never recorded a row → escalate (Section 7), do not invent one.
  • tokens_quota must be the positive quota for the package. linkProvisionedAccount treats 0 as "leave existing quota" (set at checkout by applyTierFromStripe), so passing 0 will not zero a paid account — but pass the real value when known.

Use the dedicated PROVISIONING_CALLBACK_SECRET if set; otherwise the callback also accepts the shared PROVISIONING_TRIGGER_SECRET.
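
For operators who prefer a script to the openssl pipeline, the same signature can be produced with Node's built-in crypto module. This is a sketch under assumptions: it takes the signature to be a hex HMAC-SHA256 of the raw body, as the shell example implies, and signCallbackBody and pickCallbackSecret are hypothetical helper names, not functions from the app.

```typescript
import { createHmac } from "node:crypto";

// Hex HMAC-SHA256 over the exact body string that will be POSTed.
function signCallbackBody(body: string, secret: string): string {
  return createHmac("sha256", secret).update(body, "utf8").digest("hex");
}

// Prefer the dedicated callback secret; fall back to the shared trigger secret.
function pickCallbackSecret(env: Record<string, string | undefined>): string {
  const secret = env.PROVISIONING_CALLBACK_SECRET || env.PROVISIONING_TRIGGER_SECRET;
  if (!secret) {
    throw new Error("no provisioning secret set; the callback cannot be signed");
  }
  return secret;
}
```

Sign the exact string that is sent: re-serializing the JSON between signing and posting changes the bytes and would most likely produce the 401 bad-signature response described in Section 1.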

6b. NULL-payload rows (pre-migration 0039)

A row with trigger_payload IS NULL cannot be faithfully replayed by the reconciler (it skips it and logs provisioning.reconcile.skip_no_payload). To resolve:

  1. Recover the package/quota from the Stripe session (stripe checkout sessions retrieve cs_live_...) or from customer_accounts (the tier set at checkout).
  2. If the tenant was already provisioned, hand-link via 6a with the recovered deployment_id / cell_id / tokens_quota.
  3. If it was not provisioned, trigger the orchestrator directly with the recovered OnboardingInput, then let the callback land (or hand-link).

Do not backfill trigger_payload with a guess and let the reconciler replay — a wrong package_slug would mint the wrong token quota. This is exactly why the reconciler skips NULL rows.


7. Escalation

Escalate if any of the following hold:

  • A callback replay (6a) returns 404 instead of 200 → the webhook never recorded the job (signup/billing path bug, not a transport flake).
  • The orchestrator cannot confirm whether a dpl_ was minted (ambiguous saga state).
  • still_failing stays > 0 after the transport is confirmed healthy.
  • More than a handful of jobs are stuck simultaneously (systemic — trigger misconfig or orchestrator outage, not a one-off).

Steps:

  1. Page the billing / customer-app on-call (severity: critical — paid customer is stranded; treat as revenue-impacting).
  2. Capture evidence: the row from §3.3, the reconcile summary JSON (§4), and the orchestrator/Temporal OnboardingV1 workflow state for that stripe_session_id.
  3. If the orchestrator is down, open a P1 against the provisioning orchestrator and hold the rows (they stay visible on the gauge — that is by design).

8. Verify resolution

-- The job should be provisioned with real coordinates:
SELECT status, deployment_id, cell_id, tokens_quota
FROM provisioning_jobs WHERE stripe_session_id = 'cs_live_...';

-- The account should be active and linked to a real dpl_:
SELECT status, deployment_id, tokens_quota
FROM customer_accounts WHERE id = '<account_id>';

Then confirm tlsstress_provisioning_stuck returns to 0 via GET /api/metrics (or re-run the §3.2 query). The alert clears on its own once the gauge hits 0.
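
The two verification queries can be checked mechanically once their rows are in hand. The sketch below is illustrative: the row interfaces mirror the selected columns, resolutionProblems is a hypothetical helper, and treating a "real" deployment as any id starting with dpl_ is an assumption.

```typescript
// Shapes mirror the columns selected by the two queries above.
interface JobRow {
  status: string;
  deployment_id: string | null;
  cell_id: string | null;
  tokens_quota: number | null;
}

interface AccountRow {
  status: string;
  deployment_id: string | null;
  tokens_quota: number | null;
}

// Returns a list of problems; an empty list means the hand-off looks resolved.
function resolutionProblems(job: JobRow, account: AccountRow): string[] {
  const problems: string[] = [];
  const isDpl = (id: string | null): boolean => id !== null && id.startsWith("dpl_");
  if (job.status !== "provisioned") problems.push(`job status is ${job.status}`);
  if (!isDpl(job.deployment_id)) problems.push("job has no dpl_ deployment_id");
  if (job.cell_id === null) problems.push("job has no cell_id");
  if (account.status !== "active") problems.push(`account status is ${account.status}`);
  if (!isDpl(account.deployment_id)) problems.push("account is not linked to a dpl_ deployment");
  return problems;
}
```

Any non-empty result means the job is not yet resolved, even if the gauge has momentarily dropped to 0.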


References

  • Reconciler: src/app/api/cron/reconcile-provisioning/route.ts
  • Queries: src/lib/db/queries.ts (getStaleProvisioningJobs, countStuckProvisioningJobs, linkProvisionedAccount, recordProvisioningJob, markProvisioningJobFailed)
  • Write-back callback: src/app/api/internal/provisioning/callback/route.ts
  • Enqueue hand-off: src/lib/provisioning/enqueue.ts
  • Schema: src/lib/db/schema.ts (provisioning_jobs)
  • Alert rule: k8s/74-token-ledger-prometheus-rules.yaml (ProvisioningJobsStuck)
  • CronJob: pkg/octopus/customer-app/infra/cron/reconcile-provisioning-cronjob.yaml