Pre-flight checks: validating the lab before runs


Scope status (post-Scope-Freeze 2026-05-10): see ARCHITECTURE.md for the canonical 37 MÓDULOs + 7 Test Kinds + DOM/CPOS/PIE-PA safety architecture. ADRs 0014, 0019-0025 cover post-Freeze additions.

The Pre-flight check engine is a read-only validator that runs against registered DUT API devices BEFORE the operator triggers a Test Plan run. It catches "lab not in expected state" conditions early so runs do not produce forensically worthless data.

The principle: garbage in, garbage out. If the NGFW has pending deploy changes, or the decrypt policy is off when the plan demands decrypt-on, the resulting p99 numbers tell you nothing useful. Pre-flight refuses to start the run instead of letting it produce misleading data.

How it fits in the operator workflow

Operator picks plan → Runs preflight → Reviews failures → Fixes lab state →
Triggers snapshot → Re-runs preflight → All green → Starts the actual test run

Pre-flight is manually invoked today (POST endpoint). In a future PR (PR-D), the Test Plan engine will gate run-start on a passing preflight automatically.

API

POST /api/test-runs/preflight

curl -X POST "https://dashboard.example/api/test-runs/preflight" \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "content-type: application/json" \
  -d '{"planIdentifier": "BASELINE-SLO-30M"}'

Response (success — 200):

{
  "planIdentifier": "BASELINE-SLO-30M",
  "ranAt": "2026-05-06T14:35:00.000Z",
  "checksRun": 8,
  "checksPassed": 8,
  "checksFailed": 0,
  "checksSkipped": 0,
  "pass": true,
  "checks": [
    {
      "checkId": "ngfw-deploy-clean",
      "description": "NGFW deploy state is DEPLOYED — no pending config changes",
      "deviceHostname": "ftd-1.lab.example.com",
      "vendor": "cisco-ftd",
      "pass": true,
      "detail": "state=DEPLOYED — no pending changes",
      "evidence": {
        "snapshotId": "...",
        "payloadSha256": "abc123...",
        "collectedAt": "2026-05-06T14:34:42.000Z"
      }
    },
    ...
  ],
  "summary": "8/8 checks passed — lab is ready for plan BASELINE-SLO-30M"
}

Response (any check failed — 422):

{
  "planIdentifier": "BASELINE-SLO-30M",
  "ranAt": "2026-05-06T14:35:00.000Z",
  "checksRun": 8,
  "checksPassed": 6,
  "checksFailed": 1,
  "checksSkipped": 1,
  "pass": false,
  "checks": [
    ...
    {
      "checkId": "ngfw-decrypt-state-matches-plan",
      "deviceHostname": "ftd-1.lab.example.com",
      "vendor": "cisco-ftd",
      "pass": false,
      "detail": "plan requires decrypt-on but no decrypt rules configured",
      "evidence": { ... }
    }
  ],
  "summary": "1 check(s) failed, 1 skipped (missing snapshots) — review details"
}
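
A client consuming these responses might type the body and sanity-check it before trusting the `pass` flag. The sketch below is illustrative only: the interface names and `sanityCheckResponse` are hypothetical, the field list is read off the two example responses above, and the rule that the three counters sum to `checksRun` is inferred from those examples rather than stated by the API.

```typescript
// Shapes inferred from the example responses; all type names are hypothetical.
interface PreflightEvidence {
  snapshotId: string;
  payloadSha256: string;
  collectedAt: string; // ISO 8601 timestamp
}

interface PreflightCheckResult {
  checkId: string;
  description?: string; // not present in the 422 example
  deviceHostname: string;
  vendor: string;
  pass: boolean;
  detail: string;
  evidence: PreflightEvidence | null;
}

interface PreflightResponse {
  planIdentifier: string;
  ranAt: string;
  checksRun: number;
  checksPassed: number;
  checksFailed: number;
  checksSkipped: number;
  pass: boolean;
  checks: PreflightCheckResult[];
  summary: string;
}

// Returns a list of inconsistencies; an empty list means nothing looked off.
function sanityCheckResponse(httpStatus: number, body: PreflightResponse): string[] {
  const problems: string[] = [];

  // Assumption: passed + failed + skipped equals checksRun, as in both examples.
  const counted = body.checksPassed + body.checksFailed + body.checksSkipped;
  if (counted !== body.checksRun) {
    problems.push(`counters sum to ${counted}, but checksRun is ${body.checksRun}`);
  }

  // Documented pairing: 200 on success, 422 when any check failed.
  // Assumption: pass=false always maps to 422.
  const expectedStatus = body.pass ? 200 : 422;
  if (httpStatus !== expectedStatus) {
    problems.push(`HTTP ${httpStatus} does not match pass=${body.pass}`);
  }
  return problems;
}
```

Called with the 422 example (8 run, 6 passed, 1 failed, 1 skipped, `pass: false`), this returns an empty list. A mismatch between status code and `pass` is worth surfacing to the operator instead of silently proceeding.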

Check catalog (current)

Check ID	Applies to	Endpoint label	What it validates
`ngfw-deploy-clean`	Cisco FTD	`deploy_status`	`state == 'DEPLOYED'` (no pending changes)
`ngfw-decrypt-state-matches-plan`	Cisco FTD + Fortinet	`decrypt_policy`	If plan demands `decrypt-on`, at least one rule configured; if `decrypt-off`, zero rules
`ntp-source-configured`	All vendors	`ntp_config`	Device has at least one NTP server configured
`ngfw-ha-state-sane`	Cisco FTD	`ha_status`	HA state is one of ACTIVE / STANDBY_READY / NEGOTIATION / JOIN
`snapshot-fresh`	All vendors	`system_info`	Latest snapshot is < 60 min old

The catalog is extensible: adding a new check means appending an entry to lib/preflight/checks.ts. No engine changes are needed.
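
To make one catalog row concrete, here is a minimal sketch of the logic behind `ngfw-ha-state-sane`. It is not the code in lib/preflight/checks.ts: the function name, the `CheckOutcome` type, and the detail strings are hypothetical, and how the HA state is extracted from the `ha_status` payload is left to the caller because the payload shape is not documented here. Only the list of accepted states comes from the table above.

```typescript
// Accepted states, taken from the catalog table.
const SANE_HA_STATES: readonly string[] = ['ACTIVE', 'STANDBY_READY', 'NEGOTIATION', 'JOIN'];

// Hypothetical result shape: a pass flag plus operator-readable detail.
interface CheckOutcome {
  pass: boolean;
  detail: string;
}

// haState is whatever the caller read from the ha_status snapshot payload.
function evaluateHaState(haState: string | undefined): CheckOutcome {
  if (haState === undefined) {
    return { pass: false, detail: 'ha_status payload has no HA state' };
  }
  if (SANE_HA_STATES.indexOf(haState) !== -1) {
    return { pass: true, detail: `state=${haState}` };
  }
  return {
    pass: false,
    detail: `state=${haState}, expected one of ${SANE_HA_STATES.join(' / ')}`,
  };
}
```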

What checks return

Each check returns one of three states:

Result	Meaning	Operator action
pass: true + evidence	Check evaluated against a snapshot, all good	None — proceed
pass: false + evidence	Check evaluated against a snapshot, FAIL	Fix the device state, trigger a manual snapshot, re-run
pass: false + evidence: null	Skipped — no snapshot exists for this device + label	Trigger a manual snapshot first

Evidence is the most important field: it cites the exact snapshot SHA-256 (payloadSha256) and collection timestamp (collectedAt). The same SHA-256 will appear in the Test Run Report annexes when PR-D ships, so the chain of custody stays unbroken.
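
The three states in the table can be told apart from `pass` and `evidence` alone. The following sketch shows one way a client might do that. `CheckState`, `classifyCheck`, and `OPERATOR_ACTION` are hypothetical names, and the action strings are copied from the table.

```typescript
type CheckState = 'pass' | 'fail' | 'skipped';

// Only the two fields that determine the state are needed here.
interface CheckResultLike {
  pass: boolean;
  evidence: object | null;
}

function classifyCheck(result: CheckResultLike): CheckState {
  if (result.pass) {
    return 'pass';
  }
  // pass: false with no evidence means no snapshot existed to evaluate.
  return result.evidence === null ? 'skipped' : 'fail';
}

// Operator action per state, as listed in the table.
const OPERATOR_ACTION: Record<CheckState, string> = {
  pass: 'None; proceed',
  fail: 'Fix the device state, trigger a manual snapshot, re-run',
  skipped: 'Trigger a manual snapshot first',
};
```

Separating "skipped" from "fail" matters because the remedies differ: a skipped check needs a snapshot, while a failed check needs a change on the device.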

Why pre-flight matters

Without pre-flight, this is a typical scenario:

Operator triggers BASELINE-SLO-30M expecting decrypt-on. Another engineer disabled the NGFW decrypt policy 30 minutes earlier. The 30-minute run completes and p99 looks suspiciously low. The operator notices something is off only when comparing with last week's run. The run is invalid: the engagement loses 30 minutes and the report loses credibility.

With pre-flight:

Operator runs preflight first. Check ngfw-decrypt-state-matches-plan fails: "plan requires decrypt-on but no decrypt rules configured". The operator opens the NGFW console (or triggers a write-op via the future API), enables decrypt, triggers a fresh snapshot, and re-runs preflight. With all checks green, the operator starts the run, and the 30 minutes are spent on a valid run.
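
The rule behind `ngfw-decrypt-state-matches-plan` in this scenario can be sketched as a small pure function. This is an assumption-laden illustration, not the shipped check: `DecryptMode`, `evaluateDecryptState`, and the idea of passing in a pre-computed rule count are hypothetical, and every detail string except the decrypt-on failure message quoted above is invented for the example.

```typescript
type DecryptMode = 'decrypt-on' | 'decrypt-off';

interface CheckOutcome {
  pass: boolean;
  detail: string;
}

// ruleCount: number of decrypt rules found in the decrypt_policy snapshot.
// How rules are counted per vendor is outside this sketch.
function evaluateDecryptState(mode: DecryptMode, ruleCount: number): CheckOutcome {
  if (mode === 'decrypt-on') {
    return ruleCount > 0
      ? { pass: true, detail: `decrypt-on with ${ruleCount} rule(s) configured` }
      : { pass: false, detail: 'plan requires decrypt-on but no decrypt rules configured' };
  }
  // decrypt-off: the plan expects zero decrypt rules.
  return ruleCount === 0
    ? { pass: true, detail: 'decrypt-off and no decrypt rules configured' }
    : { pass: false, detail: `plan requires decrypt-off but ${ruleCount} decrypt rule(s) configured` };
}
```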

Operator workflow: full sequence

# 1. Confirm devices are registered
curl -H "Authorization: Bearer $ADMIN_TOKEN" \
  "https://dashboard.example/api/admin/dut/devices"

# 2. Trigger fresh snapshots (so preflight sees current state)
for id in $(curl -s -H "Authorization: Bearer $ADMIN_TOKEN" \
  "https://dashboard.example/api/admin/dut/devices" \
  | jq -r '.devices[].id'); do
  curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
    "https://dashboard.example/api/admin/dut/devices/$id/snapshot"
done

# 3. Run pre-flight
curl -X POST "https://dashboard.example/api/test-runs/preflight" \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "content-type: application/json" \
  -d '{"planIdentifier": "BASELINE-SLO-30M"}'

# 4. If pass: true → proceed with run trigger (existing flow)
# 5. If pass: false → review checks[].detail, fix the lab, go to step 2

Adding a new check

The check catalog at lib/preflight/checks.ts is data-driven. To add a check:

{
  id: 'my-new-check',
  description: 'What this check validates in operator-friendly prose',
  endpointLabel: 'system_info', // which snapshot label the check needs
  appliesTo: (device, plan) => {
    // Return true if this check is relevant for this device + plan combo
    return device.vendor === 'cisco-ftd' && plan.someField === 'someValue';
  },
  evaluate: (snapshot, plan) => {
    // Inspect snapshot.payloadJson and decide pass/fail
    const fieldValue = (snapshot.payloadJson as any)?.someField;
    if (fieldValue === 'expected') {
      return { pass: true, detail: 'value matches expectation' };
    }
    return { pass: false, detail: `value=${fieldValue}, expected 'expected'` };
  }
}

No engine changes needed. The runner discovers the new check automatically.
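
For orientation, a data-driven runner over such entries might look like the sketch below. This is a guess at the general shape, not the actual engine: `Device`, `Plan`, `Snapshot`, `CheckResult`, and `runChecks` are hypothetical, and only the five catalog fields (`id`, `description`, `endpointLabel`, `appliesTo`, `evaluate`) come from the entry shown above.

```typescript
interface Device { hostname: string; vendor: string }
interface Plan { identifier: string; [field: string]: unknown }
interface Snapshot {
  id: string;
  deviceHostname: string;
  label: string;
  payloadJson: unknown;
  payloadSha256: string;
  collectedAt: string; // ISO 8601
}
interface Outcome { pass: boolean; detail: string }

// Mirrors the fields of the catalog entry shown above.
interface PreflightCheck {
  id: string;
  description: string;
  endpointLabel: string;
  appliesTo: (device: Device, plan: Plan) => boolean;
  evaluate: (snapshot: Snapshot, plan: Plan) => Outcome;
}

interface CheckResult extends Outcome {
  checkId: string;
  deviceHostname: string;
  vendor: string;
  evidence: { snapshotId: string; payloadSha256: string; collectedAt: string } | null;
}

function runChecks(
  catalog: PreflightCheck[],
  devices: Device[],
  plan: Plan,
  snapshots: Snapshot[],
): CheckResult[] {
  const results: CheckResult[] = [];
  for (const device of devices) {
    for (const check of catalog) {
      if (!check.appliesTo(device, plan)) continue;

      // Newest snapshot for this device + endpoint label, if any.
      const latest: Snapshot | undefined = snapshots
        .filter((s) => s.deviceHostname === device.hostname && s.label === check.endpointLabel)
        .sort((a, b) => Date.parse(b.collectedAt) - Date.parse(a.collectedAt))[0];

      const base = { checkId: check.id, deviceHostname: device.hostname, vendor: device.vendor };
      if (latest === undefined) {
        // Skipped: nothing to evaluate, so no evidence can be cited.
        results.push({ ...base, pass: false, detail: 'skipped: no snapshot for this device + label', evidence: null });
        continue;
      }
      results.push({
        ...base,
        ...check.evaluate(latest, plan),
        evidence: { snapshotId: latest.id, payloadSha256: latest.payloadSha256, collectedAt: latest.collectedAt },
      });
    }
  }
  return results;
}
```

Because the runner only iterates the catalog array, a newly appended entry is picked up without any runner change, which is the property the section describes.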

Limitations

Honest scoping:

Read-only — pre-flight does NOT trigger snapshots itself. Operator triggers them via the existing POST /api/admin/dut/devices/{id}/snapshot endpoint
Stale snapshot tolerated up to 60 min — snapshot-fresh check fails if older. Adjust by editing the check, OR run a fresh snapshot before preflight
No automatic run-blocking yet — PR-D will integrate preflight into the Test Plan run-start flow (refuse to start if preflight fails)
Cisco UCS checks not yet wired — UCS adapter is in PR #199 queue. When merged, UCS-specific checks (no critical faults, thermal sane) get added
No write/remediation — pre-flight reports state but does not fix it. F-1 / F-2 (write ops) in the API_FEATURE_CATALOG.md cover that future capability
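
As an illustration of the 60-minute staleness limit above, a freshness rule with an adjustable threshold could be sketched as follows. `evaluateSnapshotFresh` and its parameters are hypothetical. The real `snapshot-fresh` check may hard-code the limit, which is why the text says to adjust it by editing the check.

```typescript
interface CheckOutcome {
  pass: boolean;
  detail: string;
}

const DEFAULT_MAX_AGE_MINUTES = 60;

// collectedAt: ISO 8601 timestamp from the snapshot evidence.
// now is injected so the rule can be evaluated deterministically.
function evaluateSnapshotFresh(
  collectedAt: string,
  now: Date,
  maxAgeMinutes: number = DEFAULT_MAX_AGE_MINUTES,
): CheckOutcome {
  const collectedMs = Date.parse(collectedAt);
  if (isNaN(collectedMs)) {
    return { pass: false, detail: `unparseable collectedAt: ${collectedAt}` };
  }
  const ageMinutes = (now.getTime() - collectedMs) / 60000;
  // Strictly younger than the limit, matching "< 60 min old" in the catalog.
  return {
    pass: ageMinutes < maxAgeMinutes,
    detail: `snapshot age ${ageMinutes.toFixed(1)} min (limit ${maxAgeMinutes} min)`,
  };
}
```

Passing `now` as an argument keeps the function free of hidden clock reads, so the same snapshot always yields the same verdict for a given reference time.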

What pre-flight does NOT replace

Operator judgment for non-checkable concerns (cable connections, physical layer, vendor support contracts)
The TLS Decrypt Mode Probe (which is independent of API state — the probe could detect "decrypt is configured but somehow not actually decrypting traffic", which API-only checks cannot)
Time-sync verification (separate check-time-sync.sh script — pre-flight will eventually call it as a check)

Related documents

DUT_API_INTEGRATION.md — what API integration the checks consume
DUT_API_OPERATIONS.md — how to register devices that pre-flight inspects
API_FEATURE_CATALOG.md — pre-flight checks correspond to category A items A-1 through A-7
TEST_PLANS.md — the plans pre-flight validates against
TIME_SYNC.md — separate time-sync gate; pre-flight will integrate it in a follow-up