
ADR 0014 — DOM (DUT Operating Mode) Production-Safe Family

  • Status: Accepted (formalized 2026-05-12 with v3.7.0 — scaffolds shipping in v3.7.0)
  • Date: 2026-05-10
  • Deciders: TLSStress.Art project
  • Targets: v5.0 (Phase 1 Materialization scaffolds: DOM-1..10 already merged Wave 5)
  • Patent claim family: claims #1..#12 (DOM/OOBI/GATEWAY/RELAY)

Context

Through 2026-05 the bench grew from a lab-only TLS decryption tester to a multi-mode platform that customers want to point at production NGFWs. Production traffic is unforgiving: a misclick on a write-mode test can drop live customer flows.

Test types accumulated without a unifying gate:

  • Branch Office (data plane)
  • Inspection Profile (5 named profiles + custom 2^10)
  • SDWAN/CoR (DIA + IPSec)
  • BGP Saturation (control plane)
  • MAC/ARP stress (L2 capacity)
  • PURE Production URL Replay (post-Scope-Freeze)

Each test type has a different production blast radius. Current safety is ad hoc ("operator should know better"), which is insufficient for pilot customers.

Decision

Add a DOM (DUT Operating Mode) discriminator that gates every state-changing operation against the operator's declared mode for that DUT. There are 5 modes; the behavior matrix follows.

The 5 modes

| Mode           | Description                    | Destructive ops?         | DDPB chain enforced?               |
|----------------|--------------------------------|--------------------------|------------------------------------|
| greenfield     | New DUT, no production traffic | yes                      | no                                 |
| staging        | Pre-prod mirror                | yes (warn-only)          | partial (warn-only)                |
| lab            | Dedicated lab DUT              | yes (warn-only)          | partial                            |
| production     | Live customer DUT              | NO, all blocked          | YES, all 7 layers                  |
| prod-partition | Operator-quarantined slice     | partial, explicit unlock | yes, bypass requires audit reason  |
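
The mode matrix can be expressed as a typed lookup so that every consumer reads the same source of truth. The following is a minimal sketch, not the shipped scaffold: the type names, the `MODE_MATRIX` constant, and the two helper functions are hypothetical, and only the mode names and cell values are taken from the matrix.

```typescript
// Sketch only: all identifiers here are illustrative assumptions.
// Mode names and cell values are transcribed from the mode matrix.
type DutMode = "greenfield" | "staging" | "lab" | "production" | "prod-partition";
type DestructiveOps = "allowed" | "warn-only" | "explicit-unlock" | "blocked";
type DdpbEnforcement =
  | "none"
  | "partial-warn-only"
  | "partial"
  | "all-layers"
  | "enforced-audited-bypass";

interface ModeBehavior {
  description: string;
  destructiveOps: DestructiveOps;
  ddpb: DdpbEnforcement;
}

// Record<DutMode, ...> makes the compiler reject a matrix with a missing mode.
const MODE_MATRIX: Record<DutMode, ModeBehavior> = {
  greenfield: { description: "New DUT, no production traffic", destructiveOps: "allowed", ddpb: "none" },
  staging: { description: "Pre-prod mirror", destructiveOps: "warn-only", ddpb: "partial-warn-only" },
  lab: { description: "Dedicated lab DUT", destructiveOps: "warn-only", ddpb: "partial" },
  production: { description: "Live customer DUT", destructiveOps: "blocked", ddpb: "all-layers" },
  "prod-partition": {
    description: "Operator-quarantined slice",
    destructiveOps: "explicit-unlock",
    ddpb: "enforced-audited-bypass",
  },
};

// True when destructive operations are blocked outright in this mode.
function isHardBlocked(mode: DutMode): boolean {
  return MODE_MATRIX[mode].destructiveOps === "blocked";
}

// True when destructive operations need an explicit operator unlock.
function needsExplicitUnlock(mode: DutMode): boolean {
  return MODE_MATRIX[mode].destructiveOps === "explicit-unlock";
}
```

Keying the record on the `DutMode` union means that adding a sixth mode later fails compilation until its row is filled in.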

DOM family components

  • DOM-1 — Discriminator: classifies the DUT into one of the 5 modes based on operator declaration + observed signals (live BGP sessions, syslog rate, persona test traffic absence).
  • DOM-2 — CAE (Conditional Automation Engine): per-mode policy loader. Decides which operations are auto-approved vs require unlock vs hard-blocked.
  • DOM-3 — PDD (Production Drift Detection): monitors for mode drift (e.g. operator labeled lab but BGP sessions appear) and alerts.
  • DOM-4 — CPOS schema: customizable per-test parameter overrides.
  • DOM-5 — CPOS atomic 2-phase commit: cross-tier coordinated config push (TierAdapter interface: Personas + DUT + Agents either all commit together or all roll back). 60s self-healing rollback.
  • DOM-6 — Profile Templates v2.0: full scenario state (config + expectations + thresholds) shareable across teams.
  • DOM-7 — PIE family: RSM (Replay Session Mode) + IR (Idempotent Restore) + HID (Health Indicator Dashboard) + AAE (Auto-Approve Engine).
  • DOM-8 — DUT Permission Indicators: Grafana dashboards visualizing per-DUT mode + active locks + recent CPOS commits.
  • DOM-9 — SPP (Smart Profile Predictor): ML cortex sidecar recommends profile adjustments based on past run telemetry.
  • DOM-10 — DDPB (Defense-in-Depth Production Blocking): 7-layer chain enforcing the production-mode hard-blocks (UI gate + API middleware + DB constraint + K8s admission webhook + RELAY pre-flight + DUT-side BTO check + audit trail before-and-after).
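
The DDPB chain described in DOM-10 can be pictured as an ordered list of checks where the first refusal stops the operation. The sketch below is illustrative only: the `Layer` interface, `runChain`, and the layer identifiers are hypothetical names, and modelling every layer as a yes/no predicate is a simplification (the audit-trail layer in particular records events rather than deciding). The layer order is the one listed in DOM-10.

```typescript
// Sketch only: identifiers are illustrative assumptions, not the project's API.
type LayerName =
  | "ui-gate"
  | "api-middleware"
  | "db-constraint"
  | "k8s-admission-webhook"
  | "relay-preflight"
  | "dut-bto-check"
  | "audit-trail";

// Layer order as listed in DOM-10 (RELAY pre-flight is layer 5).
const DDPB_LAYER_ORDER: readonly LayerName[] = [
  "ui-gate",
  "api-middleware",
  "db-constraint",
  "k8s-admission-webhook",
  "relay-preflight",
  "dut-bto-check",
  "audit-trail",
];

interface Operation {
  dutId: string;
  kind: string;
}

interface Layer {
  name: LayerName;
  allows(op: Operation): boolean;
}

interface ChainResult {
  allowed: boolean;
  blockedBy?: LayerName; // set only when a layer refused the operation
}

// Evaluate layers in order and stop at the first refusal, reporting
// which layer blocked so that a refusal is never anonymous.
function runChain(layers: readonly Layer[], op: Operation): ChainResult {
  for (const layer of layers) {
    if (!layer.allows(op)) {
      return { allowed: false, blockedBy: layer.name };
    }
  }
  return { allowed: true };
}
```

Returning the name of the blocking layer is a design choice of this sketch; it is one way to make a misconfigured layer visible instead of silent.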

PIE-PA — special case for PURE in production

Per ADR 0021 (PURE), PIE-PA is a 3-layer defense that is mandatory in production mode. It mitigates the MITM risk created by bench personas hosting real public IPs (200.130.x.x):

  1. Pod scale-to-0 (bench personas down)
  2. BGP withdraw (bench prefixes withdrawn from upstream)
  3. DNS sanity check (real DNS resolves to real IPs, not bench)

DDPB chain layer 5 (RELAY pre-flight) gates PURE production runs on all 3 PIE-PA layers being green.
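
The pre-flight gate reduces to "all three PIE-PA layers must be green". A minimal sketch follows, assuming the three layer states are already available as booleans; the `PiePaStatus` shape and both function names are hypothetical, and how each signal is actually collected is outside this example.

```typescript
// Sketch only: field and function names are illustrative assumptions.
interface PiePaStatus {
  podsScaledToZero: boolean;      // layer 1: bench personas are down
  bgpPrefixesWithdrawn: boolean;  // layer 2: bench prefixes withdrawn from upstream
  dnsResolvesToRealIps: boolean;  // layer 3: real DNS resolves to real IPs, not bench
}

// List the PIE-PA layers that are not green, in layer order.
function piePaFailures(status: PiePaStatus): string[] {
  const failures: string[] = [];
  if (!status.podsScaledToZero) failures.push("pod scale-to-0");
  if (!status.bgpPrefixesWithdrawn) failures.push("BGP withdraw");
  if (!status.dnsResolvesToRealIps) failures.push("DNS sanity check");
  return failures;
}

// RELAY pre-flight lets a PURE production run proceed only if nothing failed.
function relayPreflightAllowsPure(status: PiePaStatus): boolean {
  return piePaFailures(status).length === 0;
}
```

Returning the list of failed layers, rather than a bare boolean, lets the caller tell the operator which layer to fix.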

Architecture

Mode flow

operator → declares mode for DUT
        ↓
  DOM-1 Discriminator
        ↓
  CAE (DOM-2) loads policy for that mode
        ↓
  every state-changing op → check policy
                          → maybe require unlock
                          → maybe block
                          → always emit audit event
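
The last step of the mode flow (check policy, maybe require unlock, maybe block, always emit an audit event) can be sketched as a single gate function. Everything named here is an assumption for illustration: `DESTRUCTIVE_POLICY`, `gateOperation`, and the `AuditEvent` shape are hypothetical, and the policy values are one reading of the mode matrix in which warn-only modes are treated as auto-approved.

```typescript
// Sketch only: identifiers and the policy mapping are illustrative assumptions.
type DutMode = "greenfield" | "staging" | "lab" | "production" | "prod-partition";
type Decision = "auto-approve" | "require-unlock" | "block";

interface AuditEvent {
  dutId: string;
  mode: DutMode;
  operation: string;
  decision: Decision;
  proceeded: boolean;
}

// Assumed per-mode policy for destructive operations.
const DESTRUCTIVE_POLICY: Record<DutMode, Decision> = {
  greenfield: "auto-approve",
  staging: "auto-approve", // warn-only in the mode matrix
  lab: "auto-approve",     // warn-only in the mode matrix
  production: "block",
  "prod-partition": "require-unlock",
};

// Decide whether a state-changing operation may proceed.
// An audit event is appended on every path, including refusals.
function gateOperation(
  dutId: string,
  mode: DutMode,
  operation: string,
  unlocked: boolean,
  auditLog: AuditEvent[],
): boolean {
  const decision = DESTRUCTIVE_POLICY[mode];
  const proceeded =
    decision === "auto-approve" || (decision === "require-unlock" && unlocked);
  auditLog.push({ dutId, mode, operation, decision, proceeded });
  return proceeded;
}
```

In this sketch an unlock has no effect in production mode, since the decision there is `block` regardless of the `unlocked` flag.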

CPOS 2-phase commit (DOM-5)

phase 1 — PREPARE
  TierAdapter[Personas].prepare(payload) → ack/nack
  TierAdapter[DUT].prepare(payload)      → ack/nack
  TierAdapter[Agents].prepare(payload)   → ack/nack
  if any nack → abort, no state change

phase 2 — COMMIT
  TierAdapter[Personas].commit() → snapshot
  TierAdapter[DUT].commit()      → snapshot
  TierAdapter[Agents].commit()   → snapshot
  start 60s self-healing watchdog

watchdog — 60s
  if any tier reports unhealthy → rollback all snapshots
  if all healthy → finalize
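
The prepare, commit, and watchdog phases above can be sketched as one coordinator function. This is a simplified, synchronous illustration under stated assumptions: the source only names `prepare` and `commit` on `TierAdapter`, so the `rollback` and `healthy` methods, the `Outcome` type, and `twoPhaseCommit` itself are hypothetical. The 60s watchdog is modelled as a fixed number of health polls rather than a real timer, and real adapters would most likely be asynchronous.

```typescript
// Sketch only: rollback(), healthy(), and the coordinator are assumptions.
type TierName = "Personas" | "DUT" | "Agents";

interface TierAdapter<P, S> {
  name: TierName;
  prepare(payload: P): boolean;  // true = ack, false = nack
  commit(): S;                   // applies the change, returns a snapshot to restore from
  rollback(snapshot: S): void;   // hypothetical: restores the pre-commit state
  healthy(): boolean;            // hypothetical: health probe used by the watchdog
}

type Outcome = "aborted" | "rolled-back" | "finalized";

function twoPhaseCommit<P, S>(
  tiers: readonly TierAdapter<P, S>[],
  payload: P,
  watchdogPolls: number, // stands in for the 60s window
): Outcome {
  // Phase 1 (PREPARE): any nack aborts before any state changes.
  for (const tier of tiers) {
    if (!tier.prepare(payload)) return "aborted";
  }

  // Phase 2 (COMMIT): every tier commits and hands back its snapshot.
  const committed = tiers.map((tier) => ({ tier, snapshot: tier.commit() }));

  // Watchdog: if any tier reports unhealthy, roll back all snapshots
  // in reverse commit order.
  for (let poll = 0; poll < watchdogPolls; poll++) {
    if (tiers.some((tier) => !tier.healthy())) {
      for (const { tier, snapshot } of [...committed].reverse()) {
        tier.rollback(snapshot);
      }
      return "rolled-back";
    }
  }
  return "finalized";
}
```

Rolling back in reverse commit order is a choice made for this sketch; the source only states that all snapshots are rolled back.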

Consequences

Pros

  • Production safety: hard-block prevents misclicks on live DUTs
  • Atomic cross-tier config: no partial state across Personas/DUT/Agents
  • Audit trail: every mode transition + override logged for compliance
  • ML cortex (SPP): operators get profile recommendations from past runs
  • Patent moat: DOM/CPOS/PIE family = 12 of 17 claims

Cons / risks

  • DDPB's 7 layers add surface area: a single misconfigured layer can block operations silently
  • The 60s rollback window means a transient failure may roll back a config change that would otherwise have recovered
  • ML cortex (SPP) is opt-in; if disabled, operators lose the recommendation feature

Migration

  • Default DUT mode for existing benches: lab
  • Operators can re-classify per-DUT in /admin/dut//mode
  • DDPB chain rolled out behind a feature flag for first 2 releases

References

  • Memory: discuss_dom_dut_operating_mode_2026_05_10.md
  • Memory: project_pending_master_inventory_2026_05_10.md
  • Code: dashboard/src/lib/dom/ (DOM-1..10 scaffolds, Wave 5)
  • Patent claims: #1..#12 (DOM/OOBI/GATEWAY/RELAY family)