
ADR 0011 — Topology axes: deployment nodes × L2 fabric × DUT type

  • Status: Accepted
  • Date: 2026-05-08
  • Deciders: TLSStress.Art project
  • Targets: v4.5 (this work — formalizes axes + makes Nexus 9000 optional)

Context

Until v4.4, the test bench's deployment configuration mixed three independent concepts into a single "deployment mode" enum:

  • single-node (1 UCS, all roles colocated)
  • dual-node (2 UCS, agents vs personas split)
  • tri-node (3 UCS, Playwright vs k6 vs personas split)
  • multi-node (4+ UCS, full role separation)

Implicit assumptions baked into that enum:

  • L2 fabric is always Cisco Nexus 9000. Every deployment script, preflight check, runbook, and report annex assumes the Nexus is present. scripts/nexus/01-apply-tuning.nxos is referenced from scripts/k8s-dut-up.sh. Annex G layer 5 attests Nexus BPDU isolation. The dashboard preflight check airgapL2BpduIsolation queries Nexus state via SSH.
  • DUT is always a Cisco FTD. v4.5 introduces the Inspection Profile methodology, which reuses this assumption (FMC + FDM apply scripts); v4.6 will add the Cisco Catalyst Secure Router (Catalyst SD-WAN Manager, vManage REST API).

Two real-world deployment shapes break those assumptions:

  1. Single-node lab without external switch. The smallest production deployment — 1 UCS, 1 NGFW, no Nexus. The Nexus does not exist; any layer 5 BPDU attestation that requires it must skip cleanly. Operator wires the UCS NICs directly to the NGFW interfaces. This is the user's actual current setup as of 2026-05-08.

  2. Multi-node deployments where the operator owns a different L2 switch. Nexus is one option among many; a dual-node deployment in a customer rack might use a Catalyst 9300, Arista 7050, or a small-form-factor switch. The bench should not require a Nexus — only require the network properties Nexus provides (VLAN trunking, BPDU isolation at the boundary, MACsec on uplinks, etc.) which can be obtained from any modern switch.

Coupling all of this into a single deployment-mode enum makes the configuration brittle: every new permutation invents a new enum value, the matrix of supported combinations is implicit, and the Nexus-specific optimizations (scripts/nexus/01-apply-tuning.nxos) cannot be turned off without forking the install script.

Decision

Decompose deployment configuration into three independent axes:

| Axis | Values | What it controls |
| --- | --- | --- |
| deployment_nodes | single · dual · tri · multi | Number of UCS hosts and how roles are distributed (existing) |
| l2_fabric | nexus · none · (future: arista, catalyst, generic) | Which L2 switch (if any) carries inter-host traffic |
| dut_type | cisco-ftd · cisco-secure-router · (future: palo-alto, fortinet, ...) | Which security DUT is under test; gates which apply/verify scripts run |

Each axis is independent — any value of any axis composes with any value of the others. The matrix is explicit (see §"Permitted combinations" below) so unsupported permutations fail loudly at config-load time rather than silently producing broken state.

Single source of truth: platform/topology.yaml

```yaml
# platform/topology.yaml — TLSStress.Art deployment topology declaration.
# Single source of truth for the three axes; consumed at every layer
# (install scripts, dashboard preflight, report cover, runbooks).
version: 1
deployment_nodes: single        # single | dual | tri | multi
l2_fabric: none                 # nexus | none (future: arista | catalyst | generic)
dut_type: cisco-ftd             # cisco-ftd | cisco-secure-router (future: more)

# Per-axis details (only the relevant fields are read for the chosen value)
single_node:
  cabling: multi-nic-trunk      # multi-nic-trunk | single-nic-trunk
  # When `multi-nic-trunk`: each physical NIC on the UCS carries an
  # 802.1q-tagged trunk to a different NGFW interface. This distributes
  # bandwidth across NICs and isolates traffic classes (agents vs
  # personas) onto different physical links. Default and recommended.

l2_fabric_nexus:
  # Only consulted when l2_fabric == nexus.
  apply_tuning_script: scripts/nexus/01-apply-tuning.nxos
  verify_script: scripts/nexus/02-verify.nxos
  bpdu_classification: scripts/nexus/PORT_CLASSIFICATION.md

dut_cisco_ftd:
  # Only consulted when dut_type == cisco-ftd. (FMC vs FDM credentials
  # belong here, populated by the operator at install time.)
  manager: fmc                  # fmc | fdm

dut_cisco_secure_router:
  # Only consulted when dut_type == cisco-secure-router.
  manager: catalyst-sd-wan-manager   # always vManage; field is reserved
```

The file is loaded at dashboard startup by dashboard/src/lib/topology/loader.ts, validated with Zod, and materialized into a Kubernetes ConfigMap (topology, in the web-agents namespace) so cluster-side scripts and DaemonSets read the same values. The operator never edits the ConfigMap directly — edit topology.yaml and re-run the install.
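A minimal sketch of the contract loader.ts enforces, using plain type guards in place of the real Zod schema (the axis names and values come from this ADR; parseTopology and the error wording are hypothetical):

```typescript
// Sketch of the validation performed by dashboard/src/lib/topology/loader.ts.
// The real loader uses Zod; this type-guard version shows the same contract:
// reject unknown axis values at config-load time, naming the offending axis.

type DeploymentNodes = "single" | "dual" | "tri" | "multi";
type L2Fabric = "nexus" | "none";
type DutType = "cisco-ftd" | "cisco-secure-router";

interface Topology {
  version: 1;
  deployment_nodes: DeploymentNodes;
  l2_fabric: L2Fabric;
  dut_type: DutType;
}

const NODES: readonly string[] = ["single", "dual", "tri", "multi"];
const FABRICS: readonly string[] = ["nexus", "none"];
const DUTS: readonly string[] = ["cisco-ftd", "cisco-secure-router"];

function parseTopology(raw: Record<string, unknown>): Topology {
  if (raw.version !== 1)
    throw new Error("topology.yaml: unsupported version");
  if (!NODES.includes(raw.deployment_nodes as string))
    throw new Error(`topology.yaml: bad deployment_nodes: ${raw.deployment_nodes}`);
  if (!FABRICS.includes(raw.l2_fabric as string))
    throw new Error(`topology.yaml: bad l2_fabric: ${raw.l2_fabric}`);
  if (!DUTS.includes(raw.dut_type as string))
    throw new Error(`topology.yaml: bad dut_type: ${raw.dut_type}`);
  return raw as unknown as Topology;
}
```

Because the parse happens once at startup, every downstream consumer (preflight, report cover, ConfigMap materialization) can assume a well-typed topology.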

Permitted combinations

Any (deployment_nodes, l2_fabric, dut_type) triple is permitted unless explicitly forbidden below. The dashboard refuses to start with an unsupported combination and prints which axis is the problem.

| deployment_nodes | l2_fabric | OK? | Notes |
| --- | --- | --- | --- |
| single | none | Yes | Default. UCS NICs cabled directly to NGFW. |
| single | nexus | Yes | Permitted but unusual — operator chose to put a Nexus in front of a single-host lab. The bench will configure Nexus tuning + BPDU isolation as if it were a multi-node deployment. No technical block; flagged for operator review. |
| dual / tri / multi | nexus | Yes | Standard production shape. |
| dual / tri / multi | none | Yes | Operator wires inter-host traffic via direct cabling or via a switch the bench does not configure. Bench skips Nexus tuning + Nexus-side BPDU attestation; Linux-bridge BPDU attestation continues to run. |

dut_type is independent of all of the above.
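The combination check reduces to a small pure function. A sketch, reflecting the table above (checkCombination and the warning text are hypothetical names; today every triple is permitted, with single+nexus flagged for review):

```typescript
// Sketch of the permitted-combinations check. dut_type is independent of
// the two axes below and is deliberately not an input here.

type Nodes = "single" | "dual" | "tri" | "multi";
type Fabric = "nexus" | "none";

interface Verdict {
  ok: boolean;
  warning?: string; // permitted, but surfaced for operator review
}

function checkCombination(nodes: Nodes, fabric: Fabric): Verdict {
  if (nodes === "single" && fabric === "nexus") {
    return {
      ok: true,
      warning:
        "Nexus in front of a single-host lab is unusual; flagged for operator review",
    };
  }
  // All remaining triples are permitted outright. Future forbidden
  // combinations would return { ok: false } with the offending axis named.
  return { ok: true };
}
```

Keeping the matrix in one function (rather than scattered conditionals) is what lets the dashboard fail loudly at config-load time and print which axis is the problem.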

Gating behavior per layer

The single-node-without-Nexus case (the user's current setup) is the exemplar. For every layer that previously assumed Nexus, this ADR defines what happens when l2_fabric == none:

  • scripts/k8s-install.sh — skips the Nexus tuning step. Tuning is announced in the install log as "skipped — l2_fabric=none".
  • scripts/airgap-l2-verify.sh — accepts --no-nexus flag (also honors env L2_FABRIC=none so the install harness can set it once and forget). Layer 5 of the report's airgap attestation gets nexus.skipped: true with reason: "no L2 fabric in this topology". Linux-bridge BPDU attestation still runs (the bridges are inside the UCS and exist regardless of external switch).
  • dashboard/src/lib/preflight/airgap-checks.ts — airgapL2BpduIsolation receives l2FabricPresent: boolean via the AirgapCheckContext. When false, the Nexus subcheck is not even dispatched (no SSH attempt, no error logged); the result keeps nexus.skipped: true with the by-design reason.
  • dashboard/templates/annex-g-airgap-attestation.md.tmpl — already has a {{#nexus.skipped}} Mustache section. The reason text gets a tonally-neutral phrasing when the cause is "by design" vs the alarm-tone phrasing for "credentials missing" or "SSH timed out".
  • Run report cover page — adds a single line "Topology: single-node · L2 fabric: none" so the customer reading the report knows what shape was attested.

What does NOT change

  • Linux-bridge BPDU isolation (k8s/dut/48-bpdu-guard-daemonset.yaml) continues to run unconditionally on every DUT data-plane node. The bridges are inside the UCS regardless of external switch presence.
  • The 3-layer L2 BPDU defense methodology (ADR 0009) stays valid; layer 2 (Nexus) just becomes conditional — when absent, the defense collapses to 2 layers (Linux bridges + live capture), which is the correct posture for a deployment without an external switch to defend against.
  • The 5 air-gap layers (ADR 0007 §G + ADR 0009) are unchanged in number — layer 5 just attests fewer sub-sources when fabric is absent.

Single-node cabling: multi-nic-trunk

The user's current setup uses multiple physical NICs on the UCS, each carrying an 802.1q-tagged trunk to a different NGFW interface. This is the recommended pattern and the default value of single_node.cabling when the field is not overridden:

  • Each NIC is a separate physical link: enp1s0, enp1s1, etc.
  • Each NIC connects to a different NGFW interface (subscriber-side, server-side, management, etc.).
  • Each NIC carries the appropriate VLANs as 802.1q sub-interfaces (enp1s0.20, enp1s0.30, etc.).
  • Linux bridges + Multus NADs handle VLAN demux internally; from the pod's POV the topology looks identical to the multi-node case with a Nexus.

Why multi-NIC-trunk over single-NIC-trunk:

  • Bandwidth distribution: a single 10G link to the NGFW would saturate before reaching the agent fleet's combined throughput. Multiple 10G NICs give linear scaling.
  • Traffic-class isolation: agents-class traffic and persona-class traffic on different NICs do not interfere. A burst of TLS handshakes from agents does not back up persona-bound traffic.
  • Failure isolation: a NIC failure takes down one traffic class rather than the entire bench.

The single-nic-trunk alternative is supported (one NIC carrying every VLAN) for environments where physical NICs are scarce, but is not the default.

Consequences

Positive

  • Single-node lab without Nexus becomes a first-class deployment, matching the user's actual setup. No more silent assumption that Nexus is present.
  • The Nexus is no longer a hard prerequisite. Operators with Catalyst 9300, Arista, etc. can run the bench by setting l2_fabric: none (we don't configure their switch — they configure it manually) and the bench attests what it can.
  • The configuration matrix is explicit. New deployment shapes (e.g. cloud-native single-node on AWS) are just new axis values, not new enum constants spread across N scripts.
  • DUT type axis decouples NGFW vendor from topology. v4.6 (Cisco SR), v4.7 (PAN), v4.8 (Fortinet) all add dut_type values without needing to touch any topology / fabric code paths.

Negative

  • One more configuration file. topology.yaml joins personas.yaml, platform/test-plans/catalog.yaml, etc. as repo-versioned config the operator must understand. Mitigated by making the defaults sensible (single-node + l2_fabric=none + dut_type=cisco-ftd is the dev-setup default).
  • More code paths to test. Every place that touches L2 attestation now has two branches (with-Nexus / without-Nexus). Mitigated by funneling all "is Nexus present?" decisions through one helper reading the loaded topology, rather than scattered if-checks.
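That single helper can be sketched as below. The inferred default (single-node implies no fabric when the field is omitted) follows this ADR; the function names and the explicit-value requirement for multi-node topologies are hypothetical details of the sketch:

```typescript
// Sketch of the one funnel through which all "is Nexus present?"
// decisions flow, instead of scattered string comparisons.

interface TopologyLite {
  deployment_nodes: "single" | "dual" | "tri" | "multi";
  l2_fabric?: "nexus" | "none";
}

function effectiveL2Fabric(t: TopologyLite): "nexus" | "none" {
  if (t.l2_fabric !== undefined) return t.l2_fabric;
  // Inferred default: a single-node topology without an explicit
  // l2_fabric means no external switch exists.
  if (t.deployment_nodes === "single") return "none";
  // Sketch assumption: multi-node topologies must declare their fabric.
  throw new Error("l2_fabric must be set explicitly for multi-node topologies");
}

function l2FabricPresent(t: TopologyLite): boolean {
  return effectiveL2Fabric(t) !== "none";
}
```

With this in place, the with-Nexus / without-Nexus branches all test the same boolean, so a topology change flips every layer consistently.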

Neutral

  • Existing single-node operators (the user) get the right behavior out of the box without any config change — l2_fabric: none is the inferred default when deployment_nodes: single.

Alternatives considered

Alternative A — Keep the single deployment-mode enum, add single-node-no-nexus as a 5th value

Reject. Combinatorial explosion the moment a 6th case appears (e.g. "dual-node-arista" or "tri-node-no-nexus"). Three independent axes grow linearly — adding a value to one axis is one new entry — while a single mega-enum must enumerate the full cross-product of all axes.

Alternative B — Detect L2 fabric presence at runtime (probe SSH to the configured Nexus address)

Reject. Adds a probe step at every install / preflight, and is fundamentally guessing — a Nexus that exists but is momentarily unreachable looks identical to one that does not exist at all. The operator's intent should be declarative, not inferred.

Alternative C — Make L2 fabric optional at the script level, leave dashboard / report templates assuming Nexus

Reject. The annex G template would print misleading "Nexus check failed" entries on every report from a single-node deployment. Operator-facing surface (cover page, dashboard cards) needs to reflect the topology the run actually attested — anything else damages report defensibility.

Alternative D — Drop the dut_type axis (treat each release as a fork)

Reject. v4.6 needs to coexist with v4.5 for customers that have FTD in production today and Cisco SR coming next year. A single bench running both is the realistic scenario; two parallel forks is operationally untenable.

Implementation references

  • platform/topology.yaml — single source of truth, this PR
  • dashboard/src/lib/topology/loader.ts — Zod schema + loader + cache, this PR
  • dashboard/src/lib/preflight/airgap-checks.ts — airgapL2BpduIsolation accepts l2FabricPresent via context, this PR
  • dashboard/src/lib/preflight/environmental.ts — composer reads topology cache and forwards l2FabricPresent, this PR
  • dashboard/templates/annex-g-airgap-attestation.md.tmpl — already has nexus.skipped Mustache section, no template change needed; the preflight check supplies a tonally-neutral reason string
  • scripts/k8s-install.sh — gates Nexus tuning by topology, deferred to a follow-up PR (operator-facing scripts touch real machines, warrant a separate review)
  • scripts/airgap-l2-verify.sh — accepts --no-nexus flag, deferred to the same follow-up PR
  • docs/L2_ISOLATION.md (+ .pt-BR.md, .es.md) — adds "When no L2 fabric exists" section, deferred to the same follow-up PR
  • CLAUDE.md — deployment table + stack table, this PR

References

  • ADR 0007 — Public-Internet Realism (5-layer airgap; layer 2 is the one this ADR makes conditional)
  • ADR 0009 — L2 BPDU isolation (3-layer defense; layer 2 of that defense becomes conditional)
  • ADR 0010 — Inspection Profile (the per-vendor dut_type axis is exercised here for the first time, with cisco-ftd as v4.5 MVP and cisco-secure-router as v4.6)
  • IEEE 802.1Q — VLAN trunking (the multi-NIC-trunk single-node pattern)