TLSStress.Art — Architecture overview

Concise tour of TLSStress.Art. Pairs with the Architecture Decision Records.

Scope status: SCOPE FREEZE locked 2026-05-10 evening. Vision phase closed; execution phase active. 37 MÓDULOs, 7 Test Kinds, 25 patent claims (17 original + 8 new ZTP-prem #18–#25), ~133 PRs across Phase 1 (Materialization).

Release status: v3.7.0 (2026-05-12) — Zero-Trust-on-Premises closed at 12 of 12 layers operationally. See docs/SECURITY_ZTP_PREM.md for the canonical security reference page and the § ZTP-prem section below for how it integrates with the rest of the architecture.

MÓDULO X.Art — canonical component naming (locked 2026-05-09, expanded 2026-05-10)

The platform exposes its components to operators and investors as MÓDULO X.Art (the .Art suffix reinforces brand identity). Internal code keeps its technical naming (vyos-*, web-agent, etc.) to avoid a cross-cutting refactor; UI, docs, diagrams, and alerts use the customer-facing labels via a moduleLabel(internalName) helper.
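A minimal sketch of what such a helper can look like (the mapping entries and the fallback here are illustrative, not the shipped table):

const MODULE_LABELS: Record<string, string> = {
  'vyos-isp-router': 'MÓDULO ISP.Art',
  'vyos-ospf-peer': 'MÓDULO OSPF.Art',
  'web-agent': 'MÓDULO PW.Art',
  'k6-agent': 'MÓDULO K6.Art',
  // ...one entry per internal name
};

// Patterned names (vyos-bgp-peer-3 → MÓDULO BGP-3.Art) resolve by prefix.
export function moduleLabel(internalName: string): string {
  const exact = MODULE_LABELS[internalName];
  if (exact) return exact;
  const bgp = internalName.match(/^vyos-bgp-peer-(\d)$/);
  if (bgp) return `MÓDULO BGP-${bgp[1]}.Art`;
  return internalName; // unmapped names fall through unchanged
}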

Customer-facing Internal code/manifest Function
MÓDULO ISP.Art vyos-isp-router Synthetic ISP / default route from NGFW
MÓDULO BGP-1..4.Art vyos-bgp-peer-{1..4} 4 eBGP peers, per-peer prefix count
MÓDULO OSPF.Art vyos-ospf-peer OSPFv2/v3 LSA injection (up to 2M /30 LSAs)
MÓDULO SDWAN CoR-1..10.Art vyos-vpn-remote-{1..10} 10 IPSec tunnels, per-tunnel bandwidth + workload mix
MÓDULO VXLAN-1..3.Art vyos-vtep-{playwright,k6,mac-arp} VTEP front-end for Agents (TRUST-only VXLAN)
MÓDULO PW.Art browser-engine agents (web-agent) HTTP/2 + HTTP/3 browser-driven load
MÓDULO K6.Art synthetic-load agents (k6-agent) TCP/TLS load with p50/p95/p99 stats
MÓDULO GO.Art control-plane-stress-agent MAC + ARP/NDP table flooding (Go agent)
MÓDULO IPr.Art iperf agents (future SDWAN-Agent) Inner workload generator inside SDWAN tunnels
MÓDULO PERSONAS.Art persona pods (persona-* ns) 100 personas / 20 countries / real NREN public IPs
MÓDULO CLONER.Art cloner pod 9 functions: clone + NTP + feedback + catalog refresh + API discovery + patch+TopURL fetch + upgrade + cloud-proxy + PVI orchestrator
MÓDULO FLOW.Art NetFlow/IPFIX receiver Flow telemetry collector
MÓDULO SYSLOG.Art promtail-syslog Syslog ingestion (Promtail)
MÓDULO SNMP.Art snmp_exporter SNMP polling of Nexus + NGFW
MÓDULO API INFRA.Art DUT API orchestrator + future Cloner ext Vendor REST orchestration (FMC, FortiOS, PAN-OS, Gaia, etc.)
MÓDULO CLI.Art ansible-orchestrator SSH/TELNET orchestration via Ansible playbooks
MÓDULO CA.Art ⭐ 2026-05-10 cert-manager + persona-ca-issuer + sub-CA tlsstress-fleet-ca Certificate orchestration (persona certs, NGFW CA, mTLS to cloud)
MÓDULO DoYour.Art ⭐ 2026-05-10 doyour-studio (gVisor sandbox) Operator-crafted tests via Scapy + Go embed + PCAP replay (3-mode Art Studio UI)
MÓDULO KALI.Art ⭐ 2026-05-10 kali-pod (gVisor sandbox) Full Kali Linux pen-test distro (600+ tools, browser terminal, Team+ tier)
MÓDULO HAR.Art ⭐ 2026-05-10 har-replay Application-layer HAR replay (10k sessions/host scale, WAF tuning, regression)
MÓDULO VALIDATOR.Art ⭐ 2026-05-10 validator-api + ML cortex sidecar Central enrollment, role assignment, ML-driven orchestration (cloud-portable)
MÓDULO TREX.Art ⭐⭐ 2026-05-10 trex-pod (DPDK + hugepages) Cisco TRex stateful traffic gen — line-rate TCP/UDP/IPSec (30Mpps/core, Team+)
MÓDULO SPAN.Art ⭐⭐ 2026-05-10 span-collector (libpcap → AF_XDP → DPDK → SmartNIC tiers) Line-rate packet capture / mirror — JA3/JA4 + TLS metadata + DUT validation
MÓDULO RELAY.Art ⭐⭐ 2026-05-10 relay-bridge (HA pair) Bridge OOBI ↔ customer MGMT — telemetry ingress + control egress (read-only default)
MÓDULO GATEWAY.Art ⭐⭐ 2026-05-10 gateway-proxy (operator entry) LDAP+SAML+passkey + RBAC + audit; OBP reverse-tunnel acceptor

Total: 37 MÓDULOs.Art mapped (28 original + 9 added 2026-05-10). Detailed mapping in memo feedback_module_naming_2026_05_09 and post-Scope-Freeze additions in discuss_module_*_2026_05_10.

3-plane administrative classification (locked 2026-05-10)

The 37 MÓDULOs are administratively split into 3 planes based on function (see memo project_module_planes_classification_2026_05_10):

Plane Count MÓDULOs
DATA PLANE 22 ISP, SDWAN CoR-{1..10} (10), VXLAN-{1..3} (3), PW, synthetic-load engine, GO, IPr, DoYour, KALI, TREX, HAR
CONTROL PLANE 5 BGP-{1..4} (4), OSPF
MGMT PLANE 10 CLONER, FLOW, SYSLOG, SNMP, API INFRA, CLI, CA, VALIDATOR, RELAY, GATEWAY + PERSONAS controller (hybrid)

Storage gravity principle: MGMT MÓDULOs with heavy storage (SYSLOG, FLOW, CLONER cache, Infra Stack TSDB+reports) are MANDATORY on-prem regardless of cloud-split mode. MGMT-light (CA, API INFRA, CLI, SNMP, VALIDATOR, RELAY, GATEWAY) are cloud-portable.

7 canonical Test Kinds (locked post-Scope-Freeze)

Kind Purpose
tls-throughput Legacy two-leg TLS decryption throughput
branch-office Free-form Mbps/Gbps WAN-only asymmetric
inspection-profile 5 named profiles (minimal/balanced/paranoid/compliance/sandbox)
sdwan-cor SDWAN/Cloud On-Ramp + DIA
bgp-saturation Control-plane RIB stress (real Internet snapshot or synthetic)
mac-arp-stress L2 capacity stress (table fill)
pure Production URL Replay Engine (real-world site replay against DUT)

OOBI orchestration plane — VXLAN fabric (locked 2026-05-10, immutable canon)

In multi-node deployments, the Infra Stack (Dashboard + Postgres + Prometheus + Grafana + MÓDULO API INFRA.Art + MÓDULO CLI.Art) reaches each MÓDULO via OOBI on the existing VLAN 99 mgmt segment. Single-node deployments use the same VLAN as an in-host bridge — zero infra change.

OOBI is canonically immutable (per ADR 0019, locked 2026-05-10):

  • IPv4: 10.254.254.0/24 (VLAN 99)
  • IPv6: fd5a:7c5e:a72:0::/64 (ULA)
  • VXLAN VNI: 254254 over UDP/4789 (HER underlay)
  • 3-phase bootstrap: Cabling → mDNS Discovery → Overlay up

Subnet 10.254.254.0/24 (VLAN 99) carve-out:

Range Function
.1-.3 Existing: Nexus 9000 SVI gateway, Nexus MGMT0, NGFW management interface (intact)
.4-.49 Reserved DUT mgmt expansion + existing infra (snmp_exporter, etc.)
.50-.79 MGMT-light MÓDULOs
.80-.84 Special slots locked 2026-05-10 (CA .80, DoYour .81, KALI .82, HAR .83, VALIDATOR .84)
.85-.99 Spare for future MGMT-light MÓDULOs
.100 Infra Stack VIP (orchestration source)
.220 MÓDULO TREX.Art (DATA plane data-leg, MGMT entry on OOBI)
.230 MÓDULO SPAN.Art (line-rate capture, MGMT entry on OOBI)
.240-.241 MÓDULO RELAY.Art (HA pair — bridge OOBI ↔ customer MGMT)
.250 MÓDULO GATEWAY.Art (operator entry proxy — LDAP+SAML+passkey)

NetworkPolicy enforces that only 10.254.254.100 (the Infra Stack VIP) can initiate connections to MÓDULO OOBI IPs. Lateral movement (MÓDULO ↔ MÓDULO via OOBI) is blocked by default. DUTs and customer-side switches NEVER join the OOBI overlay — RELAY.Art bridges to them via dedicated MGMT NICs (read-only default).

DOM — DUT Operating Mode (locked 2026-05-10)

Per ADR 0014 (DOM family) the bench's behavior is gated by 5 discriminator modes:

Mode Description Production-blocking workflows?
greenfield New DUT, no production traffic none
staging Pre-prod mirror warn-only
lab Dedicated lab DUT warn-only
production Live customer DUT all destructive ops blocked by DDPB chain
prod-partition Production but operator-quarantined slice partial — destructive ops require explicit unlock

DOM family includes: CAE (Conditional Automation Engine), CPOS (Customizable Profile / Override Stack with atomic 2-phase commit), PIE family (RSM/IR/HID/AAE), DDPB (Defense-in-Depth Production Blocking, 7-layer chain), PDD (Production Drift Detection), SPP (Smart Profile Predictor — ML cortex integration), PIE-PA (3-layer defense for PURE in production: pod scale-to-0 + BGP withdraw + DNS sanity).
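As a sketch of the discriminator idea (first gate only; the shipped DDPB chain stacks seven such layers, and these type and function names are illustrative):

type DomMode = 'greenfield' | 'staging' | 'lab' | 'production' | 'prod-partition';

interface OpContext {
  destructive: boolean;        // e.g. config push, reboot, policy overwrite
  explicitUnlock?: boolean;    // operator-acknowledged unlock (prod-partition)
}

// Illustrative first gate only — DDPB layers more checks behind this one.
function admitOperation(mode: DomMode, op: OpContext): { allowed: boolean; note?: string } {
  if (!op.destructive) return { allowed: true };
  switch (mode) {
    case 'greenfield':
      return { allowed: true };
    case 'staging':
    case 'lab':
      return { allowed: true, note: 'warn-only: destructive op on non-production DUT' };
    case 'prod-partition':
      return op.explicitUnlock
        ? { allowed: true, note: 'explicit unlock recorded' }
        : { allowed: false, note: 'destructive op requires explicit unlock' };
    case 'production':
      return { allowed: false, note: 'blocked by DDPB chain' };
  }
}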

PURE — Production URL Replay Engine (locked 2026-05-10)

Test Kind #7 (added post-Scope-Freeze, 17 PRs). Replays real customer URLs through the DUT for production-mode validation.

  • Discovery Hub ingests from 8 sources: Syslog · Vendor API · PCAP · HAR · Curated · SPAN · Cloud-derived · KALI nmap import
  • PVI (Pre-flight Validation, Ingestion-time): CLONER fn #9 ephemeral synthetic-load engine/PW probe pods, 3-stage cascade, internet-direct via CLONER egress
  • PVP (Pre-flight Validation, Pre-test): DUT-delta scope check before each run
  • PIE-PA 3-layer defense MANDATORY in production (avoids MITM risk from personas on real public IPs 200.130.x.x)
  • Pre-curated Tranco/Umbrella/Majestic monthly refresh + air-gap fallback + per-test audit trail

ZTP-prem — Zero-Trust-on-Premises posture (locked 2026-05-12)

Zero-Trust-on-Premises is the 12-layer security posture that defends the bench against an insider operator with kubectl and root credentials inside the customer's own organisation. Closed at 12 of 12 layers operationally with v3.7.0. Canonical reference page: docs/SECURITY_ZTP_PREM.md.

# Layer Status Reference
01 Cloud HSM custody runtime gate shipping; real HSM probe scheduled dashboard/src/lib/license/hsm-heartbeat.ts
02 Confidential Computing detection shipping (per-node DaemonSet) pkg/ztp-prem-detect/
03 TPM measured boot probe scaffold shipping pkg/ztp-prem-tpm/
04 Sealed audit hash-chain shipping with replay verifier dashboard/src/lib/license/sealed-audit.ts
05 K8s admission webhook audit + enforce + break-glass + chain cross-correlation pkg/ztp-prem-admission/ · k8s/ztp-prem/admission-webhook.yaml
06 Anti-debug runtime distroless + readonly-rootfs + dropped caps every Tier B Dockerfile
07 Tier A/B partition policy YAML + CI lint platform/ztp-prem/tier-policy.yaml
08 UTXO token vault shipping (notes-not-balances) dashboard/src/lib/license/utxo.ts
09 Tier B obfuscation CI gate shipping (garble matrix) scripts/ztp-prem-obfuscate-tier-b.sh
10 DLP egress monitor shipping (5 rules + redaction policy) dashboard/src/lib/dlp/
11 Behavioural anomaly detector rule-based 4-detector shipping dashboard/src/lib/ztp-prem/anomaly-detector.ts
12 Separation of duties CODEOWNERS + nightly SoD audit + policy memo .github/CODEOWNERS · platform/ztp-prem/SEPARATION-OF-DUTIES.md
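The hash-chain idea behind layer 04 (sealed audit), as a sketch; the record shape and field names are assumptions, not the sealed-audit.ts wire format:

import { createHash } from 'node:crypto';

interface SealedRecord {
  seq: number;
  payload: string;   // canonical JSON of the audited event
  prevHash: string;  // hash of the previous record ('GENESIS' for seq 0)
  hash: string;      // SHA-256 over seq + payload + prevHash
}

function sealNext(chain: SealedRecord[], payload: string): SealedRecord {
  const prevHash = chain.length ? chain[chain.length - 1].hash : 'GENESIS';
  const seq = chain.length;
  const hash = createHash('sha256').update(`${seq}|${payload}|${prevHash}`).digest('hex');
  return { seq, payload, prevHash, hash };
}

// Replay verifier: recompute every link; any edit or deletion breaks the chain.
function verifyChain(chain: SealedRecord[]): boolean {
  return chain.every((rec, i) => {
    const prevHash = i === 0 ? 'GENESIS' : chain[i - 1].hash;
    const expected = createHash('sha256')
      .update(`${rec.seq}|${rec.payload}|${prevHash}`)
      .digest('hex');
    return rec.prevHash === prevHash && rec.hash === expected;
  });
}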

Three market-differentiated primitives (patent claims #18–#20):

  • Cross-language signing contract (Patent #18) — Go (pkg/ztp-prem-signctl/canonical.go) and Node (dashboard/src/lib/license/envelope.ts) produce byte-identical 295-byte canonical envelope signatures, including post-quantum primitives (ML-KEM-768 + ML-DSA-65). Verified across runtimes.
  • UTXO token vault (Patent #19) — tokens are tamper-evident notes, not a mutable balance. No balance column in the schema; no race condition is possible because no row is ever updated (sketched after this list).
  • Admission ↔ sealed-audit cross-correlation (Patent #20) — K8s admission decisions are lifted from the webhook's local ring buffer into the sealed audit hash-chain via dashboard/src/lib/ztp-prem/admission-correlate.ts. Pod restarts do not erase forensic evidence.
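A sketch of the notes-not-balances idea from Patent #19; the types and helper names are illustrative, not the dashboard schema:

interface Note {
  id: string;      // random note id, signed into the licence envelope
  units: number;   // denomination, fixed at issuance
}

interface Spend {
  noteId: string;  // which note was consumed
  runId: string;   // what it was consumed for
  at: string;      // ISO timestamp
}

// Append-only: issuance inserts Notes, consumption inserts Spends.
// "Available" is derived, never stored, so there is no balance row to race on.
function availableNotes(notes: Note[], spends: Spend[]): Note[] {
  const spent = new Set(spends.map((s) => s.noteId));
  return notes.filter((n) => !spent.has(n.id));
}

In this model a unique constraint on the spend's noteId turns a double-spend into an insert conflict rather than a lost update.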

The operator-visible /admin/ztp-prem page ships 7 cards in v3.7.0: Confidential Computing status · Tier A/B policy · License envelope summary · Envelope import · Sealed audit replay verifier · DLP egress monitor · Admission audit (with mode banner + DENIED/BREAK-GLASS badges + Correlate-to-sealed-audit button). Four admin endpoints back the cards, plus the new ones that landed in v3.7.0: /admission-audit · /admission-correlate · /hsm-heartbeat · /anomalies.

Integration with the rest of the architecture:

  • ZTP-prem does not change the bench's test execution; it surrounds it with auditability, admission policy, and key custody.
  • The license authorize() gate is the chokepoint where ZTP-prem signals are consumed (HSM heartbeat freshness, UTXO note availability, sealed audit append, admission decision recorded).
  • All ZTP-prem env flags default to OFF / audit-only in v3.7.0 — enforcement flips (admission enforce mode, ZTP_PREM_HSM_REQUIRED=true) are explicit operator actions following the canary rollout guides embedded in the K8s manifest comments and the operator playbook in docs/SECURITY_ZTP_PREM.md.
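A sketch of the chokepoint shape; the signal names and the 60 s freshness threshold are assumptions, not the shipped authorize():

interface ZtpPremSignals {
  hsmHeartbeatAgeMs: number | null;  // null when ZTP_PREM_HSM_REQUIRED is off
  noteAvailable: boolean;            // an unspent UTXO note exists for this run
  admissionDecision: 'allow' | 'deny' | 'break-glass';
}

async function authorize(
  signals: ZtpPremSignals,
  sealAudit: (event: string) => Promise<void>,
): Promise<boolean> {
  // Every decision is appended to the sealed audit chain, allow or deny.
  if (signals.hsmHeartbeatAgeMs !== null && signals.hsmHeartbeatAgeMs > 60_000) {
    await sealAudit('deny: stale HSM heartbeat');
    return false;
  }
  if (!signals.noteAvailable) {
    await sealAudit('deny: no unspent licence note');
    return false;
  }
  if (signals.admissionDecision === 'deny') {
    await sealAudit('deny: admission webhook');
    return false;
  }
  await sealAudit(`allow (admission=${signals.admissionDecision})`);
  return true;
}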

Stack diagram

flowchart LR
    subgraph operator["Operator"]
        UI["Next.js dashboard<br/>App Router · Tailwind · Drizzle"]
    end

    subgraph k8s["Kubernetes (or docker-compose)"]
        DASH["dashboard pods<br/>HMAC session · audit log<br/>SSE stream · Prometheus /api/metrics"]
        AGENTS["browser-engine agents (1..300)<br/>browser engine<br/>per-target circuit breaker<br/>idempotency keys · TLS 1.3"]
        K6AGENTS["synthetic-load agents (1..1000)<br/>grafana/k6 binary<br/>p50/p95/p99 latency<br/>error rate · throughput"]
        PG["postgres 16<br/>cluster_config · agents · runs<br/>k6_agents · k6_targets · k6_runs<br/>run_resources · metrics_buckets<br/>audit_log · idempotency"]
        subgraph ztpprem["ZTP-prem 12/12 layers (v3.7.0)"]
            ADM["admission webhook<br/>audit · enforce · break-glass"]
            SEAL["sealed audit hash-chain<br/>WORM + cross-correlation"]
            DLP["DLP egress monitor<br/>5 baseline rules · observe-only"]
            VAULT["UTXO token vault<br/>LICENSE.Art envelope"]
        end
    end

    subgraph external["Public Internet"]
        SITES["Target HTTPS sites<br/>(g1, uol, NASA, Cisco …)"]
    end

    OP[(Browser)] -->|"/login · cookie"| UI
    UI <--> DASH
    DASH <-->|"/api/agents/* · runs/*<br/>Bearer token + Idempotency-Key"| AGENTS
    DASH <-->|"/api/k6/agents/* · k6/runs/*<br/>Bearer token"| K6AGENTS
    DASH -->|"SQL"| PG
    DASH -.->|"every admit / spend / fetch"| ADM
    ADM -.-> SEAL
    DASH -.->|"every egress"| DLP
    DLP -.-> SEAL
    DASH -.->|"licence check"| VAULT
    AGENTS -->|"navigation, full page load,<br/>per-target h2/h3 + profile"| SITES
    K6AGENTS -->|"k6 run (VUs, duration)<br/>HTTP/1.1 · HTTP/2 · HTTP/3"| SITES
    AGENTS -.->|"protocol, bytes, TCP open,<br/>tls version, web vitals"| DASH
    K6AGENTS -.->|"p50/p95/p99, error_rate,<br/>RPS, data_received"| DASH
    DASH -->|"Server-Sent Events"| OP

    classDef block fill:#fff,stroke:#000,stroke-width:1px
    classDef ext  fill:#f5f5f7,stroke:#000,stroke-width:1px
    classDef ztp  fill:#0F1A2E,stroke:#00D8FC,color:#FCFCFC,stroke-width:2px
    class DASH,AGENTS,K6AGENTS,PG block
    class SITES,UI ext
    class ADM,SEAL,DLP,VAULT ztp

ZTP-prem layer (v3.7.0): every dashboard action that touches a Tier-classified workload, spends a licence note, or issues outbound network egress flows through the admission webhook / licence vault / DLP wrapper respectively, and every decision is recorded into the sealed audit hash-chain (WORM, tamper-evident). See SECURITY_ZTP_PREM.md and the ZTP-prem ADR family.

Component breakdown

browser-engine agent (agent/)

  • Stateless Linux + headless Chromium worker, distributed via the official mcr.microsoft.com/playwright base image.
  • Per-cycle lifecycle:
  • register (idempotent, retried with exp backoff + jitter);
  • fetchConfig (every POLL_INTERVAL_MS);
  • for each tick, pick the target whose next-run-at is earliest (bursting targets jump the queue);
  • open a fresh BrowserContext (no cookies / cache / IndexedDB reuse), navigate, parse subresources, capture protocol / tcp_sockets_open / mbps_avg;
  • close the context, wipe the on-disk profile, schedule the next run for that target.
  • Browser pool keyed by (httpVersion, host) so HTTP/3 forcing (--enable-quic --origin-to-force-quic-on=…) recycles minimally.
  • Sentinel relaunches Chromium after N consecutive browser-level failures; per-target circuit breaker pauses misbehaving targets.
  • CYCLE_CONCURRENCY runs N parallel cycles inside one Chromium for per-replica throughput.
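One cycle condensed into a sketch (Playwright API; the (httpVersion, host) pool, the circuit breaker, and resource capture beyond protocol/load time are omitted):

import { chromium } from 'playwright';

async function runCycle(url: string) {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext(); // fresh: no cookies/cache/IndexedDB
  const page = await context.newPage();
  const started = Date.now();
  const response = await page.goto(url, { waitUntil: 'load' });
  // nextHopProtocol reports the negotiated protocol (h2, h3, http/1.1)
  const protocol = await page.evaluate(() =>
    (performance.getEntriesByType('navigation')[0] as PerformanceNavigationTiming).nextHopProtocol,
  );
  const result = { status: response?.status(), protocol, loadMs: Date.now() - started };
  await context.close(); // discard all per-context state before the next run
  await browser.close();
  return result;
}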

synthetic-load agent (k6-agent/)

  • Stateless TypeScript + Node.js process that wraps the official grafana/k6 binary (copied from the grafana/k6:0.56.0 base image at build time — no Alpine package dependency).
  • Per-cycle lifecycle:
  • register with the dashboard (idempotent upsert);
  • heartbeat every HEARTBEAT_INTERVAL_MS;
  • poll /api/k6/agents/{id}/config every POLL_INTERVAL_MS;
  • for each enabled target whose next-run-at is due, POST /api/k6/runs/start, write a k6 script to /tmp, spawn k6 run --summary-export /tmp/summary.json;
  • parse the summary JSON and POST /api/k6/runs/complete with p50/p95/p99, error_rate, rps, data_received.
  • /healthz HTTP probe on port 9090.
  • K8s-ready: readOnlyRootFilesystem: true; /tmp and /home/k6agent are emptyDir: medium: Memory volumes injected at pod creation.
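The spawn-and-parse step as a sketch; the script body and helper name are illustrative, and the metric keys follow the k6 0.56 --summary-export format:

import { spawn } from 'node:child_process';
import { readFile, writeFile } from 'node:fs/promises';

async function runK6(targetUrl: string, vus: number, duration: string) {
  const script = `
import http from 'k6/http';
export const options = {
  vus: ${vus},
  duration: '${duration}',
  // p(99) must be requested explicitly to appear in the summary export
  summaryTrendStats: ['med', 'p(95)', 'p(99)'],
};
export default function () { http.get('${targetUrl}'); }
`;
  await writeFile('/tmp/script.js', script);

  await new Promise<void>((resolve, reject) => {
    const k6 = spawn('k6', ['run', '--summary-export', '/tmp/summary.json', '/tmp/script.js']);
    k6.on('error', reject);
    k6.on('exit', (code) => (code === 0 ? resolve() : reject(new Error(`k6 exited ${code}`))));
  });

  const summary = JSON.parse(await readFile('/tmp/summary.json', 'utf8'));
  const dur = summary.metrics.http_req_duration;
  return {
    p50: dur.med,
    p95: dur['p(95)'],
    p99: dur['p(99)'],
    error_rate: summary.metrics.http_req_failed?.value ?? 0,
    rps: summary.metrics.http_reqs.rate,
    data_received: summary.metrics.data_received.count,
  };
}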

Dashboard (dashboard/)

  • Next.js 15 App Router (TypeScript). All API routes are force-dynamic and emit cache-control: no-store.
  • Drizzle ORM over postgres-js. Lazy initialisation via a Proxy so next build works without a live database (sketched after this list).
  • Authentication: HMAC-signed HttpOnly cookie on /login; legacy Authorization: Basic header is still accepted for programmatic callers. Constant-time comparison; rate-limited token bucket per client IP.
  • Server-Sent Events (/api/dashboard/stream) push a tick event whenever runs.{started_at,finished_at,total} changes; React Query invalidates caches on every tick. 30-second polling fallback for tunnels that strip event-stream framing.
  • Prometheus endpoint at /api/metrics, exposing both in-memory counters (runs_completed_total, runs_started_total, runs_by_protocol_total, reaper_*) and live gauges (agents_live, runs_per_minute, cycle_avg_ms).
  • Dark mode via class strategy; i18n PT/EN/ES with persistence to localStorage.
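The lazy-Proxy pattern from the list above, sketched (DATABASE_URL handling and schema wiring are simplified):

import { drizzle } from 'drizzle-orm/postgres-js';
import postgres from 'postgres';

// No connection is opened at import time, so `next build` (which loads route
// modules) succeeds without a live database; the first real query connects.
type Db = ReturnType<typeof drizzle>;
let real: Db | undefined;

export const db = new Proxy({} as Db, {
  get(_target, prop) {
    real ??= drizzle(postgres(process.env.DATABASE_URL!));
    const value = Reflect.get(real, prop, real);
    return typeof value === 'function' ? value.bind(real) : value;
  },
});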

Database

  • Postgres 16, tunable via env (compose default: shared_buffers=256MB, work_mem=8MB, max_connections=200).
  • Twenty-three idempotent migrations (0000–0023) covering the full schema evolution. Optional migration 0004 enables TimescaleDB hypertable + continuous aggregate when the extension is available. Migration 0014 adds the synthetic-load engine tables; migrations 0018–0023 extend the schema for clone-persona slots (0018), license acceptance (0019), the test-plan catalog (0020), internal revision tracking (0021), and DUT API device registry + encrypted-credential storage + sanitized snapshot tables (0023).
  • audit_log records every admin mutation (actor, source IP, before / after JSON). idempotency deduplicates retried POSTs and is GC'd by the reaper after 30 minutes.
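A minimal in-memory illustration of the dedup contract (the shipped table lives in Postgres and is reaper-GC'd as noted above; withIdempotency is a hypothetical helper name):

const seen = new Map<string, { response: unknown; at: number }>();
const TTL_MS = 30 * 60 * 1000; // matches the reaper's 30-minute GC window

function withIdempotency<T>(key: string, handler: () => T): T {
  const hit = seen.get(key);
  if (hit && Date.now() - hit.at < TTL_MS) {
    return hit.response as T; // replayed POST gets the stored response back
  }
  const response = handler();
  seen.set(key, { response, at: Date.now() });
  return response;
}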

Local-scaler (docker-compose.scaler.yml, opt-in via .env)

  • Mounts /var/run/docker.sock into the dashboard container.
  • The dashboard talks to the Docker Engine API over the unix socket and reconciles cluster_config.desired_agent_count with running containers labelled com.docker.compose.service=agent.
  • Runs the dashboard as the dashboard user joined to the host's docker group via group_add. scripts/init-env.sh detects the right DOCKER_GID for the host.

Kubernetes manifests (k8s/)

  • Namespace, Postgres StatefulSet, dashboard Deployment (with topologySpreadConstraints and PodDisruptionBudget), browser-engine agent Deployment plus HorizontalPodAutoscaler min=1, max=300 based on CPU/memory.
  • synthetic-load agent Deployment plus HPA min=0, max=1000 (scale-to-zero requires KEDA or Kubernetes ≥ 1.32); PDB maxUnavailable: 10%. Volumes: emptyDir: medium: Memory for /tmp (256 Mi) and /home/k6agent (64 Mi) — readOnlyRootFilesystem: true.
  • Two NetworkPolicy resources — one for browser engine, one for synthetic-load engine — each with default-deny ingress and an explicit egress allow-list (DNS, dashboard, public TCP 80/443, UDP 443 for QUIC; private CIDRs blocked).
  • Daily Postgres pg_dump CronJob writing to a 50 GiB PVC, 14-day retention.
  • PrometheusRule with five alerts: no agents live, agents below desired, error rate > 20 %, slow cycles, reaper churn.

DUT API engine (dashboard/src/lib/dut-api/)

A thin vendor-adapter layer that captures sanitized config + identity from the Device Under Test and persists hashed snapshots for every test run. Four adapters ship today, each implementing a common DutAdapter interface:

Adapter file Vendor Transport Captures
cisco-ftd.ts Cisco Firepower (FTD via cdFMC / FMC) REST Model, serial, version, decrypt-policy state, sanitized running-config
cisco-nexus.ts Cisco Nexus 9000 NX-API (HTTPS) Model, serial, version, interface counters, port-channel state
cisco-ucs-cimc.ts Cisco UCS chassis Redfish (HTTPS) Hardware inventory, PSU, fans, temp, BIOS
fortinet-fortigate.ts FortiGate REST (FortiOS) Model, serial, version, sanitized config, session-table summary

encryption.ts wraps device credentials with AES-GCM at rest. poller.ts runs a 60 s liveness loop; failures surface on /admin/dut-api. snapshot-storage.ts canonicalizes the snapshot JSON, hashes it (SHA-256), and persists both blob + hash so the report-builder can embed Annex B/C/D and prove the device state at run start. registry.ts is the device-registry CRUD with integration into audit_log (every register / update / delete is recorded).
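An assumed shape for that contract plus the canonicalize-then-hash step (field and method names are illustrative, not the shipped interface):

import { createHash } from 'node:crypto';

interface DutSnapshot {
  model: string;
  serial: string;
  version: string;
  sanitizedConfig: Record<string, unknown>; // secrets already stripped by the adapter
}

interface DutAdapter {
  vendor: string;                  // e.g. 'cisco-ftd', 'fortinet-fortigate'
  probe(): Promise<boolean>;       // called by the 60 s liveness loop
  captureSnapshot(): Promise<DutSnapshot>;
}

// Stable key order so the same device state always yields the same SHA-256
// for the report annexes.
function canonicalize(value: unknown): string {
  if (value === null || typeof value !== 'object') return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(',')}]`;
  const entries = Object.entries(value as Record<string, unknown>)
    .sort(([a], [b]) => a.localeCompare(b))
    .map(([k, v]) => `${JSON.stringify(k)}:${canonicalize(v)}`);
  return `{${entries.join(',')}}`;
}

function snapshotHash(snapshot: DutSnapshot): string {
  return createHash('sha256').update(canonicalize(snapshot)).digest('hex');
}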

Test-plan + test-run + pre-flight engines

Three small modules, kept independent:

  • Test-plan catalog: platform/test-plans/catalog.yaml is loaded at Dashboard startup into test_plans (0020). Each plan has an immutable identifier (e.g. CAP-FIND-KNEE-30M) plus phases[] and requirements{}.
  • Test-run executor: dashboard/src/lib/test-runs/ runs through phases: preflight gate → snapshot DUTs → freeze plan (planSnapshotSha256) → fan-out runs to agents → collect results → write report.json → generate the print-styled HTML report.
  • Pre-flight engine: dashboard/src/lib/preflight/ runs a 5-check catalog (subnet conflict OOBI ↔ ISP ↔ persona VLANs, NGFW reachability + decrypt mode matches plan, persona PKI freshness, NTP relay clock skew within tolerance, DUT API auth + snapshot succeeds). Failures block run start; bypass writes to audit_log and shows on the report cover (a sketch of the gate follows this list).
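A sketch of the gate around a check catalog (the check ids and result shape are illustrative):

interface PreflightCheck {
  id: string;                                     // e.g. 'subnet-conflict', 'pki-freshness'
  run(): Promise<{ ok: boolean; detail: string }>;
}

async function preflightGate(checks: PreflightCheck[], bypass: boolean) {
  const results = await Promise.all(
    checks.map(async (c) => ({ id: c.id, ...(await c.run()) })),
  );
  const failures = results.filter((r) => !r.ok);
  if (failures.length && !bypass) {
    throw new Error(`pre-flight blocked: ${failures.map((f) => f.id).join(', ')}`);
  }
  // On bypass the failures still go to audit_log and the report cover.
  return { results, bypassed: bypass && failures.length > 0 };
}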

Multi-node overlay (overlays/multi-node/)

For deployments across multiple physical servers, the Kustomize overlay at overlays/multi-node/ applies nodeSelector patches to each Deployment/StatefulSet so each workload tier is pinned to its dedicated server:

Patch file Target workload Pinned to
patch-playwright-nodesel.yaml web-agent Deployment role=playwright (UCS-2)
patch-k6-nodesel.yaml k6-agent Deployment role=k6 (UCS-3)
patch-infra-nodesel.yaml dashboard, postgres, pgbouncer, cloner role=infra (UCS-4)
patch-cloner-nodesel.yaml cloner Deployment role=infra (UCS-4)

The 30 persona webservers (Caddy) — 20 Synthetic Personas (VLANs 101–120) plus 10 Cloned Persona slots (VLANs 200–209) — are pinned to UCS-1 (role=ngfw-dut) via the DUT overlay. The tuning DaemonSet is patched post-deploy to use dut-data-plane=true so it runs on UCS-1/2/3 but not UCS-4.

Public Website Cloner (k8s/80-83-cloner-*.yaml)

The cloner is an optional Deployment (1 replica) that downloads real public websites using a headful browser engine with stealth, and serves the static mirror at http://clone-serve.web-agents.svc.cluster.local:8081/{personaName}/.

Dual-NIC architecture — VLAN 40:

eth0 (OOBI / K8s default)   ──►  Dashboard API, heartbeat, Prometheus scrape
net1 (VLAN 40 / Multus)     ──►  Internet via eth1.40 — HTTP/2, HTTP/3, DNS, ICMP
                                  DHCP IP from upstream ISP/lab router
                                  NOT routed through the NGFW
Interface VLAN Purpose
eth0 — OOBI: K8s control plane, Dashboard, CoreDNS
net1 40 ISP: direct internet egress (no NGFW)

The cloner-isp NAD attaches a macvlan on eth1.40 (VLAN 40 on the Nexus 9000 trunk). An initContainer configures policy routing: all non-RFC1918 traffic is fwmarked to routing table 100 (default via ISP gateway on net1). DNS is forced to 8.8.8.8 / 208.67.222.222 regardless of DHCP, plus 10.96.0.10 (k3s CoreDNS) for internal names.

HTTPS and public CA certificates: The cloner accesses public HTTPS sites directly (no NGFW interception). Chromium validates certificates against the Debian ca-certificates bundle (Mozilla NSS roots). No special configuration is needed for standard public sites. The NGFW CA is not injected into the cloner — it is not needed because the cloner bypasses the NGFW entirely via VLAN 40.

Internet health monitor: pings 8.8.8.8, 1.1.1.1, and the DHCP gateway every 10 s. Exposes Prometheus metrics: cloner_internet_up{target}, cloner_internet_any_up, cloner_gateway_up{gateway}, cloner_ping_rtt_ms{target}.
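A sketch of how those gauges could be registered with prom-client (metric names from the list above; the probe scheduling and ICMP plumbing are assumed to live elsewhere in the cloner):

import { Gauge, Registry } from 'prom-client';

const registry = new Registry();

const internetUp = new Gauge({
  name: 'cloner_internet_up', help: 'per-target ping success (1/0)',
  labelNames: ['target'], registers: [registry],
});
const pingRtt = new Gauge({
  name: 'cloner_ping_rtt_ms', help: 'last ping RTT in milliseconds',
  labelNames: ['target'], registers: [registry],
});

function recordProbe(target: string, reachable: boolean, rttMs: number) {
  internetUp.set({ target }, reachable ? 1 : 0);
  if (reachable) pingRtt.set({ target }, rttMs);
}

// registry.metrics() returns the text exposition for the scrape endpoint.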

Storage: 20 Gi local-path PVC (cloned-sites) at /mnt/cloned/{personaName}/. Served by the clone-serve ClusterIP Service on port 8081.

Multi-node: pinned to role=infra (UCS-4) — same node as Dashboard. VLAN 40 must be allowed on the trunk port from Nexus 9000 to UCS-4.

See CLONER_OPERATIONS.md for deployment, job management, Grafana monitoring, and full troubleshooting guide.

Apply with:

kubectl apply -k overlays/multi-node/

See UBUNTU_K3S_MULTINODE_QUICKSTART_DEPLOY.en.md for the full guide.

Data flow per cycle

browser-engine agent

sequenceDiagram
    participant A as Agent (Chromium)
    participant D as Dashboard
    participant P as Postgres

    A->>D: POST /api/agents/register (Bearer)
    D->>P: UPSERT agents
    A->>D: GET /api/agents/{id}/config
    D->>P: SELECT cluster_config + active targets
    loop every cycle
        A->>D: POST /api/runs/start (Idempotency-Key)
        D->>P: INSERT runs
        D-->>A: { runId }
        A->>A: navigate + capture metrics
        A->>D: POST /api/runs/complete (Idempotency-Key=runId)
        D->>P: UPDATE runs · INSERT run_resources
        D->>P: UPSERT metrics_buckets (60s)
    end
    A-->>D: POST /api/logs (heartbeat + cycle events)
    D-->>P: INSERT logs

synthetic-load agent

sequenceDiagram
    participant K as synthetic-load agent (Node.js)
    participant D as Dashboard
    participant P as Postgres
    participant B as k6 binary

    K->>D: POST /api/k6/agents/register (Bearer)
    D->>P: UPSERT k6_agents
    K->>D: GET /api/k6/agents/{id}/config
    D->>P: SELECT k6_targets (enabled)
    loop every target interval
        K->>D: POST /api/k6/runs/start
        D->>P: INSERT k6_runs
        D-->>K: { runId }
        K->>B: spawn k6 run --summary-export /tmp/summary.json
        B-->>K: exit 0 + JSON metrics
        K->>D: POST /api/k6/runs/complete (p50/p95/p99, error_rate, rps)
        D->>P: UPDATE k6_runs
    end
    K->>D: POST /api/k6/agents/{id}/heartbeat (every 30 s)
    D->>P: UPDATE k6_agents.last_seen_at

DUT test-bed topology (physical NGFW mode)

When scripts/stack-up.sh up-dut is used, the cluster is extended with a physical NGFW Device Under Test (DUT) connected over 802.1q VLAN trunks. The NGFW is the L3 default gateway for all three test-plane VLANs.

Physical + virtual topology

flowchart TB
    %% ═══════════════════════════════════════════════════════════════════════
    %% CISCO UCS — Ubuntu Linux host
    %% ═══════════════════════════════════════════════════════════════════════
    subgraph ucs["Cisco UCS — Ubuntu Linux host"]
        direction TB

        ETH1(["eth1\n802.1q trunk NIC"])

        subgraph vlans["Kernel: 802.1q VLAN subinterfaces"]
            direction LR
            V20["eth1.20\nVLAN 20"]
            V30["eth1.30\nVLAN 30"]
            V40["eth1.40\nVLAN 40\nCloner ISP"]
            V99["eth1.99\nVLAN 99 mgmt"]
        end

        subgraph nets["Docker networks (Docker mode) / K8s NADs (K8s mode)"]
            direction LR
            MPW["ai_forse_dut_pw\nmacvlan · 172.16.0.0/16\ngw 172.16.0.1"]
            MK6["ai_forse_dut_k6\nmacvlan · 172.17.0.0/16\ngw 172.17.0.1"]
            MMGMT["ai_forse_mgmt\nmacvlan · 10.254.254.0/24\ngw 10.254.254.1"]
            OOBI(["ai_forse_oobi\nbridge --internal\ncontrol plane"])
        end

        subgraph ctrs_data["Data-plane containers"]
            direction LR
            PW["browser engine\nagents\n:8080/healthz"]
            K6A["synthetic-load engine\nagents\n:9090/healthz"]
        end

        subgraph ctrs_obs["Observability containers"]
            direction LR
            SNMPEXP["snmp_exporter\n:9116"]
            PROM["Prometheus\n:9090"]
            GRAF["Grafana\n:3001"]
        end

        subgraph ctrs_ctrl["Control-plane containers"]
            direction LR
            DASH["Dashboard\n:3000"]
            PG["Postgres\n:5432"]
        end

        %% kernel bindings
        ETH1 --- V20 & V30 & V40 & V99

        %% macvlan ↔ container bindings
        V20 --> MPW  --> PW
        V30 --> MK6  --> K6A
        V99 --> MMGMT --> SNMPEXP

        %% oobi control plane (all containers share this bridge)
        OOBI --- PW
        OOBI --- K6A
        OOBI --- SNMPEXP
        OOBI --- PROM
        OOBI --- GRAF
        OOBI --- DASH
        OOBI --- PG
    end

    %% ═══════════════════════════════════════════════════════════════════════
    %% CISCO NEXUS 9000
    %% ═══════════════════════════════════════════════════════════════════════
    subgraph nexus["Cisco Nexus 9000 — NX-OS 10.5(4)M"]
        direction TB
        NTRUNK["Trunk port\nallowed VLAN 20,30,99,101-120,200-209"]
        NA20["Access port\nVLAN 20"]
        NA30["Access port\nVLAN 30"]
        NMGMT0(["MGMT0\n10.254.254.2"])
    end

    %% ═══════════════════════════════════════════════════════════════════════
    %% NGFW DUT
    %% ═══════════════════════════════════════════════════════════════════════
    subgraph dut["NGFW DUT — Device Under Test"]
        direction TB
        NI2["int-2 · VLAN 20\n172.16.0.1/16\nNet_Prod_browser engine"]
        NI3["int-3 · VLAN 30\n172.17.0.1/16\nNet_prod_K6_Agents"]
        NMGT(["Mgmt interface\n10.254.254.3/24"])
        DTLS{{"TLS decryption engine\nintercept · inspect · re-sign"}}
        NI2 --- DTLS
        NI3 --- DTLS
    end

    %% ═══════════════════════════════════════════════════════════════════════
    %% PHYSICAL CABLING
    %% ═══════════════════════════════════════════════════════════════════════
    ETH1     <-->|"802.1q trunk"| NTRUNK
    NA20     <-->|"L2 · VLAN 20"| NI2
    NA30     <-->|"L2 · VLAN 30"| NI3
    NMGMT0   <-->|"OOB mgmt cable"| NMGT

    %% ═══════════════════════════════════════════════════════════════════════
    %% DATA PLANE — TLS test traffic path
    %%   browser-engine/synthetic-load  →  macvlan  →  Nexus  →  NGFW (decrypt)
    %%                 →  Nexus  →  macvlan  →  Caddy
    %% ═══════════════════════════════════════════════════════════════════════
    PW  -->|"default route\n172.16.0.1"| NI2
    K6A -->|"default route\n172.17.0.1"| NI3

    %% ═══════════════════════════════════════════════════════════════════════
    %% CONTROL PLANE — stays on ai_forse_oobi, never leaves host
    %% ═══════════════════════════════════════════════════════════════════════
    DASH <-->|"Bearer · SSE\nIdempotency-Key"| PW
    DASH <-->|"Bearer · SSE\nIdempotency-Key"| K6A
    DASH --- PG

    %% ═══════════════════════════════════════════════════════════════════════
    %% OBSERVABILITY — Prometheus → snmp_exporter → physical devices
    %% ═══════════════════════════════════════════════════════════════════════
    PROM -->|"HTTP :9116\ncisco_nexus\n→ 10.254.254.2"| SNMPEXP
    PROM -->|"HTTP :9116\nSNMP_DUT_MODULE\n→ 10.254.254.3"| SNMPEXP
    SNMPEXP -->|"SNMP UDP/161"| NMGMT0
    SNMPEXP -->|"SNMP UDP/161"| NMGT
    PROM -->|"HTTP :9100\nnode_exporter\nfile_sd"| ETH1
    GRAF -->|"PromQL"| PROM

    %% ═══════════════════════════════════════════════════════════════════════
    %% STYLES
    %% ═══════════════════════════════════════════════════════════════════════
    classDef mac    fill:#dbeafe,stroke:#2563eb,stroke-width:1px
    classDef oobi   fill:#fce7f3,stroke:#db2777,stroke-width:1px
    classDef ctr    fill:#f0fdf4,stroke:#16a34a,stroke-width:1px
    classDef sw     fill:#f3f4f6,stroke:#374151,stroke-width:2px
    classDef vlan   fill:#fef9c3,stroke:#b45309,stroke-width:1px
    classDef trunk  fill:#fff7ed,stroke:#c2410c,stroke-width:1px

    class MPW,MK6,MMGMT mac
    class OOBI oobi
    class PW,K6A,SNMPEXP,PROM,GRAF,DASH,PG ctr
    class NTRUNK,NA20,NA30,NMGMT0,NI2,NI3,NMGT,DTLS sw
    class V20,V30,V40,V99 vlan
    class ETH1 trunk

Traffic paths summary

Path Route
Data plane (synthetic-load engine example, K8s mode) synthetic-load engine pod (net1, VLAN 30) → trunk → Nexus (VLAN 30) → NGFW → TLS decrypt → NGFW outside → Nexus (VLAN 101-120 or 200-209) → Persona Caddy pod (net1)
Data plane (browser engine, K8s mode) browser engine pod (net1, VLAN 20) → trunk → Nexus (VLAN 20) → NGFW → TLS decrypt → NGFW outside → Nexus (VLAN 101-120 or 200-209) → Persona/Cloned Persona Caddy pod (net1)
Cloner ISP Cloner (net1) → eth1.40 (VLAN 40) → trunk → Nexus → upstream router → Internet (direct, no NGFW)
Control plane Any container → ai_forse_oobi (bridge, --internal) → Dashboard — never leaves the host
SNMP Prometheus → snmp_exporter :9116 (oobi) → ai_forse_mgmt (macvlan) → eth1.99 → trunk → Nexus MGMT0 (10.254.254.2) / NGFW mgmt (10.254.254.3)
node_exporter Prometheus → 10.254.254.x:9100 (Ubuntu hosts, file_sd) → management VLAN

DUT network segments

NGFW security zones locked 2026-05-09 — the NGFW has 8 interfaces (or sub-interfaces, depending on hardware) split across 3 zones:

VLAN Interface host NGFW zone Subnet Purpose
20 eth1.20 / dut-pw NAD TRUST / INSIDE 172.16.0.0/16 browser-engine agents (MÓDULO PW.Art) → through NGFW
30 eth1.30 / dut-k6 NAD TRUST / INSIDE 172.17.0.0/16 synthetic-load agents (MÓDULO K6.Art) → through NGFW
2900 eth1.2900 (NAD pending) TRUST / INSIDE 172.19.0.0/16 (synthetic spoof pool: 10.96.0.0/11 + fd00:1234::/32) MAC + ARP/NDP table stress (MÓDULO GO.Art) → NGFW
2901 eth1.2901 (NAD pending) TRUST / INSIDE 172.20.0.0/16 MÓDULO DoYour.Art (Scapy/Go/PCAP sandbox) → NGFW
2902 eth1.2902 (NAD pending) TRUST / INSIDE 172.21.0.0/16 MÓDULO KALI.Art (Kali Linux pen-test pod) → NGFW
2001 eth1.2001 / edge-link-vyos NAD UNTRUST / OUTSIDE 200.130.0.0/30 Edge link NGFW outside ↔ VyOS-ISP (NGFW=.1, VyOS=.2); the NGFW's default route
1982 eth1.1982 (NAD pending) UNTRUST / OUTSIDE 200.130.0.8/30 OSPF peer link — NGFW=.9, VyOS-OSPF-PEER=.10
2809 eth1.2809 / bgp-peer-vlan2809 NAD UNTRUST / OUTSIDE 200.130.0.12/30 + 2001:db8:0:2809::/126 ⚠️ BGP peer link (IP-migrated v4.3 — IPv6 still doc-prefix until v4.5+)
4338 eth1.4338 (NAD pending) UNTRUST / OUTSIDE 200.130.0.4/30 MÓDULO SDWAN/CoR-N.Art peer link (local mode) — NGFW=.5, VyOS peer=.6. Legacy term "VPN-REMOTE" refers to this same data-plane leg
99 eth1.99 / dut-mgmt + oobi-mgmt NADs MGMT 10.254.254.0/24 SNMP probe band (host-local .10-.49) + OOBI orchestration (static .50-.82 MÓDULOs)

VLANs that do NOT terminate on the NGFW:

VLAN Reason
40 (cloner-isp NAD, DHCP from upstream) Bypass — cloner egress goes directly to public Internet, never through NGFW
101–120 (Synthetic Persona NADs) Personas live BEHIND the VyOS-ISP pod (one country /24 per VLAN); NGFW only sees the single edge link (VLAN 2001) — VyOS-ISP routes onward to each country
200–209 (Cloned Persona slots, legacy) Being deprecated — cloned personas migrate to country /24s alongside Synthetic ones

Web test traffic flow (corrected for v4.3)

[ Agent pod (browser-engine/synthetic-load) ]
        │  VLAN 20 or 30 (TRUST/INSIDE)
        ▼
[ NGFW — TLS decrypt LEG 1 ]
        │  outside interface, VLAN 2001 (UNTRUST/OUTSIDE)
        │  default route → 200.130.0.2 (VyOS-ISP)
        ▼
[ VyOS-ISP — country-LAN router ]
        │  one outbound interface per country LAN
        │  routes to <NREN-/24>.10..14 (5 personas per country)
        ▼
[ Persona Caddy pod — TLS LEG 2 ]
        e.g. shop.us.persona.tlsstress.art = 198.32.10.10

Because personas never talk to the NGFW directly, three consequences follow:

  1. The NGFW interface inventory is 8 interfaces total (3 INSIDE + 4 OUTSIDE + 1 MGMT) — not 30+ as the legacy tables suggested.
  2. Persona NADs (101–120) terminate on a per-country dummy interface inside the VyOS-ISP pod, not on eth1.10X host sub-interfaces facing the NGFW.
  3. The legacy "NGFW gateway = 10.1.x.1" wording in older docs is incorrect for v4.3 — the gateway is <NREN-/24>.1 on the VyOS-ISP pod. Cleanup queued separately (Personas v4.3 IP migration option B).

SNMP monitoring — supported vendors

The snmp_exporter container is dual-homed: OOBI (Prometheus scrapes it) and ai_forse_mgmt macvlan (physical reach to devices).

SNMP module Device Key metrics
cisco_nexus Cisco Nexus 9000 CPU/mem/temp/fan/PSU/CRC/queue drops
cisco_ftd Cisco Firepower FTD 7.x sessions, CPS, HW crypto engine, interfaces
cisco_iosxe Cisco C8475-G2 IOS-XE CPU/mem/temp/fan/PSU, queue drops
cisco_meraki Meraki MX IF-MIB only (Dashboard API needed for more)
fortinet_fortigate FortiGate FG200F/FG600F fgSysCpuUsage, fgSysSesCount, fgSysSesRate1
palo_alto PA-series PAN-OS panSessionActive, panSysCpuUtilization, throughput
checkpoint Check Point Quantum/R81 fwNumConn, fwDropped, svnPerfCPU
huawei_ngfw Huawei USG/NGFW hwEntityCPUUsage, hwEntityMemUsage, temperature
generic_ngfw Any vendor Standard RFC MIBs (IF-MIB + HOST-RESOURCES-MIB)

Set SNMP_DUT_MODULE in .env to select the module. The Nexus 9000 always uses cisco_nexus.

Ubuntu host monitoring (Cisco UCS)

Physical Ubuntu hosts (Cisco UCS servers) are monitored via node_exporter running on each host. Prometheus scrapes them using file-based service discovery (observability/prometheus/targets/ubuntu-hosts.yml) — edit the file to add/remove hosts; Prometheus reloads in 30 s, no restart required.

For Cisco UCS chassis-level sensors (CIMC temperature, fans, PSU), enable SNMP on CIMC and add the CIMC management IP as a separate generic_ngfw scrape job.

DUT Grafana dashboards

Dashboard UID Description
Nexus 9000 dut-nexus9000 Switch health: CPU, memory, temp, fans, PSU, CRC/FCS, queue drops, optical sensors
NGFW DUT dut-ngfw DUT: throughput per VLAN, active sessions, CPS, HW crypto, CPU, memory, temp, CRC
Ubuntu Hosts (UCS) dut-ubuntu-hosts OS-level: CPU, load, memory, disk I/O, filesystem, network, processes, Docker
UCS CIMC — Chassis dut-ucs-cimc-chassis UCS chassis: PSU, fans, temp, BIOS — Redfish-sourced via the DUT API adapter
UCS CIMC — Hardware dut-ucs-cimc-hardware UCS blade: CPU, memory, disk, network adapters, IOC controllers
TLS Decrypt Mode decrypt-mode Issuer-cert ground-truth probe (ACTIVE / BYPASS), TLSDecryptModeChanged alert state
Cloner health cloner ISP up, ping 8.8.8.8 / 1.1.1.1, gateway ping, RTT timeseries

Security posture (high level)

For the complete 12-layer security posture (Zero-Trust-on-Premises), see SECURITY_ZTP_PREM.md and the § ZTP-prem section above. This section covers the operational baseline that applies to every deployment regardless of ZTP-prem layer-flip state.

  • Containers: non-root, dropped Linux capabilities, RuntimeDefault seccomp, read-only root filesystem (where possible).
  • Egress NetworkPolicy blocks RFC1918 + link-local CIDRs from the agent fleet by default.
  • SECURITY.md defines reporting flow + SLA; audit_log records every config mutation; gitleaks blocks committed secrets in CI; trivy scans both images on every PR; CodeQL static analysis weekly.
  • Container images are pushed multi-arch and signed with Cosign keyless via the GitHub OIDC provider; SBOM (SPDX) attached as a registry attestation on every release tag.
  • Pre-flight check engine runs a 5-check catalog before every test-run start: subnet conflict (OOBI ↔ ISP ↔ persona VLANs), NGFW reachability + decrypt mode matches plan, persona PKI freshness, NTP relay clock skew within tolerance, DUT API auth + snapshot succeeds. Bypass is logged in audit_log and printed on the Test Run Report cover.
  • DUT API credentials encrypted at rest with AES-GCM (encryption.ts); sanitized snapshots hashed (SHA-256) and embedded in the Test Run Report Annex B/C/D so a reviewer can verify what device state held during the run (AES-GCM wrap sketched at the end of this section).
  • License acceptance modal records every first-load session in audit_log against the LicenseRef-PolyForm-Noncommercial-1.0.0-with-Appendix-A identifier; deployments can revoke session acceptance centrally if an audience-policy violation is suspected.
  • IP protection — all forensic identifiers (cert fingerprints, deployment hashes, asset signatures, decrypt-mode timeline) live in a separate private/forensic repository with the project owner as sole collaborator; the public repo never contains identity-binding data. Rationale: GitHub does not support per-branch ACLs, so any in-tree forensic branch would leak to every collaborator with pull on the public repo. See docs/IP_PROTECTION.md.
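A minimal AES-256-GCM wrap/unwrap in the spirit of encryption.ts; the key sourcing, rotation policy, and blob layout here are assumptions, not the shipped format:

import { createCipheriv, createDecipheriv, randomBytes } from 'node:crypto';

function wrapCredential(plaintext: string, key: Buffer): string {
  const iv = randomBytes(12);                        // 96-bit nonce, unique per wrap
  const cipher = createCipheriv('aes-256-gcm', key, iv);
  const ct = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  const tag = cipher.getAuthTag();                   // tamper-evidence for the blob
  return Buffer.concat([iv, tag, ct]).toString('base64');
}

function unwrapCredential(blob: string, key: Buffer): string {
  const raw = Buffer.from(blob, 'base64');
  const iv = raw.subarray(0, 12);
  const tag = raw.subarray(12, 28);
  const ct = raw.subarray(28);
  const decipher = createDecipheriv('aes-256-gcm', key, iv);
  decipher.setAuthTag(tag);                          // throws on any tampering
  return Buffer.concat([decipher.update(ct), decipher.final()]).toString('utf8');
}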