TLSStress.Art — Architecture overview¶
Concise tour of TLSStress.Art. Pairs with the Architecture Decision Records.
Scope status: SCOPE FREEZE locked 2026-05-10 evening. Vision phase closed; execution phase active. 37 MÓDULOs, 7 Test Kinds, 25 patent claims (17 original + 8 new ZTP-prem #18–#25), ~133 PRs across Phase 1 (Materialization).
Release status: v3.7.0 (2026-05-12) — Zero-Trust-on-Premises closed at 12 of 12 layers operationally. See
docs/SECURITY_ZTP_PREM.md for the canonical security reference page and the § ZTP-prem section below for how it integrates with the rest of the architecture.
MÓDULO X.Art — canonical component naming (locked 2026-05-09, expanded 2026-05-10)¶
The platform exposes its components to operators and investors as
MÓDULO X.Art (with the .Art suffix reinforcing brand identity).
Internal code keeps technical naming (vyos-*, web-agent, etc.) to
avoid a cross-cutting refactor; UI/docs/diagrams/alerts use the
customer-facing labels via a `moduleLabel(internalName)` helper (sketched below).
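A minimal sketch of what such a helper could look like — the map entries are drawn from the table below, and the shipped implementation may differ:

```ts
// Hypothetical sketch of moduleLabel(internalName): a static map from
// internal manifest names to customer-facing MÓDULO X.Art labels.
const MODULE_LABELS: Record<string, string> = {
  "vyos-isp-router": "MÓDULO ISP.Art",
  "vyos-ospf-peer": "MÓDULO OSPF.Art",
  "web-agent": "MÓDULO PW.Art",
  "k6-agent": "MÓDULO K6.Art",
};

export function moduleLabel(internalName: string): string {
  // Parameterised families (vyos-bgp-peer-1..4, vyos-vpn-remote-1..10)
  // are resolved by pattern so each instance gets its own index.
  const bgp = internalName.match(/^vyos-bgp-peer-(\d+)$/);
  if (bgp) return `MÓDULO BGP-${bgp[1]}.Art`;
  const cor = internalName.match(/^vyos-vpn-remote-(\d+)$/);
  if (cor) return `MÓDULO SDWAN CoR-${cor[1]}.Art`;
  // Fall back to the raw internal name so unmapped components stay visible.
  return MODULE_LABELS[internalName] ?? internalName;
}
```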
| Customer-facing | Internal code/manifest | Function |
|---|---|---|
| MÓDULO ISP.Art | `vyos-isp-router` | Synthetic ISP / default route from NGFW |
| MÓDULO BGP-1..4.Art | `vyos-bgp-peer-{1..4}` | 4 eBGP peers, per-peer prefix count |
| MÓDULO OSPF.Art | `vyos-ospf-peer` | OSPFv2/v3 LSA injection (up to 2M /30 LSAs) |
| MÓDULO SDWAN CoR-1..10.Art | `vyos-vpn-remote-{1..10}` | 10 IPSec tunnels, per-tunnel bandwidth + workload mix |
| MÓDULO VXLAN-1..3.Art | `vyos-vtep-{playwright,k6,mac-arp}` | VTEP front-end for Agents (TRUST-only VXLAN) |
| MÓDULO PW.Art | browser-engine agents (`web-agent`) | HTTP/2 + HTTP/3 browser-driven load |
| MÓDULO K6.Art | synthetic-load agents (`k6-agent`) | TCP/TLS load with p50/p95/p99 stats |
| MÓDULO GO.Art | `control-plane-stress-agent` | MAC + ARP/NDP table flooding (Go agent) |
| MÓDULO IPr.Art | iperf agents (future SDWAN-Agent) | Inner workload generator inside SDWAN tunnels |
| MÓDULO PERSONAS.Art | persona pods (`persona-*` ns) | 100 personas / 20 countries / real NREN public IPs |
| MÓDULO CLONER.Art | cloner pod | 9 functions: clone + NTP + feedback + catalog refresh + API discovery + patch+TopURL fetch + upgrade + cloud-proxy + PVI orchestrator |
| MÓDULO FLOW.Art | NetFlow/IPFIX receiver | Flow telemetry collector |
| MÓDULO SYSLOG.Art | `promtail-syslog` | Syslog ingestion (Promtail) |
| MÓDULO SNMP.Art | `snmp_exporter` | SNMP polling of Nexus + NGFW |
| MÓDULO API INFRA.Art | DUT API orchestrator + future Cloner ext | Vendor REST orchestration (FMC, FortiOS, PAN-OS, Gaia, etc.) |
| MÓDULO CLI.Art | `ansible-orchestrator` | SSH/TELNET orchestration via Ansible playbooks |
| MÓDULO CA.Art ⭐ 2026-05-10 | cert-manager + persona-ca-issuer + sub-CA `tlsstress-fleet-ca` | Certificate orchestration (persona certs, NGFW CA, mTLS to cloud) |
| MÓDULO DoYour.Art ⭐ 2026-05-10 | `doyour-studio` (gVisor sandbox) | Operator-crafted tests via Scapy + Go embed + PCAP replay (3 modes, Art Studio UI) |
| MÓDULO KALI.Art ⭐ 2026-05-10 | `kali-pod` (gVisor sandbox) | Full Kali Linux pen-test distro (600+ tools, browser terminal, Team+ tier) |
| MÓDULO HAR.Art ⭐ 2026-05-10 | `har-replay` | Application-layer HAR replay (10k sessions/host scale, WAF tuning, regression) |
| MÓDULO VALIDATOR.Art ⭐ 2026-05-10 | `validator-api` + ML cortex sidecar | Central enrollment, role assignment, ML-driven orchestration (cloud-portable) |
| MÓDULO TREX.Art ⭐⭐ 2026-05-10 | `trex-pod` (DPDK + hugepages) | Cisco TRex stateful traffic gen — line-rate TCP/UDP/IPSec (30 Mpps/core, Team+) |
| MÓDULO SPAN.Art ⭐⭐ 2026-05-10 | `span-collector` (libpcap → AF_XDP → DPDK → SmartNIC tiers) | Line-rate packet capture / mirror — JA3/JA4 + TLS metadata + DUT validation |
| MÓDULO RELAY.Art ⭐⭐ 2026-05-10 | `relay-bridge` (HA pair) | Bridge OOBI ↔ customer MGMT — telemetry ingress + control egress (read-only default) |
| MÓDULO GATEWAY.Art ⭐⭐ 2026-05-10 | `gateway-proxy` (operator entry) | LDAP+SAML+passkey + RBAC + audit; OBP reverse-tunnel acceptor |
Total: 37 MÓDULOs.Art mapped (28 original + 9 added 2026-05-10).
Detailed mapping in memo feedback_module_naming_2026_05_09 and
post-Scope-Freeze additions in discuss_module_*_2026_05_10.
3-plane administrative classification (locked 2026-05-10)¶
The 37 MÓDULOs are administratively split into 3 planes based on
function (see memo project_module_planes_classification_2026_05_10):
| Plane | Count | MÓDULOs |
|---|---|---|
| DATA PLANE | 22 | ISP, SDWAN CoR-{1..10} (10), VXLAN-{1..3} (3), PW, synthetic-load engine, GO, IPr, DoYour, KALI, TREX, HAR |
| CONTROL PLANE | 5 | BGP-{1..4} (4), OSPF |
| MGMT PLANE | 10 | CLONER, FLOW, SYSLOG, SNMP, API INFRA, CLI, CA, VALIDATOR, RELAY, GATEWAY + PERSONAS controller (hybrid) |
Storage gravity principle: MGMT MÓDULOs with heavy storage (SYSLOG, FLOW, CLONER cache, Infra Stack TSDB+reports) are MANDATORY on-prem regardless of cloud-split mode. MGMT-light (CA, API INFRA, CLI, SNMP, VALIDATOR, RELAY, GATEWAY) are cloud-portable.
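As an illustration of the rule, a placement check could be as simple as the following sketch — names and shapes are hypothetical, not the shipped code:

```ts
// Illustrative sketch of the storage-gravity rule: only MGMT-light
// MÓDULOs may move to the cloud split; MGMT MÓDULOs with heavy on-prem
// storage (SYSLOG, FLOW, CLONER cache, Infra Stack TSDB) never do.
type Plane = "DATA" | "CONTROL" | "MGMT";

interface ModulePlacement {
  name: string;
  plane: Plane;
  heavyStorage: boolean; // e.g. SYSLOG, FLOW, CLONER cache
}

function allowedInCloud(m: ModulePlacement): boolean {
  // DATA and CONTROL planes stay on-prem with the bench; only MGMT-light
  // MÓDULOs (CA, API INFRA, CLI, SNMP, VALIDATOR, RELAY, GATEWAY) may move.
  return m.plane === "MGMT" && !m.heavyStorage;
}

allowedInCloud({ name: "SYSLOG", plane: "MGMT", heavyStorage: true }); // false
allowedInCloud({ name: "CA", plane: "MGMT", heavyStorage: false });    // true
```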
7 canonical Test Kinds (locked post-Scope-Freeze)¶
| Kind | Purpose |
|---|---|
| `tls-throughput` | Legacy two-leg TLS decryption throughput |
| `branch-office` | Free-form Mbps/Gbps WAN-only asymmetric |
| `inspection-profile` | 5 named profiles (minimal/balanced/paranoid/compliance/sandbox) |
| `sdwan-cor` | SDWAN/Cloud On-Ramp + DIA |
| `bgp-saturation` | Control-plane RIB stress (real Internet snapshot or synthetic) |
| `mac-arp-stress` | L2 capacity stress (table fill) |
| `pure` ⭐ | Production URL Replay Engine (real-world site replay against DUT) |
OOBI orchestration plane — VXLAN fabric (locked 2026-05-10, immutable canon)¶
In multi-node deployments, the Infra Stack (Dashboard + Postgres + Prometheus + Grafana + MÓDULO API INFRA.Art + MÓDULO CLI.Art) reaches each MÓDULO via OOBI on the existing VLAN 99 mgmt segment. Single-node deployments use the same VLAN as an in-host bridge — zero infra change.
OOBI is canonically immutable (per ADR 0019, locked 2026-05-10):
- IPv4: 10.254.254.0/24 (VLAN 99)
- IPv6: fd5a:7c5e:a72:0::/64 (ULA)
- VXLAN VNI: 254254 over UDP/4789 (HER underlay)
- 3-phase bootstrap: Cabling → mDNS Discovery → Overlay up
Subnet 10.254.254.0/24 (VLAN 99) carve-out:
| Range | Function |
|---|---|
| .1-.3 | Existing: Nexus 9000 SVI gateway, Nexus MGMT0, NGFW management interface (intact) |
| .4-.49 | Reserved DUT mgmt expansion + existing infra (snmp_exporter, etc.) |
| .50-.79 | MGMT-light MÓDULOs |
| .80-.84 | Special slots locked 2026-05-10 (CA .80, DoYour .81, KALI .82, HAR .83, VALIDATOR .84) |
| .85-.99 | Spare for future MGMT-light MÓDULOs |
| .100 | Infra Stack VIP (orchestration source) |
| .220 | MÓDULO TREX.Art (DATA-plane data leg; MGMT entry on OOBI) |
| .230 | MÓDULO SPAN.Art (line-rate capture; MGMT entry on OOBI) |
| .240-.241 | MÓDULO RELAY.Art (HA pair — bridge OOBI ↔ customer MGMT) |
| .250 | MÓDULO GATEWAY.Art (operator entry proxy — LDAP+SAML+passkey) |
NetworkPolicy enforces that only 10.254.254.100 (the Infra Stack VIP)
can initiate connections to the MÓDULOs' OOBI IPs. Lateral movement
(MÓDULO ↔ MÓDULO via OOBI) is blocked by default. DUTs and
customer-side switches NEVER join the OOBI overlay — RELAY.Art
bridges to them via dedicated MGMT NICs (read-only default).
DOM — DUT Operating Mode (locked 2026-05-10)¶
Per ADR 0014 (DOM family) the bench's behavior is gated by 5 discriminator modes:
| Mode | Description | Production-blocking workflows? |
|---|---|---|
| `greenfield` | New DUT, no production traffic | none |
| `staging` | Pre-prod mirror | warn-only |
| `lab` | Dedicated lab DUT | warn-only |
| `production` | Live customer DUT | all destructive ops blocked by DDPB chain |
| `prod-partition` | Production but operator-quarantined slice | partial — destructive ops require explicit unlock |
DOM family includes: CAE (Conditional Automation Engine), CPOS (Customizable Profile / Override Stack with atomic 2-phase commit), PIE family (RSM/IR/HID/AAE), DDPB (Defense-in-Depth Production Blocking, 7-layer chain), PDD (Production Drift Detection), SPP (Smart Profile Predictor — ML cortex integration), PIE-PA (3-layer defense for PURE in production: pod scale-to-0 + BGP withdraw + DNS sanity).
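A hedged sketch of the mode-level decision from the table above — the shipped DDPB chain layers 7 defenses on top of this, and all names here are illustrative:

```ts
// Sketch of how a DOM discriminator could gate destructive operations.
type DutOperatingMode =
  | "greenfield" | "staging" | "lab" | "production" | "prod-partition";

type Verdict = "allow" | "warn" | "block" | "require-unlock";

function gateDestructiveOp(mode: DutOperatingMode, explicitUnlock = false): Verdict {
  switch (mode) {
    case "greenfield":
      return "allow"; // new DUT, no production traffic
    case "staging":
    case "lab":
      return "warn"; // warn-only modes
    case "production":
      return "block"; // all destructive ops blocked by the DDPB chain
    case "prod-partition":
      // operator-quarantined slice: destructive ops need an explicit unlock
      return explicitUnlock ? "allow" : "require-unlock";
  }
}
```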
PURE — Production URL Replay Engine (locked 2026-05-10)¶
Test Kind #7 (added post-Scope-Freeze, 17 PRs). Replays real customer URLs through the DUT for production-mode validation.
- Discovery Hub ingests from 8 sources: Syslog · Vendor API · PCAP · HAR · Curated · SPAN · Cloud-derived · KALI nmap import
- PVI (Pre-flight Validation, Ingestion-time): CLONER fn #9 ephemeral synthetic-load engine/PW probe pods, 3-stage cascade, internet-direct via CLONER egress
- PVP (Pre-flight Validation, Pre-test): DUT-delta scope check before each run
- PIE-PA 3-layer defense MANDATORY in production (avoids MITM risk from personas on real public IPs 200.130.x.x)
- Pre-curated Tranco/Umbrella/Majestic monthly refresh + air-gap fallback + per-test audit trail
ZTP-prem — Zero-Trust-on-Premises posture (locked 2026-05-12)¶
Zero-Trust-on-Premises is the 12-layer security posture that
defends the bench against an insider operator with kubectl and root
credentials inside the customer's own organisation. Closed at 12 of
12 layers operationally with v3.7.0. Canonical reference page:
docs/SECURITY_ZTP_PREM.md.
| # | Layer | Status | Reference |
|---|---|---|---|
| 01 | Cloud HSM custody | runtime gate shipping; real HSM probe scheduled | dashboard/src/lib/license/hsm-heartbeat.ts |
| 02 | Confidential Computing detection | shipping (per-node DaemonSet) | pkg/ztp-prem-detect/ |
| 03 | TPM measured boot | probe scaffold shipping | pkg/ztp-prem-tpm/ |
| 04 | Sealed audit hash-chain | shipping with replay verifier | dashboard/src/lib/license/sealed-audit.ts |
| 05 | K8s admission webhook | audit + enforce + break-glass + chain cross-correlation | pkg/ztp-prem-admission/ · k8s/ztp-prem/admission-webhook.yaml |
| 06 | Anti-debug runtime | distroless + readonly-rootfs + dropped caps | every Tier B Dockerfile |
| 07 | Tier A/B partition | policy YAML + CI lint | platform/ztp-prem/tier-policy.yaml |
| 08 | UTXO token vault | shipping (notes-not-balances) | dashboard/src/lib/license/utxo.ts |
| 09 | Tier B obfuscation | CI gate shipping (garble matrix) | scripts/ztp-prem-obfuscate-tier-b.sh |
| 10 | DLP egress monitor | shipping (5 rules + redaction policy) | dashboard/src/lib/dlp/ |
| 11 | Behavioural anomaly detector | rule-based 4-detector shipping | dashboard/src/lib/ztp-prem/anomaly-detector.ts |
| 12 | Separation of duties | CODEOWNERS + nightly SoD audit + policy memo | .github/CODEOWNERS · platform/ztp-prem/SEPARATION-OF-DUTIES.md |
Three market-differentiated primitives (Patent claims #18–#25):
- Cross-language signing contract (Patent #18) — Go (`pkg/ztp-prem-signctl/canonical.go`) and Node (`dashboard/src/lib/license/envelope.ts`) produce byte-identical 295-byte canonical envelope signatures, including post-quantum primitives (ML-KEM-768 + ML-DSA-65). Verified across runtimes.
- UTXO token vault (Patent #19) — tokens are tamper-evident notes, not a mutable balance. There is no `balance` column in the schema, and no race condition is possible because no row is ever updated (sketched below).
- Admission ↔ sealed-audit cross-correlation (Patent #20) — K8s admission decisions are lifted from the webhook's local ring buffer into the sealed audit hash-chain via `dashboard/src/lib/ztp-prem/admission-correlate.ts`. Pod restarts do not erase forensic evidence.
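To make the notes-not-balances idea concrete, here is an illustrative in-memory sketch — the shipped schema lives in `dashboard/src/lib/license/utxo.ts` and enforces the same invariant with insert-only tables:

```ts
// Illustrative sketch of the UTXO token vault idea. Table/field names
// are hypothetical. In the real database the invariant is structural:
// notes and spends are append-only, a spend references a note id, and a
// uniqueness constraint on the spend row prevents double-spends without
// any read-modify-write of a balance.
interface Note { id: string; value: number; issuedAt: Date }
interface Spend { noteId: string; spentAt: Date; reason: string }

const notes: Note[] = [];   // append-only: notes are only ever inserted
const spends: Spend[] = []; // append-only: spending inserts, never updates

function spend(noteId: string, reason: string): boolean {
  // A note is "available" iff no spend row references it; nothing is
  // mutated in place, so every historical state stays reconstructible.
  const exists = notes.some((n) => n.id === noteId);
  const alreadySpent = spends.some((s) => s.noteId === noteId);
  if (!exists || alreadySpent) return false;
  spends.push({ noteId, spentAt: new Date(), reason });
  return true;
}

function available(): number {
  const spent = new Set(spends.map((s) => s.noteId));
  return notes
    .filter((n) => !spent.has(n.id))
    .reduce((sum, n) => sum + n.value, 0);
}
```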
Operator-visible surface — /admin/ztp-prem page ships
7 cards in v3.7.0: Confidential Computing status · Tier A/B
policy · License envelope summary · Envelope import · Sealed audit
replay verifier · DLP egress monitor · Admission audit (with mode
banner + DENIED/BREAK-GLASS badges + Correlate-to-sealed-audit
button). Four admin endpoints back the cards plus three new ones
landed in v3.7.0: /admission-audit · /admission-correlate ·
/hsm-heartbeat · /anomalies.
Integration with the rest of the architecture:
- ZTP-prem does not change the bench's test execution. It
surrounds it with auditability + admission policy + key custody.
- The license `authorize()` gate is the chokepoint where ZTP-prem
signals are consumed (HSM heartbeat freshness, UTXO note
availability, sealed audit append, admission decision recorded) — see the sketch after this list.
- All ZTP-prem env flags default OFF / audit-only on v3.7.0 —
enforcement flips (admission enforce mode, ZTP_PREM_HSM_REQUIRED=true)
are explicit operator actions following the canary rollout guides
embedded in the K8s manifest comments + the operator playbook in
docs/SECURITY_ZTP_PREM.md.
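A sketch of the chokepoint idea; signal names and shapes are illustrative, not the shipped API:

```ts
// All ZTP-prem signals converge in one authorize() decision, and the
// decision itself — allow or deny — lands in the sealed audit hash-chain.
interface ZtpPremSignals {
  hsmHeartbeatFreshMs: number; // age of the last HSM heartbeat
  utxoNoteAvailable: boolean;  // an unspent license note exists
  admissionDecision: "allow" | "deny" | "break-glass";
}

async function authorize(
  signals: ZtpPremSignals,
  appendSealedAudit: (event: object) => Promise<void>,
): Promise<boolean> {
  const ok =
    signals.hsmHeartbeatFreshMs < 60_000 &&
    signals.utxoNoteAvailable &&
    signals.admissionDecision !== "deny";
  await appendSealedAudit({ at: new Date().toISOString(), signals, ok });
  return ok;
}
```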
Stack diagram¶
flowchart LR
subgraph operator["Operator"]
UI["Next.js dashboard<br/>App Router · Tailwind · Drizzle"]
end
subgraph k8s["Kubernetes (or docker-compose)"]
DASH["dashboard pods<br/>HMAC session · audit log<br/>SSE stream · Prometheus /api/metrics"]
AGENTS["browser-engine agents (1..300)<br/>browser engine<br/>per-target circuit breaker<br/>idempotency keys · TLS 1.3"]
K6AGENTS["synthetic-load agents (1..1000)<br/>grafana/k6 binary<br/>p50/p95/p99 latency<br/>error rate · throughput"]
PG["postgres 16<br/>cluster_config · agents · runs<br/>k6_agents · k6_targets · k6_runs<br/>run_resources · metrics_buckets<br/>audit_log · idempotency"]
subgraph ztpprem["ZTP-prem 12/12 layers (v3.7.0)"]
ADM["admission webhook<br/>audit · enforce · break-glass"]
SEAL["sealed audit hash-chain<br/>WORM + cross-correlation"]
DLP["DLP egress monitor<br/>5 baseline rules · observe-only"]
VAULT["UTXO token vault<br/>LICENSE.Art envelope"]
end
end
subgraph external["Public Internet"]
SITES["Target HTTPS sites<br/>(g1, uol, NASA, Cisco …)"]
end
OP[(Browser)] -->|"/login · cookie"| UI
UI <--> DASH
DASH <-->|"/api/agents/* · runs/*<br/>Bearer token + Idempotency-Key"| AGENTS
DASH <-->|"/api/k6/agents/* · k6/runs/*<br/>Bearer token"| K6AGENTS
DASH -->|"SQL"| PG
DASH -.->|"every admit / spend / fetch"| ADM
ADM -.-> SEAL
DASH -.->|"every egress"| DLP
DLP -.-> SEAL
DASH -.->|"licence check"| VAULT
AGENTS -->|"navigation, full page load,<br/>per-target h2/h3 + profile"| SITES
K6AGENTS -->|"k6 run (VUs, duration)<br/>HTTP/1.1 · HTTP/2 · HTTP/3"| SITES
AGENTS -.->|"protocol, bytes, TCP open,<br/>tls version, web vitals"| DASH
K6AGENTS -.->|"p50/p95/p99, error_rate,<br/>RPS, data_received"| DASH
DASH -->|"Server-Sent Events"| OP
classDef block fill:#fff,stroke:#000,stroke-width:1px
classDef ext fill:#f5f5f7,stroke:#000,stroke-width:1px
classDef ztp fill:#0F1A2E,stroke:#00D8FC,color:#FCFCFC,stroke-width:2px
class DASH,AGENTS,K6AGENTS,PG block
class SITES,UI ext
class ADM,SEAL,DLP,VAULT ztp
ZTP-prem layer (v3.7.0): every dashboard action that touches a Tier-classified workload, spends a license note, or issues outbound network egress flows through the admission webhook / license vault / DLP wrapper respectively, and every decision is recorded into the sealed audit hash-chain (WORM, tamper-evident). See SECURITY_ZTP_PREM.md and the ZTP-prem ADR family.
Component breakdown¶
browser-engine agent (agent/)¶
- Stateless Linux + headless Chromium worker, distributed via the official `mcr.microsoft.com/playwright` base image.
- Per-cycle lifecycle:
  - `register` (idempotent, retried with exponential backoff + jitter);
  - `fetchConfig` (every `POLL_INTERVAL_MS`);
  - for each tick, pick the target whose `next-run-at` is earliest (bursting targets jump the queue);
  - open a fresh `BrowserContext` (no cookies / cache / IndexedDB reuse), navigate, parse subresources, capture protocol / `tcp_sockets_open` / `mbps_avg`;
  - close the context, wipe the on-disk profile, schedule the next run for that target.
- Browser pool keyed by `(httpVersion, host)` so HTTP/3 forcing (`--enable-quic --origin-to-force-quic-on=…`) recycles minimally.
- Sentinel relaunches Chromium after N consecutive browser-level failures; a per-target circuit breaker pauses misbehaving targets (sketched below).
- `CYCLE_CONCURRENCY` runs N parallel cycles inside one Chromium for per-replica throughput.
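A minimal sketch of the per-target circuit-breaker pattern described above; thresholds and names are illustrative:

```ts
// After N consecutive failures a target is paused for a cooldown window;
// any success closes the breaker again.
class TargetBreaker {
  private failures = 0;
  private pausedUntil = 0;

  constructor(
    private readonly maxFailures = 5,
    private readonly cooldownMs = 60_000,
  ) {}

  canRun(now = Date.now()): boolean {
    return now >= this.pausedUntil;
  }

  recordSuccess(): void {
    this.failures = 0; // success resets the consecutive-failure count
  }

  recordFailure(now = Date.now()): void {
    this.failures += 1;
    if (this.failures >= this.maxFailures) {
      this.pausedUntil = now + this.cooldownMs; // pause the misbehaving target
      this.failures = 0;
    }
  }
}
```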
synthetic-load agent (k6-agent/)¶
- Stateless TypeScript + Node.js process that wraps the official `grafana/k6` binary (copied from the `grafana/k6:0.56.0` base image at build time — no Alpine package dependency).
- Per-cycle lifecycle (spawn step sketched below):
  - `register` with the dashboard (idempotent upsert);
  - `heartbeat` every `HEARTBEAT_INTERVAL_MS`;
  - poll `/api/k6/agents/{id}/config` every `POLL_INTERVAL_MS`;
  - for each enabled target whose `next-run-at` is due, POST `/api/k6/runs/start`, write a k6 script to `/tmp`, spawn `k6 run --summary-export /tmp/summary.json`;
  - parse the summary JSON and POST `/api/k6/runs/complete` with `p50/p95/p99`, `error_rate`, `rps`, `data_received`.
- `/healthz` HTTP probe on port 9090.
- K8s-ready: `readOnlyRootFilesystem: true`; `/tmp` and `/home/k6agent` are `emptyDir: medium: Memory` volumes injected at pod creation.
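The spawn-and-parse step could look roughly like this sketch — it assumes the k6 binary is on `PATH`, and the exact summary keys depend on the configured trend stats (stock k6 exports `med` rather than `p(50)`):

```ts
import { spawn } from "node:child_process";
import { readFile } from "node:fs/promises";

// Run a k6 script, wait for exit, then read back the exported summary.
async function runK6(scriptPath: string): Promise<Record<string, unknown>> {
  const summaryPath = "/tmp/summary.json";
  await new Promise<void>((resolve, reject) => {
    const child = spawn(
      "k6",
      ["run", "--summary-export", summaryPath, scriptPath],
      { stdio: "inherit" },
    );
    child.on("error", reject);
    child.on("exit", (code) =>
      code === 0 ? resolve() : reject(new Error(`k6 exited with ${code}`)),
    );
  });
  const summary = JSON.parse(await readFile(summaryPath, "utf8"));
  // The latency trend feeds the POST to /api/k6/runs/complete.
  return summary.metrics?.http_req_duration ?? {};
}
```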
Dashboard (dashboard/)¶
- Next.js 15 App Router (TypeScript). All API routes are `force-dynamic` and emit `cache-control: no-store`.
- Drizzle ORM over `postgres-js`. Lazy initialisation via a Proxy so `next build` works without a live database (sketched below).
- Authentication: HMAC-signed HttpOnly cookie on `/login`; the legacy `Authorization: Basic` header is still accepted for programmatic callers. Constant-time comparison; rate-limited token bucket per client IP.
- Server-Sent Events (`/api/dashboard/stream`) push a `tick` event whenever `runs.{started_at,finished_at,total}` changes; React Query invalidates caches on every tick. 30-second polling fallback for tunnels that strip event-stream framing.
- Prometheus endpoint at `/api/metrics`, exposing both in-memory counters (`runs_completed_total`, `runs_started_total`, `runs_by_protocol_total`, `reaper_*`) and live gauges (`agents_live`, `runs_per_minute`, `cycle_avg_ms`).
- Dark mode via `class` strategy; i18n PT/EN/ES with persistence to `localStorage`.
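The lazy-initialisation trick is worth showing; a sketch under the assumption of a `DATABASE_URL` env var:

```ts
// A Proxy defers creating the postgres-js connection until the first
// property access, so `next build` can import this module without a
// live database.
import postgres from "postgres";
import { drizzle } from "drizzle-orm/postgres-js";

type Db = ReturnType<typeof drizzle>;
let real: Db | undefined;

export const db = new Proxy({} as Db, {
  get(_target, prop) {
    // First touch creates the client; build-time imports never get here.
    real ??= drizzle(postgres(process.env.DATABASE_URL!));
    return Reflect.get(real, prop);
  },
});
```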
Database¶
- Postgres 16, tunable via env (compose default: `shared_buffers=256MB`, `work_mem=8MB`, `max_connections=200`).
- Twenty-three idempotent migrations (`0000`…`0023`) covering the full schema evolution. Optional migration `0004` enables a TimescaleDB hypertable + continuous aggregate when the extension is available. Migration `0014` adds the synthetic-load engine tables; migrations `0018`–`0023` extend the schema for clone-persona slots (`0018`), license acceptance (`0019`), the test-plan catalog (`0020`), internal revision tracking (`0021`), and the DUT API device registry + encrypted-credential storage + sanitized snapshot tables (`0023`).
- `audit_log` records every admin mutation (actor, source IP, before / after JSON).
- `idempotency` deduplicates retried POSTs and is GC'd by the reaper after 30 minutes (sketched below).
Local-scaler (docker-compose.scaler.yml, opt-in via .env)¶
- Mounts `/var/run/docker.sock` into the dashboard container.
- The dashboard talks to the Docker Engine API over the unix socket and reconciles `cluster_config.desired_agent_count` with running containers labelled `com.docker.compose.service=agent`.
- Runs the dashboard as the `dashboard` user joined to the host's `docker` group via `group_add`; `scripts/init-env.sh` detects the right `DOCKER_GID` for the host.
Kubernetes manifests (k8s/)¶
- Namespace, Postgres `StatefulSet`, dashboard `Deployment` (with `topologySpreadConstraints` and `PodDisruptionBudget`), browser-engine agent `Deployment` plus `HorizontalPodAutoscaler` min=1, max=300 based on CPU/memory.
- synthetic-load agent `Deployment` plus HPA min=0, max=1000 (scale-to-zero requires KEDA or Kubernetes ≥ 1.32); PDB `maxUnavailable: 10%`. Volumes: `emptyDir: medium: Memory` for `/tmp` (256 Mi) and `/home/k6agent` (64 Mi) — `readOnlyRootFilesystem: true`.
- Two `NetworkPolicy` resources — one for the browser engine, one for the synthetic-load engine — each with default-deny ingress and an explicit egress allow-list (DNS, dashboard, public TCP 80/443, UDP 443 for QUIC; private CIDRs blocked).
- Daily Postgres `pg_dump` `CronJob` writing to a 50 GiB PVC, 14-day retention. `PrometheusRule` with five alerts: no agents live, agents below desired, error rate > 20 %, slow cycles, reaper churn.
DUT API engine (dashboard/src/lib/dut-api/)¶
A thin vendor-adapter layer that captures sanitized config + identity from the
Device Under Test and persists hashed snapshots for every test run. Four
adapters ship today, each implementing a common DutAdapter interface:
| Adapter file | Vendor | Transport | Captures |
|---|---|---|---|
| `cisco-ftd.ts` | Cisco Firepower (FTD via cdFMC / FMC) | REST | Model, serial, version, decrypt-policy state, sanitized running-config |
| `cisco-nexus.ts` | Cisco Nexus 9000 | NX-API (HTTPS) | Model, serial, version, interface counters, port-channel state |
| `cisco-ucs-cimc.ts` | Cisco UCS chassis | Redfish (HTTPS) | Hardware inventory, PSU, fans, temp, BIOS |
| `fortinet-fortigate.ts` | FortiGate | REST (FortiOS) | Model, serial, version, sanitized config, session-table summary |
`encryption.ts` wraps device credentials with AES-GCM at rest. `poller.ts` runs
a 60 s liveness loop; failures surface on `/admin/dut-api`. `snapshot-storage.ts`
canonicalizes the snapshot JSON, hashes it (SHA-256), and persists both blob +
hash so the report-builder can embed Annex B/C/D and prove the device state at
run start. `registry.ts` is the device-registry CRUD with integration into
`audit_log` (every register / update / delete is recorded).
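A hedged sketch of what the shared `DutAdapter` contract might look like — field and method names are inferred from the prose above, not copied from `dashboard/src/lib/dut-api/`:

```ts
// Each vendor adapter captures sanitized config + identity; the snapshot
// is canonicalized and SHA-256-hashed downstream by snapshot-storage.
interface DutIdentity {
  model: string;
  serial: string;
  version: string;
}

interface DutSnapshot {
  identity: DutIdentity;
  sanitizedConfig: Record<string, unknown>; // secrets stripped before hashing
  capturedAt: string;                        // ISO timestamp
}

interface DutAdapter {
  vendor: string;                          // e.g. "cisco-ftd", "fortinet-fortigate"
  probe(): Promise<boolean>;               // the 60 s liveness loop calls this
  captureSnapshot(): Promise<DutSnapshot>; // embedded in report Annex B/C/D
}
```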
Test-plan + test-run + pre-flight engines¶
Three small modules, kept independent:
- Test-plan catalog — `platform/test-plans/catalog.yaml` is loaded at Dashboard startup into `test_plans` (`0020`). Each plan has an immutable `identifier` (e.g. `CAP-FIND-KNEE-30M`) plus `phases[]` and `requirements{}`.
- Test-run executor — `dashboard/src/lib/test-runs/` runs through phases: preflight gate → snapshot DUTs → freeze plan (`planSnapshotSha256`) → fan-out runs to agents → collect results → write `report.json` → generate the print-styled HTML report.
- Pre-flight engine — `dashboard/src/lib/preflight/` runs a 5-check catalog (subnet conflict OOBI ↔ ISP ↔ persona VLANs, NGFW reachability + decrypt mode matches the plan, persona PKI freshness, NTP relay clock skew within tolerance, DUT API auth + snapshot succeeds). Failures block run start; bypass writes to `audit_log` and shows on the report cover. A runner sketch follows this list.
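A sketch of such a runner — check names mirror the catalog above, while the shipped engine in `dashboard/src/lib/preflight/` will differ in shape:

```ts
// Every check runs to completion so the operator sees all failures at
// once; any failure blocks run start.
interface PreflightCheck {
  name: string;
  run(): Promise<{ ok: boolean; detail?: string }>;
}

async function runPreflight(checks: PreflightCheck[]): Promise<boolean> {
  const failures: string[] = [];
  for (const check of checks) {
    const result = await check.run();
    if (!result.ok) failures.push(`${check.name}: ${result.detail ?? "failed"}`);
  }
  if (failures.length > 0) {
    // Failures block run start; a bypass would be written to audit_log
    // and surfaced on the report cover instead of silently proceeding.
    console.error("Pre-flight blocked run start:", failures);
    return false;
  }
  return true;
}
```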
Multi-node overlay (overlays/multi-node/)¶
For deployments across multiple physical servers, the Kustomize overlay at
overlays/multi-node/ applies nodeSelector patches to each
Deployment/StatefulSet so each workload tier is pinned to its dedicated server:
| Patch file | Target workload | Pinned to |
|---|---|---|
| `patch-playwright-nodesel.yaml` | `web-agent` Deployment | `role=playwright` (UCS-2) |
| `patch-k6-nodesel.yaml` | `k6-agent` Deployment | `role=k6` (UCS-3) |
| `patch-infra-nodesel.yaml` | dashboard, postgres, pgbouncer, cloner | `role=infra` (UCS-4) |
| `patch-cloner-nodesel.yaml` | `cloner` Deployment | `role=infra` (UCS-4) |
The 30 persona webservers (Caddy) — 20 Synthetic Personas (VLANs 101–120) plus
10 Cloned Persona slots (VLANs 200–209) — are pinned to UCS-1 (role=ngfw-dut)
via the DUT overlay. The tuning DaemonSet is patched post-deploy to use
dut-data-plane=true so it runs on UCS-1/2/3 but not UCS-4.
Public Website Cloner (k8s/80-83-cloner-*.yaml)¶
The cloner is an optional Deployment (1 replica) that downloads real public websites
using a headful browser engine with stealth, and serves the static mirror at
http://clone-serve.web-agents.svc.cluster.local:8081/{personaName}/.
Dual-NIC architecture — VLAN 40:
eth0 (OOBI / K8s default) ──► Dashboard API, heartbeat, Prometheus scrape
net1 (VLAN 40 / Multus) ──► Internet via eth1.40 — HTTP/2, HTTP/3, DNS, ICMP
DHCP IP from upstream ISP/lab router
NOT routed through the NGFW
| Interface | VLAN | Purpose |
|---|---|---|
| eth0 | — | OOBI: K8s control plane, Dashboard, CoreDNS |
| net1 | 40 | ISP: direct internet egress (no NGFW) |
The cloner-isp NAD attaches a macvlan on eth1.40 (VLAN 40 on the Nexus 9000 trunk).
An initContainer configures policy routing: all non-RFC1918 traffic is fwmarked to
routing table 100 (default via ISP gateway on net1). DNS is forced to 8.8.8.8 /
208.67.222.222 regardless of DHCP, plus 10.96.0.10 (k3s CoreDNS) for internal names.
HTTPS and public CA certificates:
The cloner accesses public HTTPS sites directly (no NGFW interception). Chromium validates
certificates against the Debian ca-certificates bundle (Mozilla NSS roots). No special
configuration is needed for standard public sites. The NGFW CA is not injected into the
cloner — it is not needed because the cloner bypasses the NGFW entirely via VLAN 40.
Internet health monitor: pings 8.8.8.8, 1.1.1.1, and the DHCP gateway every 10 s.
Exposes Prometheus metrics: cloner_internet_up{target}, cloner_internet_any_up,
cloner_gateway_up{gateway}, cloner_ping_rtt_ms{target}.
Storage: 20 Gi local-path PVC (cloned-sites) at /mnt/cloned/{personaName}/.
Served by the clone-serve ClusterIP Service on port 8081.
Multi-node: pinned to role=infra (UCS-4) — same node as Dashboard.
VLAN 40 must be allowed on the trunk port from Nexus 9000 to UCS-4.
See CLONER_OPERATIONS.md for deployment, job management, Grafana monitoring, and full troubleshooting guide.
Apply with:
kubectl apply -k overlays/multi-node/
See UBUNTU_K3S_MULTINODE_QUICKSTART_DEPLOY.en.md for the full guide.
Data flow per cycle¶
browser-engine agent¶
sequenceDiagram
participant A as Agent (Chromium)
participant D as Dashboard
participant P as Postgres
A->>D: POST /api/agents/register (Bearer)
D->>P: UPSERT agents
A->>D: GET /api/agents/{id}/config
D->>P: SELECT cluster_config + active targets
loop every cycle
A->>D: POST /api/runs/start (Idempotency-Key)
D->>P: INSERT runs
D-->>A: { runId }
A->>A: navigate + capture metrics
A->>D: POST /api/runs/complete (Idempotency-Key=runId)
D->>P: UPDATE runs · INSERT run_resources
D->>P: UPSERT metrics_buckets (60s)
end
A-->>D: POST /api/logs (heartbeat + cycle events)
D-->>P: INSERT logs
synthetic-load agent¶
sequenceDiagram
participant K as synthetic-load agent (Node.js)
participant D as Dashboard
participant P as Postgres
participant B as k6 binary
K->>D: POST /api/k6/agents/register (Bearer)
D->>P: UPSERT k6_agents
K->>D: GET /api/k6/agents/{id}/config
D->>P: SELECT k6_targets (enabled)
loop every target interval
K->>D: POST /api/k6/runs/start
D->>P: INSERT k6_runs
D-->>K: { runId }
K->>B: spawn k6 run --summary-export /tmp/summary.json
B-->>K: exit 0 + JSON metrics
K->>D: POST /api/k6/runs/complete (p50/p95/p99, error_rate, rps)
D->>P: UPDATE k6_runs
end
K->>D: POST /api/k6/agents/{id}/heartbeat (every 30 s)
D->>P: UPDATE k6_agents.last_seen_at
DUT test-bed topology (physical NGFW mode)¶
When scripts/stack-up.sh up-dut is used, the cluster is extended with a
physical NGFW Device Under Test (DUT) connected over 802.1q VLAN trunks.
The NGFW is the L3 default gateway for all three test-plane VLANs.
Physical + virtual topology¶
flowchart TB
%% ═══════════════════════════════════════════════════════════════════════
%% CISCO UCS — Ubuntu Linux host
%% ═══════════════════════════════════════════════════════════════════════
subgraph ucs["Cisco UCS — Ubuntu Linux host"]
direction TB
ETH1(["eth1\n802.1q trunk NIC"])
subgraph vlans["Kernel: 802.1q VLAN subinterfaces"]
direction LR
V20["eth1.20\nVLAN 20"]
V30["eth1.30\nVLAN 30"]
V40["eth1.40\nVLAN 40\nCloner ISP"]
V99["eth1.99\nVLAN 99 mgmt"]
end
subgraph nets["Docker networks (Docker mode) / K8s NADs (K8s mode)"]
direction LR
MPW["ai_forse_dut_pw\nmacvlan · 172.16.0.0/16\ngw 172.16.0.1"]
MK6["ai_forse_dut_k6\nmacvlan · 172.17.0.0/16\ngw 172.17.0.1"]
MMGMT["ai_forse_mgmt\nmacvlan · 10.254.254.0/24\ngw 10.254.254.1"]
OOBI(["ai_forse_oobi\nbridge --internal\ncontrol plane"])
end
subgraph ctrs_data["Data-plane containers"]
direction LR
PW["browser engine\nagents\n:8080/healthz"]
K6A["synthetic-load engine\nagents\n:9090/healthz"]
end
subgraph ctrs_obs["Observability containers"]
direction LR
SNMPEXP["snmp_exporter\n:9116"]
PROM["Prometheus\n:9090"]
GRAF["Grafana\n:3001"]
end
subgraph ctrs_ctrl["Control-plane containers"]
direction LR
DASH["Dashboard\n:3000"]
PG["Postgres\n:5432"]
end
%% kernel bindings
ETH1 --- V20 & V30 & V40 & V99
%% macvlan ↔ container bindings
V20 --> MPW --> PW
V30 --> MK6 --> K6A
V99 --> MMGMT --> SNMPEXP
%% oobi control plane (all containers share this bridge)
OOBI --- PW
OOBI --- K6A
OOBI --- SNMPEXP
OOBI --- PROM
OOBI --- GRAF
OOBI --- DASH
OOBI --- PG
end
%% ═══════════════════════════════════════════════════════════════════════
%% CISCO NEXUS 9000
%% ═══════════════════════════════════════════════════════════════════════
subgraph nexus["Cisco Nexus 9000 — NX-OS 10.5(4)M"]
direction TB
NTRUNK["Trunk port\nallowed VLAN 20,30,99,101-120,200-209"]
NA20["Access port\nVLAN 20"]
NA30["Access port\nVLAN 30"]
NMGMT0(["MGMT0\n10.254.254.2"])
end
%% ═══════════════════════════════════════════════════════════════════════
%% NGFW DUT
%% ═══════════════════════════════════════════════════════════════════════
subgraph dut["NGFW DUT — Device Under Test"]
direction TB
NI2["int-2 · VLAN 20\n172.16.0.1/16\nNet_Prod_browser engine"]
NI3["int-3 · VLAN 30\n172.17.0.1/16\nNet_prod_K6_Agents"]
NMGT(["Mgmt interface\n10.254.254.3/24"])
DTLS{{"TLS decryption engine\nintercept · inspect · re-sign"}}
NI2 --- DTLS
NI3 --- DTLS
end
%% ═══════════════════════════════════════════════════════════════════════
%% PHYSICAL CABLING
%% ═══════════════════════════════════════════════════════════════════════
ETH1 <-->|"802.1q trunk"| NTRUNK
NA20 <-->|"L2 · VLAN 20"| NI2
NA30 <-->|"L2 · VLAN 30"| NI3
NMGMT0 <-->|"OOB mgmt cable"| NMGT
%% ═══════════════════════════════════════════════════════════════════════
%% DATA PLANE — TLS test traffic path
%% browser-engine/synthetic-load → macvlan → Nexus → NGFW (decrypt)
%% → Nexus → macvlan → Caddy
%% ═══════════════════════════════════════════════════════════════════════
PW -->|"default route\n172.16.0.1"| NI2
K6A -->|"default route\n172.17.0.1"| NI3
%% ═══════════════════════════════════════════════════════════════════════
%% CONTROL PLANE — stays on ai_forse_oobi, never leaves host
%% ═══════════════════════════════════════════════════════════════════════
DASH <-->|"Bearer · SSE\nIdempotency-Key"| PW
DASH <-->|"Bearer · SSE\nIdempotency-Key"| K6A
DASH --- PG
%% ═══════════════════════════════════════════════════════════════════════
%% OBSERVABILITY — Prometheus → snmp_exporter → physical devices
%% ═══════════════════════════════════════════════════════════════════════
PROM -->|"HTTP :9116\ncisco_nexus\n→ 10.254.254.2"| SNMPEXP
PROM -->|"HTTP :9116\nSNMP_DUT_MODULE\n→ 10.254.254.3"| SNMPEXP
SNMPEXP -->|"SNMP UDP/161"| NMGMT0
SNMPEXP -->|"SNMP UDP/161"| NMGT
PROM -->|"HTTP :9100\nnode_exporter\nfile_sd"| ETH1
GRAF -->|"PromQL"| PROM
%% ═══════════════════════════════════════════════════════════════════════
%% STYLES
%% ═══════════════════════════════════════════════════════════════════════
classDef mac fill:#dbeafe,stroke:#2563eb,stroke-width:1px
classDef oobi fill:#fce7f3,stroke:#db2777,stroke-width:1px
classDef ctr fill:#f0fdf4,stroke:#16a34a,stroke-width:1px
classDef sw fill:#f3f4f6,stroke:#374151,stroke-width:2px
classDef vlan fill:#fef9c3,stroke:#b45309,stroke-width:1px
classDef trunk fill:#fff7ed,stroke:#c2410c,stroke-width:1px
class MPW,MK6,MMGMT mac
class OOBI oobi
class PW,K6A,SNMPEXP,PROM,GRAF,DASH,PG ctr
class NTRUNK,NA20,NA30,NMGMT0,NI2,NI3,NMGT,DTLS sw
class V20,V30,V40,V99 vlan
class ETH1 trunk
Traffic paths summary¶
| Path | Route |
|---|---|
| Data plane (synthetic-load engine example, K8s mode) | synthetic-load engine pod (net1, VLAN 30) → trunk → Nexus (VLAN 30) → NGFW → TLS decrypt → NGFW outside → Nexus (VLAN 101-120 or 200-209) → Persona Caddy pod (net1) |
| Data plane (browser engine, K8s mode) | browser engine pod (net1, VLAN 20) → trunk → Nexus (VLAN 20) → NGFW → TLS decrypt → NGFW outside → Nexus (VLAN 101-120 or 200-209) → Persona/Cloned Persona Caddy pod (net1) |
| Cloner ISP | Cloner (net1) → eth1.40 (VLAN 40) → trunk → Nexus → upstream router → Internet (direct, no NGFW) |
| Control plane | Any container → ai_forse_oobi (bridge, --internal) → Dashboard — never leaves the host |
| SNMP | Prometheus → snmp_exporter :9116 (oobi) → ai_forse_mgmt (macvlan) → eth1.99 → trunk → Nexus MGMT0 (10.254.254.2) / NGFW mgmt (10.254.254.3) |
| node_exporter | Prometheus → 10.254.254.x:9100 (Ubuntu hosts, file_sd) → management VLAN |
DUT network segments¶
NGFW security zones locked 2026-05-09 — the NGFW has 8 interfaces (or sub-interfaces, depending on hardware) split across 3 zones:
| VLAN | Interface host | NGFW zone | Subnet | Purpose |
|---|---|---|---|---|
| 20 | eth1.20 / `dut-pw` NAD | TRUST / INSIDE | 172.16.0.0/16 | browser-engine agents (MÓDULO PW.Art) → through NGFW |
| 30 | eth1.30 / `dut-k6` NAD | TRUST / INSIDE | 172.17.0.0/16 | synthetic-load agents (MÓDULO K6.Art) → through NGFW |
| 2900 | eth1.2900 (NAD pending) | TRUST / INSIDE | 172.19.0.0/16 (synthetic spoof pool: 10.96.0.0/11 + fd00:1234::/32) | MAC + ARP/NDP table stress (MÓDULO GO.Art) → NGFW |
| 2901 | eth1.2901 (NAD pending) | TRUST / INSIDE | 172.20.0.0/16 | MÓDULO DoYour.Art (Scapy/Go/PCAP sandbox) → NGFW |
| 2902 | eth1.2902 (NAD pending) | TRUST / INSIDE | 172.21.0.0/16 | MÓDULO KALI.Art (Kali Linux pen-test pod) → NGFW |
| 2001 | eth1.2001 / `edge-link-vyos` NAD | UNTRUST / OUTSIDE | 200.130.0.0/30 | Edge link NGFW outside ↔ VyOS-ISP (NGFW=.1, VyOS=.2); the NGFW's default route |
| 1982 | eth1.1982 (NAD pending) | UNTRUST / OUTSIDE | 200.130.0.8/30 | OSPF peer link — NGFW=.9, VyOS-OSPF-PEER=.10 |
| 2809 | eth1.2809 / `bgp-peer-vlan2809` NAD | UNTRUST / OUTSIDE | 200.130.0.12/30 + 2001:db8:0:2809::/126 ⚠️ | BGP peer link (IP-migrated v4.3 — IPv6 still doc-prefix until v4.5+) |
| 4338 | eth1.4338 (NAD pending) | UNTRUST / OUTSIDE | 200.130.0.4/30 | MÓDULO SDWAN/CoR-N.Art peer link (local mode) — NGFW=.5, VyOS peer=.6. Legacy term "VPN-REMOTE" refers to this same data-plane leg |
| 99 | eth1.99 / `dut-mgmt` + `oobi-mgmt` NADs | MGMT | 10.254.254.0/24 | SNMP probe band (host-local .10-.49) + OOBI orchestration (static .50-.82 MÓDULOs) |
VLANs that do NOT terminate on the NGFW:
| VLAN | Reason |
|---|---|
| 40 (`cloner-isp` NAD, DHCP from upstream) | Bypass — cloner egress goes directly to the public Internet, never through the NGFW |
| 101–120 (Synthetic Persona NADs) | Personas live BEHIND the VyOS-ISP pod (one country /24 per VLAN); NGFW only sees the single edge link (VLAN 2001) — VyOS-ISP routes onward to each country |
| 200–209 (Cloned Persona slots, legacy) | Being deprecated — cloned personas migrate to country /24s alongside Synthetic ones |
Web test traffic flow (corrected for v4.3)¶
[ Agent pod (browser-engine/synthetic-load) ]
│ VLAN 20 or 30 (TRUST/INSIDE)
▼
[ NGFW — TLS decrypt LEG 1 ]
│ outside interface, VLAN 2001 (UNTRUST/OUTSIDE)
│ default route → 200.130.0.2 (VyOS-ISP)
▼
[ VyOS-ISP — country-LAN router ]
│ one outbound interface per country LAN
│ routes to <NREN-/24>.10..14 (5 personas per country)
▼
[ Persona Caddy pod — TLS LEG 2 ]
e.g. shop.us.persona.tlsstress.art = 198.32.10.10
Three consequences of personas not talking to the NGFW directly:
1. The NGFW interface inventory is 8 interfaces total (3 INSIDE + 4 OUTSIDE + 1 MGMT) — not 30+ as the legacy tables suggested.
2. Persona NADs (101–120) terminate on a per-country dummy interface inside the VyOS-ISP pod, not on eth1.10X host sub-interfaces facing the NGFW.
3. The legacy "NGFW gateway = 10.1.x.1" wording in older docs is incorrect for v4.3 — gateway is <NREN-/24>.1 on the VyOS-ISP pod. Cleanup queued separately (Personas v4.3 IP migration option B).
SNMP monitoring — supported vendors¶
The snmp_exporter container is dual-homed: OOBI (Prometheus scrapes it)
and ai_forse_mgmt macvlan (physical reach to devices).
| SNMP module | Device | Key metrics |
|---|---|---|
| `cisco_nexus` | Cisco Nexus 9000 | CPU/mem/temp/fan/PSU/CRC/queue drops |
| `cisco_ftd` | Cisco Firepower FTD 7.x | sessions, CPS, HW crypto engine, interfaces |
| `cisco_iosxe` | Cisco C8475-G2 IOS-XE | CPU/mem/temp/fan/PSU, queue drops |
| `cisco_meraki` | Meraki MX | IF-MIB only (Dashboard API needed for more) |
| `fortinet_fortigate` | FortiGate FG200F/FG600F | fgSysCpuUsage, fgSysSesCount, fgSysSesRate1 |
| `palo_alto` | PA-series PAN-OS | panSessionActive, panSysCpuUtilization, throughput |
| `checkpoint` | Check Point Quantum/R81 | fwNumConn, fwDropped, svnPerfCPU |
| `huawei_ngfw` | Huawei USG/NGFW | hwEntityCPUUsage, hwEntityMemUsage, temperature |
| `generic_ngfw` | Any vendor | Standard RFC MIBs (IF-MIB + HOST-RESOURCES-MIB) |
Set `SNMP_DUT_MODULE` in `.env` to select the module. The Nexus 9000 always
uses `cisco_nexus`.
Ubuntu host monitoring (Cisco UCS)¶
Physical Ubuntu hosts (Cisco UCS servers) are monitored via node_exporter
running on each host. Prometheus scrapes them using file-based service
discovery (observability/prometheus/targets/ubuntu-hosts.yml) — edit the
file to add/remove hosts; Prometheus reloads in 30 s, no restart required.
For Cisco UCS chassis-level sensors (CIMC temperature, fans, PSU), enable
SNMP on CIMC and add the CIMC management IP as a separate generic_ngfw
scrape job.
DUT Grafana dashboards¶
| Dashboard | UID | Description |
|---|---|---|
| Nexus 9000 | `dut-nexus9000` | Switch health: CPU, memory, temp, fans, PSU, CRC/FCS, queue drops, optical sensors |
| NGFW DUT | `dut-ngfw` | DUT: throughput per VLAN, active sessions, CPS, HW crypto, CPU, memory, temp, CRC |
| Ubuntu Hosts (UCS) | `dut-ubuntu-hosts` | OS-level: CPU, load, memory, disk I/O, filesystem, network, processes, Docker |
| UCS CIMC — Chassis | `dut-ucs-cimc-chassis` | UCS chassis: PSU, fans, temp, BIOS — Redfish-sourced via the DUT API adapter |
| UCS CIMC — Hardware | `dut-ucs-cimc-hardware` | UCS blade: CPU, memory, disk, network adapters, IOC controllers |
| TLS Decrypt Mode | `decrypt-mode` | Issuer-cert ground-truth probe (ACTIVE / BYPASS), TLSDecryptModeChanged alert state |
| Cloner health | `cloner` | ISP up, ping 8.8.8.8 / 1.1.1.1, gateway ping, RTT timeseries |
Security posture (high level)¶
For the complete 12-layer security posture (Zero-Trust-on-Premises), see
`SECURITY_ZTP_PREM.md` and the § ZTP-prem section above. This section covers the operational baseline that applies to every deployment regardless of ZTP-prem layer-flip state.
- Containers: non-root, dropped Linux capabilities, RuntimeDefault seccomp, read-only root filesystem (where possible).
- Egress `NetworkPolicy` blocks RFC1918 + link-local CIDRs from the agent fleet by default.
- `SECURITY.md` defines the reporting flow + SLA; `audit_log` records every config mutation; `gitleaks` blocks committed secrets in CI; `trivy` scans both images on every PR; CodeQL static analysis runs weekly.
- Container images are pushed multi-arch and signed with Cosign keyless via the GitHub OIDC provider; an SBOM (SPDX) is attached as a registry attestation on every release tag.
- The pre-flight check engine runs a 5-check catalog before every test-run start: subnet conflict (OOBI ↔ ISP ↔ persona VLANs), NGFW reachability + decrypt mode matches the plan, persona PKI freshness, NTP relay clock skew within tolerance, DUT API auth + snapshot succeeds. Bypass is logged in `audit_log` and printed on the Test Run Report cover.
- DUT API credentials are encrypted at rest with AES-GCM (`encryption.ts`); sanitized snapshots are hashed (SHA-256) and embedded in the Test Run Report Annex B/C/D so a reviewer can verify what device state held during the run.
- The license acceptance modal records every first-load session in `audit_log` against the `LicenseRef-PolyForm-Noncommercial-1.0.0-with-Appendix-A` identifier; deployments can revoke session acceptance centrally if an audience-policy violation is suspected.
- IP protection — all forensic identifiers (cert fingerprints, deployment hashes, asset signatures, decrypt-mode timeline) live in a separate `private/forensic` repository with the project owner as sole collaborator; the public repo never contains identity-binding data. Rationale: GitHub does not support per-branch ACLs, so any in-tree forensic branch would leak to every collaborator with `pull` on the public repo. See `docs/IP_PROTECTION.md`.