# TLSStress.Art — System Overview
Scope status (post-Scope-Freeze 2026-05-10) — see ARCHITECTURE.md for the canonical 37 MÓDULOs + 7 Test Kinds + DOM/CPOS/PIE-PA safety architecture. ADRs 0014 and 0019–0025 cover the post-Freeze additions.
A comprehensive technical reference for engineers, operators, and new team members. This document explains what the system is, why it exists, and how every part fits together.
## Table of Contents
- Project Purpose
- Two-Leg TLS Architecture
- Physical / Logical Topology
- browser engine Agents
- synthetic-load engine Agents
- Synthetic Personas (20 webservers)
- Cloned Personas (10 slots)
- The Cloner
- Certificate Orchestration (PKI)
- Dashboard (Orchestration)
- Observability
- Performance Tuning
- Operational Subsystems (DUT API, Test Plans, Reports, Pre-flight, Time-sync, Syslog)
- Onboarding & IP Protection
- Production Deployment Steps
- Key File Reference Table
## 1. Project Purpose
TLSStress.Art is a lab test-bed designed to measure the performance of a Next-Generation Firewall (NGFW, the Device Under Test — DUT) when performing TLS decryption and inspection at scale.
- Primary protocol under test: HTTP/3 over QUIC/UDP 443.
- Secondary protocol: HTTP/2 over TCP 443.
- Workload mix: real browser traffic (browser engine + headless Chromium) plus synthetic high-volume HTTP load (synthetic-load engine).
- Scope: lab only. No production data, no public Internet exposure for the agent fleet, no security requirements beyond what is needed for the test to be representative.
The system intentionally exercises the most CPU-expensive code paths inside the NGFW — full TLS handshake on every connection, no session resumption, modern AEAD ciphers — so that performance measurements reflect worst-case inspection cost.
## 2. Two-Leg TLS Architecture
Every request is encrypted twice on its way through the NGFW. The NGFW terminates the inbound TLS session, inspects the cleartext, and re-originates a new TLS session toward the webserver. This is the workload that consumes NGFW CPU and is what we want to measure.
```
+------------------+    TLS Leg 1    +------------+    TLS Leg 2    +-------------------+
|      Agent       |   agent-side    |    NGFW    |   server-side   |   Persona Caddy   |
| (browser-engine/ | --------------> |   (DUT)    | --------------> |    (webserver)    |
|  synthetic-load) |     decrypt     |  inspect   |   re-encrypt    |                   |
| trusts: ngfw-ca  |                 |  trusts:   |                 | cert: persona-ca  |
+------------------+                 | persona-ca |                 +-------------------+
                                     +------------+
```
| Leg | Direction | Server cert presented by | Client trusts |
|---|---|---|---|
| Leg 1 | Agent → NGFW | `ngfw-ca`-issued | `ngfw-ca` ConfigMap (mounted in agent) |
| Leg 2 | NGFW → Webserver | `persona-ca-issuer`-issued | `persona-ca` (imported into the NGFW) |
Why two legs: the NGFW must decrypt Leg 1, parse and inspect the cleartext payload, then re-encrypt onto Leg 2. The cost of this decrypt-inspect-encrypt cycle, multiplied by the request rate, is what the test measures.
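As a rough illustration of why this cycle dominates DUT CPU, the cost can be sketched as a back-of-envelope model. The per-handshake and per-inspection millisecond costs below are assumed placeholder numbers for illustration, not measured values:

```python
# Hedged back-of-envelope model: CPU cores consumed by the
# decrypt-inspect-re-encrypt cycle at a given request rate.
def dut_cpu_cores_needed(requests_per_sec: float,
                         handshake_ms: float = 1.2,   # assumed CPU cost per full handshake
                         inspect_ms: float = 0.3) -> float:  # assumed inspection cost
    """CPU-seconds per second (= cores) for two handshakes + one inspection per request."""
    per_request_ms = 2 * handshake_ms + inspect_ms    # Leg 1 + Leg 2 + inspect
    return requests_per_sec * per_request_ms / 1000.0

# With session resumption disabled, every request pays the full 2x handshake cost:
print(round(dut_cpu_cores_needed(10_000), 1))  # cores at 10k req/s under these assumptions
```

Because resumption is disabled fleet-wide (§9), the two-handshake term never amortizes, which is the intended worst case.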
## 3. Physical / Logical Topology
### 3.1 VLAN Map

| VLAN | Purpose | Subnet | Gateway |
|---|---|---|---|
| 20 | browser-engine agents | 172.16.0.0/16 | 172.16.0.1 (NGFW) |
| 30 | synthetic-load agents | 172.17.0.0/16 | 172.17.0.1 (NGFW) |
| 40 | Cloner ISP egress | DHCP from upstream | upstream router |
| 99 | SNMP management | 192.168.90.0/24 | management gateway |
| 101–120 | Synthetic Personas (20) | 10.1.{1..20}.0/27 | NGFW (10.1.x.1) |
| 200–209 | Cloned Personas (10 slots) | 10.2.{1..10}.0/27 | NGFW (10.2.x.1) |
### 3.2 Agent Routing

Defined in `k8s/dut/20-network-attachments.yaml`:

- `eth0` — Kubernetes OOBI (Out-Of-Band Infrastructure). Default route. Used to reach Dashboard, Prometheus, CoreDNS, Postgres. Never routed through the NGFW.
- `net1` — macvlan attachment on the Persona VLANs. Carries traffic to:
  - `10.1.0.0/16` via the NGFW (Synthetic Personas)
  - `10.2.0.0/16` via the NGFW (Cloned Personas)
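The routing split can be sketched as a small classifier. The subnets are the ones from this section; the function itself is an illustration of the decision, not the CNI logic:

```python
# Sketch of the agent pod's egress decision (subnets from section 3).
from ipaddress import ip_address, ip_network

PERSONA_NETS = [ip_network("10.1.0.0/16"),   # Synthetic Personas, via the NGFW
                ip_network("10.2.0.0/16")]   # Cloned Personas, via the NGFW

def egress_interface(dst: str) -> str:
    """Return which pod interface a destination is reached through."""
    addr = ip_address(dst)
    if any(addr in net for net in PERSONA_NETS):
        return "net1"   # macvlan leg, routed through the NGFW (DUT)
    return "eth0"       # OOBI default route: Dashboard, Prometheus, DNS, Postgres

print(egress_interface("10.1.7.5"))    # a Synthetic Persona address
print(egress_interface("10.96.0.10"))  # CoreDNS on the OOBI fabric
```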
```
     +------------------------+
     |    K8s OOBI fabric     |
     | Dashboard, Prometheus, |
     |   CoreDNS, Postgres    |
     +-----------+------------+
                 | eth0 (default)
                 |
 +---------------+---------------+
 |           Agent pod           |
 |      (browser engine or       |
 |    synthetic-load engine)     |
 +---------------+---------------+
                 | net1 (macvlan)
                 |
           VLAN 20 / 30
                 |
         +-------+-------+
         |     NGFW      |
         |     (DUT)     |
         +-------+-------+
                 |
      +----------+----------+
      |                     |
VLANs 101-120         VLANs 200-209
Synthetic Personas    Cloned Personas
```
### 3.3 Deployment Topologies
Four Linux+Kubernetes layouts are supported. All four are wired by scripts/k8s-install.sh (see the dedicated quickstart guides for the full procedure per language).
| Mode | UCS count | OOBI on every UCS | browser engine | synthetic-load engine | Personas + services + observability | Overlay | Guide |
|---|---|---|---|---|---|---|---|
| Single-node | 1 | ✓ (eth0) | same UCS | same UCS | same UCS | none — base `k8s/` + `k8s/dut/` | `UBUNTU_K3S_SINGLENODE_QUICKSTART_DEPLOY.{en,es,pt-BR}.md` |
| Dual-node | 2 | ✓ on UCS-1 + UCS-2 (eth0) | UCS-1 (`role=agents`) | UCS-1 (`role=agents`) | UCS-2 (`role=ngfw-dut`) | `overlays/dual-node/` | `UBUNTU_K3S_DUALNODE_QUICKSTART_DEPLOY.{en,es,pt-BR}.md` |
| Tri-node | 3 | ✓ on UCS-1..3 (eth0) | UCS-1 (`role=playwright`) | UCS-2 (`role=k6`) | UCS-3 (`role=ngfw-dut`) | `overlays/tri-node/` | `UBUNTU_K3S_TRINODE_QUICKSTART_DEPLOY.{en,es,pt-BR}.md` |
| Multi-node | 4 | ✓ on UCS-1..4 (eth0) | UCS-2 (`role=playwright`) | UCS-3 (`role=k6`) | UCS-1 personas (`role=ngfw-dut`) + UCS-4 services (`role=infra`) | `overlays/multi-node/` | `UBUNTU_K3S_MULTINODE_QUICKSTART_DEPLOY.{en,es,pt-BR}.md` |
Key invariants across modes:
- OOBI (eth0) is mandatory on every UCS. It carries the k3s API, kubelet, flannel CNI, and Prometheus scrape. Without it the node cannot join the cluster.
- The `node_exporter` DaemonSet has no `nodeSelector` and runs on every node — host-metrics coverage is identical in all four modes.
- The `node-tuning` DaemonSet is patched to `dut-data-plane=true` after apply — every UCS that owns a data-plane VLAN receives the sysctls + BBR + CPU governor + THP tuning.
- Personas, cloned personas, and the SNMP exporter all hardcode `nodeSelector: role=ngfw-dut`. In single-node, dual-node, and tri-node the corresponding UCS carries that label; in multi-node it is UCS-1.
## 4. browser engine Agents
Real-browser traffic generators based on browser engine + headless Chromium.
- Manifest: `k8s/20-agent-deployment.yaml`
- Base replicas: 10
- HPA range: 1 → 300 replicas
- Controller URL: `https://dashboard.web-agents.svc.cluster.local`
### Lifecycle

```
register -> poll for URL -> navigate (Chromium) -> collect timing -> report -> wait -> repeat
```
Timing metrics include DNS, TCP/QUIC connect, TLS handshake, time to first byte, and full page load. The agent reports back to the Dashboard after each cycle.
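A minimal sketch of that cycle, with the browser navigation and Dashboard transport stubbed out. The endpoint names in the comments are from §10 (Key APIs); the timing field names here are illustrative stand-ins:

```python
# Hedged sketch of one agent cycle; navigation and reporting are injected
# so the loop itself can be shown (and tested) without Chromium or a network.
import time

def run_cycle(fetch_url, navigate, report):
    url = fetch_url()                       # GET /api/agents/poll
    t0 = time.monotonic()
    timing = navigate(url)                  # headless Chromium visit (stubbed here)
    timing["total_ms"] = (time.monotonic() - t0) * 1000
    report({"url": url, **timing})          # POST /api/agents/result
    return timing

# Stub transport for illustration (the hostname is hypothetical):
results = []
timing = run_cycle(
    fetch_url=lambda: "https://blog.persona.test/",
    navigate=lambda url: {"dns_ms": 2.0, "tls_ms": 9.0, "ttfb_ms": 30.0},
    report=results.append,
)
print(sorted(timing))
```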
### DUT-mode patch

`k8s/dut/40-playwright-patch.yaml` overlays:

- Adds a `net1` macvlan attachment on VLAN 20.
- Mounts the `ngfw-ca` ConfigMap as `NODE_EXTRA_CA_CERTS`.
- Sets `REJECT_INVALID_CERTS=true` and `REQUIRE_TLS13=true` to fail loudly on misconfigured trust.
| Variable | Base mode (no NGFW) | DUT mode |
|---|---|---|
| `NODE_EXTRA_CA_CERTS` | `persona-ca-bundle` | `ngfw-ca` ConfigMap |
| `REJECT_INVALID_CERTS` | (default) | `true` |
| `REQUIRE_TLS13` | (default) | `true` |
## 5. synthetic-load engine Agents
High-volume HTTP load generators, complementary to browser engine. synthetic-load engine produces traffic without the overhead of a full browser engine, allowing higher request rates per pod.
- Manifest: `k8s/21-k6-agent-deployment.yaml`
- Base replicas: 1
- HPA range: 0 → 1000 replicas
### DUT-mode patch

- Adds a `net1` macvlan attachment on VLAN 30.
- Sets both `SSL_CERT_FILE` and `NODE_EXTRA_CA_CERTS` to the `ngfw-ca` ConfigMap.
synthetic-load engine and browser engine share the same Dashboard control plane and the same Persona endpoints. The two fleets together produce the request mix used to stress the NGFW.
## 6. Synthetic Personas (20 webservers)
A fleet of 20 distinct webservers that mimic the kinds of sites a real user fleet would hit during a typical browsing day. Each persona has its own namespace, its own VLAN, its own /27 subnet, and its own gateway on the NGFW.
- Source of truth: `personas.yaml`.
- Generated: `personas/_generated/` (do not hand-edit).
- Per-persona namespace: `persona-{name}`.
- Webserver: Caddy (HTTP/1.1 + HTTP/2 + HTTP/3), with optional sidecar backend.
- Certificate: issued by `persona-ca-issuer` (cluster-wide).
### Archetypes

| Archetype | Letter | Backend | Personas |
|---|---|---|---|
| skin | C | Caddy `file_server` over content seeded by initContainer | blog, docs, gallery, stream, download, edu, gov, cdn |
| real-app | A / B | Caddy `reverse_proxy` to a real-app sidecar | shop (Saleor + Redis + Postgres), news (Ghost) |
| mock | D | Caddy `reverse_proxy` to a Go mock-engine sidecar | api-rest, api-graphql, chat, webhook, telemetry, ads |
| har-replay | F | Caddy `reverse_proxy` to a Go HAR-replay engine sidecar | har-saas, har-social, har-webmail, har-media |
### Full VLAN map

| Persona | VLAN | Subnet | Gateway | Archetype |
|---|---|---|---|---|
| shop | 101 | 10.1.1.0/27 | 10.1.1.1 | real-app A |
| news | 102 | 10.1.2.0/27 | 10.1.2.1 | real-app B |
| blog | 103 | 10.1.3.0/27 | 10.1.3.1 | skin |
| docs | 104 | 10.1.4.0/27 | 10.1.4.1 | skin |
| gallery | 105 | 10.1.5.0/27 | 10.1.5.1 | skin |
| stream | 106 | 10.1.6.0/27 | 10.1.6.1 | skin |
| download | 107 | 10.1.7.0/27 | 10.1.7.1 | skin |
| edu | 108 | 10.1.8.0/27 | 10.1.8.1 | skin |
| gov | 109 | 10.1.9.0/27 | 10.1.9.1 | skin |
| cdn | 110 | 10.1.10.0/27 | 10.1.10.1 | skin |
| api-rest | 111 | 10.1.11.0/27 | 10.1.11.1 | mock |
| api-graphql | 112 | 10.1.12.0/27 | 10.1.12.1 | mock |
| chat | 113 | 10.1.13.0/27 | 10.1.13.1 | mock |
| webhook | 114 | 10.1.14.0/27 | 10.1.14.1 | mock |
| telemetry | 115 | 10.1.15.0/27 | 10.1.15.1 | mock |
| ads | 116 | 10.1.16.0/27 | 10.1.16.1 | mock |
| har-saas | 117 | 10.1.17.0/27 | 10.1.17.1 | har-replay |
| har-social | 118 | 10.1.18.0/27 | 10.1.18.1 | har-replay |
| har-webmail | 119 | 10.1.19.0/27 | 10.1.19.1 | har-replay |
| har-media | 120 | 10.1.20.0/27 | 10.1.20.1 | har-replay |
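The whole map follows one addressing rule (VLAN 100 + n, subnet 10.1.n.0/27, gateway .1 on the NGFW), which a short sketch can regenerate:

```python
# Sketch regenerating a persona's addressing from its ordinal (1..20).
from ipaddress import ip_network

def persona_net(n: int) -> dict:
    """Addressing for the n-th Synthetic Persona."""
    subnet = ip_network(f"10.1.{n}.0/27")
    return {"vlan": 100 + n,
            "subnet": str(subnet),
            "gateway": str(subnet.network_address + 1)}  # NGFW interface (.1)

print(persona_net(7))   # the 'download' persona row above
```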
## 7. Cloned Personas (10 slots)
Pre-provisioned slots for serving content cloned from real public sites, used to extend the fleet beyond the 20 hand-curated Synthetic Personas.
- Manifests: `k8s/clone-personas/`.
- Slot count: 10, each pre-provisioned at 0 replicas.
- Per-slot namespace: `clone-persona-N` (1 ≤ N ≤ 10).
- Webserver: Caddy `file_server` serving `/mnt/cloned/{env.SITE_NAME}` from a read-only PVC. No reverse proxy — content is static, exactly as captured by the Cloner.
- Orchestration: the Dashboard exposes `PATCH /api/clone/persona-slots/{n}`, which patches the slot's ConfigMap (changing `SITE_NAME`) and scales the Deployment.
- Restart trigger: Stakater Reloader restarts the pod whenever `SITE_NAME` changes or the cert Secret is renewed by cert-manager.
### Slot VLAN map

| Slot | VLAN | Subnet | Gateway |
|---|---|---|---|
| 1 | 200 | 10.2.1.0/27 | 10.2.1.1 |
| 2 | 201 | 10.2.2.0/27 | 10.2.2.1 |
| 3 | 202 | 10.2.3.0/27 | 10.2.3.1 |
| 4 | 203 | 10.2.4.0/27 | 10.2.4.1 |
| 5 | 204 | 10.2.5.0/27 | 10.2.5.1 |
| 6 | 205 | 10.2.6.0/27 | 10.2.6.1 |
| 7 | 206 | 10.2.7.0/27 | 10.2.7.1 |
| 8 | 207 | 10.2.8.0/27 | 10.2.8.1 |
| 9 | 208 | 10.2.9.0/27 | 10.2.9.1 |
| 10 | 209 | 10.2.10.0/27 | 10.2.10.1 |
## 8. The Cloner
The Cloner is the only component allowed to reach the public Internet. It downloads complete sites and stores them on a shared PVC for the Cloned Persona slots to serve.
- Manifest: `k8s/81-cloner-deployment.yaml`.
- Replicas: 1.
### Network interfaces

| Interface | Attachment | Purpose |
|---|---|---|
| `eth0` | OOBI | Talk to Dashboard, CoreDNS, Postgres |
| `net1` | VLAN 40 macvlan (DHCP) | Public Internet via the upstream ISP gateway |
### DNS

DNS is forced (`dnsPolicy: None`) to a fixed list:

- `8.8.8.8` (Google)
- `208.67.222.222` (OpenDNS)
- `10.96.0.10` (CoreDNS, in-cluster)
### Policy routing
iptables marks any packet with a non-RFC1918 destination using fwmark. A separate routing table (table 100) routes marked packets out via the ISP gateway on net1. RFC1918 traffic continues to use the OOBI default route on eth0.
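The marking rule can be sketched as follows. The mark value `0x64` is illustrative; the actual iptables/`ip rule` configuration lives in the Cloner manifest:

```python
# Sketch of the Cloner's routing split: non-RFC1918 destinations get the fwmark
# (and leave via table 100 on net1); RFC1918 stays on the eth0 default route.
from ipaddress import ip_address, ip_network

RFC1918 = [ip_network(n) for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def fwmark_for(dst: str):
    addr = ip_address(dst)
    if any(addr in net for net in RFC1918):
        return None   # unmarked: follows the OOBI default route on eth0
    return 0x64       # marked: policy-routed out via the ISP gateway on net1

print(fwmark_for("8.8.8.8"), fwmark_for("10.96.0.10"))
```

Note that explicit RFC1918 networks are used here rather than a broader "private address" test, since the policy as described keys specifically on RFC1918.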
### Lifecycle

```
poll Dashboard for pending clone jobs
  -> download site via browser engine-extra (stealth mode) on net1
  -> store under /mnt/cloned/{site}/ on PVC `cloned-sites`
  -> report job completion to Dashboard
```
### Health monitor

Exposes Prometheus metrics on `:8081/metrics`:

| Metric | Meaning |
|---|---|
| `cloner_internet_up` | Gauge — Internet reachability (0/1) |
| `cloner_ping_rtt_ms` | Gauge — round-trip time to Internet probes (ms) |
| `cloner_gateway_up` | Gauge — local ISP gateway reachability (0/1) |
For deep operational details, see docs/CLONER.md and docs/CLONER_OPERATIONS.md.
## 9. Certificate Orchestration (PKI)
A single internal CA covers every persona — Synthetic and Cloned — so the NGFW operator only ever has to import one CA certificate.
### Trust chain

```
persona-selfsigned (ClusterIssuer)
        |
        v
persona-root-ca (Certificate, 10-year, in cert-manager namespace)
        |
        v
persona-ca-issuer (ClusterIssuer)
        |
        +-> persona-shop     (Certificate, 1 yr)
        +-> persona-news     (Certificate, 1 yr)
        +-> ...              (20 Synthetic Persona certs)
        +-> clone-persona-1  (Certificate, 1 yr)
        +-> ...              (10 Cloned Persona slot certs)
```
- Total persona certs: 30 (20 Synthetic + 10 Cloned), all issued by the same `persona-ca-issuer`.
- Single import: export the `persona-ca-bundle` `ca.crt` once and import it into the NGFW as a trusted server CA.
### Agent trust (base mode)
A persona-ca-bundle Secret is published in the web-agents namespace; agents mount ca.crt and pass it via NODE_EXTRA_CA_CERTS.
### Auto-restart on rotation
k8s/87-stakater-reloader.yaml deploys Stakater Reloader with watches across persona-* and clone-persona-* namespaces. When cert-manager rotates a cert Secret, the corresponding Deployment is rolling-restarted so the new key/cert is picked up without operator action.
### TLS posture (every Persona Caddy)
| Setting | Value |
|---|---|
| Protocols | h1 h2 h3 |
| Minimum TLS version | TLS 1.2 (TLS 1.3 implied for QUIC/h3) |
| Cipher suites | ECDHE+AEAD only — 6 suites, ECDSA listed first |
| Session tickets | disabled |
Disabling session tickets is intentional: every connection performs a full TLS handshake. This maximizes the CPU load on the NGFW per request, which is exactly what the test measures.
## 10. Dashboard (Orchestration)
The control plane for the entire fleet — agents register here, personas are scaled here, clone jobs are submitted here.
- Manifest: `k8s/50-dashboard.yaml`.
- Stack: Next.js 15.
- Replicas: 2.
### Key APIs

| Method | Path | Purpose |
|---|---|---|
| POST | `/api/agents/register` | Agent registration on startup |
| GET | `/api/agents/poll` | Agent fetches the next URL to hit |
| POST | `/api/agents/result` | Agent reports timing + outcome of a cycle |
| PATCH | `/api/personas/{name}` | Start/stop/scale a Synthetic Persona |
| POST | `/api/clone/jobs` | Create a clone job for the Cloner to consume |
| GET | `/api/clone/persona-slots` | List Cloned Persona slot status |
| GET | `/api/clone/persona-slots/{n}` | Slot detail |
| PATCH | `/api/clone/persona-slots/{n}` | Bind a slot to a cloned site name + scale replicas |
| GET / POST | `/api/admin/dut-api/devices` | DUT API device registry (CRUD, encrypted credentials) |
| POST | `/api/admin/dut-api/devices/{id}/test` | Live connectivity + auth test against a registered DUT |
| POST | `/api/admin/dut-api/devices/{id}/snapshot` | On-demand sanitized config snapshot |
| GET | `/api/admin/time-sync/status` | NTP relay reachability + per-host clock skew |
| POST | `/api/admin/time-sync/verify` | Force a fresh time-sync verification cycle |
| GET | `/api/admin/audit?actor=&from=&to=` | Audit log query (admin mutations, login attempts) |
| GET | `/api/test-plans` | List the 15-plan catalog (capacity, soak, decrypt …) |
| POST | `/api/test-runs/start` | Start a run from a plan (snapshots plan + DUT inventory) |
| POST | `/api/test-runs/{id}/preflight` | Run the 5-check preflight gate |
| GET | `/api/test-runs/{id}/report.json` | Canonical ReportData (Phase 1) |
| GET | `/runs/{id}/report` | Print-styled HTML report (page-level Phase 1 surface) |
### RBAC
The dashboard ServiceAccount holds a ClusterRole granting get/list/watch/patch on Deployments and ConfigMaps, scoped across all persona-* and clone-persona-* namespaces. This is what lets the Dashboard scale personas and rebind clone slots at runtime.
## 11. Observability

### Prometheus ServiceMonitors (Kubernetes mode, Prometheus Operator)
| ServiceMonitor | Namespace | Notes |
|---|---|---|
| `dashboard` | `web-agents` | |
| `pgbouncer` | `web-agents` | |
| `cloner` | `web-agents` | |
| `persona-{name}` × 20 | each `persona-*` | Relabel: `persona`, `persona_archetype`, `persona_vlan` |
| `clone-persona-{n}` × 10 | each `clone-persona-*` | Relabel: `cloned_persona`, `cloned_persona_vlan` |
### DUT SNMP scraping

Defined in `observability/prometheus/prometheus.dut.yml`:
| Job | Source |
|---|---|
| `snmp_nexus9000` | Cisco Nexus 9000 switch metrics |
| `snmp_ngfw_dut` | NGFW physical metrics — module configurable: `cisco_ftd`, `fortinet_fortigate`, `palo_alto`, `checkpoint`, etc. |
| `node_exporter` | Ubuntu Linux host metrics, file-based service discovery |
### Grafana
The default deployment ships seven dashboards:
| Dashboard | UID | What it covers |
|---|---|---|
| Agent fleet | `web-agents` | Agents alive, runs/min, cycle duration, throughput, error rate, webservers active, time series + logs panel |
| Cloner health | `cloner` | ISP up, ping 8.8.8.8 / 1.1.1.1, gateway ping, RTT timeseries |
| Cloned Personas | `cloned-personas` | Active slots, requests/s, p99 latency, bytes/s |
| Nexus 9000 | `dut-nexus9000` | Switch CPU/mem/temp/fans/PSU, CRC/FCS, queue drops, optical sensors |
| NGFW DUT | `dut-ngfw` | Throughput per VLAN, sessions, CPS, HW crypto engine, CPU/mem |
| Ubuntu Hosts (UCS) | `dut-ubuntu-hosts` | OS-level: CPU, load, memory, disk I/O, filesystem, network, processes |
| TLS Decrypt Mode | `decrypt-mode` | Issuer-cert ground-truth probe (ACTIVE/BYPASS), TLSDecryptModeChanged alert state |
Two more dashboards are bundled with the UCS CIMC adapter (Cisco UCS hardware monitoring, see §13 — Operational Subsystems).
### Syslog correlation (OOBI-only policy)
Promtail listens on UDP/TCP :514 bound exclusively to the OOBI subnet (192.168.90.0/24). NGFW + Nexus syslog enters Loki via this OOBI listener; data-plane VLANs (browser engine, synthetic-load engine, Personas, Cloner) are deliberately not allowed to ingest log entries — preventing log injection from the test-traffic path. Test-run timelines join Loki entries to runs by timestamp + DUT identity (DUT API snapshot), giving operators a single Grafana panel per run with NGFW errors, drops, and policy hits inline. See docs/SYSLOG_CORRELATION.md and docs/SYSLOG_OPERATIONS.md.
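The timestamp-plus-DUT-identity join can be sketched as a simple filter. The field names here are illustrative, not the Loki schema:

```python
# Sketch of the run-timeline join: keep only the log entries for the run's DUT
# that fall inside the run window. Timestamps are plain epoch numbers here.
def entries_for_run(run: dict, entries: list) -> list:
    return [e for e in entries
            if e["dut"] == run["dut"]
            and run["start"] <= e["ts"] <= run["end"]]

run = {"dut": "ngfw-1", "start": 100, "end": 200}
logs = [{"ts": 150, "dut": "ngfw-1", "msg": "policy hit"},
        {"ts": 150, "dut": "nexus-1", "msg": "link flap"},   # other device: excluded
        {"ts": 250, "dut": "ngfw-1", "msg": "drop"}]         # outside window: excluded
print(entries_for_run(run, logs))
```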
## 12. Performance Tuning

### Node tuning DaemonSet

`k8s/dut/85-node-tuning.yaml` applies the following on every node:
| Setting | Value | Reason |
|---|---|---|
| `net.core.rmem_max` / `wmem_max` | `67108864` (64 MB) | Required headroom for QUIC/UDP buffers |
| `net.ipv4.tcp_congestion_control` | `bbr` | Better throughput on the lossy DUT path |
| `net.core.default_qdisc` | `fq` | Required by BBR |
| CPU governor | `performance` | Avoid frequency-scaling jitter during measurement |
| Transparent Huge Pages | `defer+madvise` | Reduce stalls without giving up THP entirely |
### Caddy pods (Go runtime)
| Variable | Value |
|---|---|
| `GOMAXPROCS` | from `resourceFieldRef` (`limits.cpu`) |
| `GOMEMLIMIT` | 90% of the pod memory limit |
| `GOGC` | `200` |
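The 90% rule can be sketched as a conversion from a Kubernetes memory-limit quantity to a byte count. The simplified parser below assumes the common binary suffixes; the real manifest may compute the value differently:

```python
# Hedged sketch: derive a GOMEMLIMIT byte value at 90% of the pod memory limit.
UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}

def gomemlimit(mem_limit: str) -> str:
    """e.g. '512Mi' -> bytes at 90% of the limit, as a string for the env var."""
    for suffix, mult in UNITS.items():
        if mem_limit.endswith(suffix):
            return str(int(int(mem_limit[:-2]) * mult * 0.9))
    return str(int(int(mem_limit) * 0.9))   # plain byte count, no suffix

print(gomemlimit("512Mi"))
```

Leaving the remaining ~10% uncovered gives the Go GC headroom before the kernel OOM-killer would be reached.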
### Nexus 9000 switch
| Setting | Value |
|---|---|
| EEE | off |
| Flow control | off |
| MTU | 9216 (jumbo) |
| QoS DSCP | AF41 |
| ECMP hash | includes UDP source port (so QUIC flows distribute evenly) |
| ARP timeout | 300 s |
## 13. Operational Subsystems
The following subsystems shipped in v3.7+/v4.0.0 and form the operator-facing day-2 surface. Each is a thin, well-bounded module — not a heavyweight framework — so the engineering surface stays small.
### 13.1 DUT API Integration
Vendor adapters in dashboard/src/lib/dut-api/ capture sanitized config + identity from the Device Under Test on every test-run start (or on demand from /admin/dut-api). Four adapters ship today:
| Adapter | Transport | What it captures |
|---|---|---|
| `cisco-ftd.ts` | REST (cdFMC / FMC) | Model, serial, version, decrypt-policy state, sanitized running-config |
| `cisco-nexus.ts` | NX-API (HTTPS) | Model, serial, version, interface counters, port-channel state |
| `cisco-ucs-cimc.ts` | Redfish (HTTPS) | Chassis/blade hardware inventory, PSU, fans, temp, BIOS |
| `fortinet-fortigate.ts` | REST (FortiOS) | Model, serial, version, sanitized config, session table summary |
Credentials live encrypted-at-rest (encryption.ts); snapshots are hashed (SHA-256) and embedded in the Test Run Report Annex B/C/D. A polling worker (poller.ts) refreshes liveness every 60 s; failures show on the /admin/dut-api UI. Full reference: docs/DUT_API_INTEGRATION.md · docs/DUT_API_OPERATIONS.md.
### 13.2 Test Plans
platform/test-plans/catalog.yaml ships 15 vendored, git-versioned plans (capacity-find-knee, soak-30m, decrypt-on-vs-off, vendor-compare, …). Each plan declares phases[] with VU/duration/target-mix and requirements{} (ngfw_state_required, personas_min, decrypt_mode_required). When a run starts, the plan is frozen — planSnapshotSha256 is recorded so post-run review can prove the parameters were not edited. Full reference: docs/TEST_PLANS.md.
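The freeze can be sketched as hashing a canonical serialization of the plan at run start: any later edit changes the digest. The sorted-key JSON canonicalization below is illustrative, not necessarily how the repo computes `planSnapshotSha256`:

```python
# Sketch of the plan-freeze idea: a deterministic digest over the plan content.
import hashlib
import json

def plan_snapshot_sha256(plan: dict) -> str:
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

# Illustrative plan fragment (field names follow the description above):
plan = {"name": "soak-30m", "phases": [{"vus": 100, "duration": "30m"}]}
frozen = plan_snapshot_sha256(plan)
assert plan_snapshot_sha256(dict(plan)) == frozen             # same content, same digest
assert plan_snapshot_sha256({**plan, "name": "x"}) != frozen  # any edit is detectable
print(frozen[:12])
```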
### 13.3 Test Run Reports
Phase 1 ships a print-styled HTML report at /runs/{executionId}/report with the canonical ReportData JSON at /api/test-runs/{executionId}/report.json. The cover prints reportSha256 + planSnapshotSha256 + license badge + SLO pass/fail + budget burn. Phases 2–5 add server-rendered PDF (Puppeteer), DUT inventory annexes (Nexus / NGFW / UCS), Cosign signature + Rekor entry, and N-run comparison. Full reference: docs/REPORTS.md.
### 13.4 Pre-flight Check Engine
dashboard/src/lib/preflight/ runs a 5-check catalog before every test-run start; failures block. Catalog: subnet conflict (OOBI ↔ ISP ↔ persona VLANs), NGFW reachability + decrypt policy state matches plan, persona PKI freshness, NTP relay clock skew within tolerance, DUT API auth + snapshot succeeds. Operators bypass in non-strict mode but the bypass is logged in audit_log and printed on the report cover. Full reference: docs/PREFLIGHT_CHECKS.md.
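The first check (subnet conflict) can be sketched with the stdlib `ipaddress` module; the ranges below are the ones from §3, plus a deliberately colliding ISP lease for illustration:

```python
# Sketch of the subnet-conflict preflight check: fail if any two ranges overlap.
from ipaddress import ip_network
from itertools import combinations

def subnet_conflicts(ranges: dict) -> list:
    nets = {name: ip_network(cidr) for name, cidr in ranges.items()}
    return [(a, b) for (a, na), (b, nb) in combinations(nets.items(), 2)
            if na.overlaps(nb)]

ranges = {"oobi": "192.168.90.0/24",
          "personas": "10.1.0.0/16",
          "clones": "10.2.0.0/16"}
print(subnet_conflicts(ranges))                            # empty list -> check passes
print(subnet_conflicts({**ranges, "isp": "10.1.5.0/24"}))  # ISP lease collides -> block
```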
### 13.5 Time-sync Layer
The OOBI subnet runs an in-cluster chronyd relay. Every UCS host syncs to the relay; the relay holds drift to the upstream stratum-1 source (cluster operator chooses public NTP, GPS, or PTP grandmaster). On every test-run start, each agent + DUT records its current epoch + computed skew vs. the relay; reports flag any window that drifted beyond tolerance. For air-gapped labs, a browser-clock fallback button on /admin/time-sync lets the operator pin reference time from the operator's own laptop. Surfaces at /admin/time-sync + alerts via the OOBI Prometheus.
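The skew flagging can be sketched as follows. The 250 ms tolerance is an assumed example value, not the configured one:

```python
# Sketch of the run-start skew check: compare each host's reported epoch against
# the relay's and flag anything outside tolerance.
def skew_violations(relay_epoch: float, host_epochs: dict, tol_s: float = 0.25) -> dict:
    return {host: round(t - relay_epoch, 3)
            for host, t in host_epochs.items()
            if abs(t - relay_epoch) > tol_s}

hosts = {"ucs-1": 1000.02, "ucs-2": 999.98, "ngfw": 1000.60}
print(skew_violations(1000.0, hosts))   # only the NGFW drifted past tolerance
```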
### 13.6 Syslog Correlation
Already documented in §11 (Observability). Summary: Promtail bound to OOBI-only on UDP/TCP :514; NGFW + Nexus emit syslog → Loki; reports embed correlated entries by timestamp + DUT identity.
## 14. Onboarding & IP Protection

### 14.1 Onboarding chain
The path from "can I see this project?" to "I have a running install" is a 3-stage gated flow, with each stage having its own doc in 3 languages (EN / PT-BR / ES):
| Step | Doc | Audience | What happens |
|---|---|---|---|
| 1 | `ACCESS_REQUEST.md` | External pre/post-sales engineer | Submit Cisco Access Broker request; broker auto-recognizes Cisco/partner domains and routes to the project owner |
| 2 | `CLONE_FOR_INSTALL.md` | Approved engineer | Clone the (private) repo against your Cisco-issued GitHub identity; license acceptance modal records the session |
| 3 | `RUNBOOK_FIRST_INSTALL.md` | Newly onboarded engineer | Walk through first install (single/dual/tri/multi-node); bottom-of-doc breadcrumb back to BRAND, DUT_TESTBED, REPORTS |
Disconnected environments use AIRGAP_INSTALL.md as a step-3 substitute. Cross-references between the four docs form a chain so an engineer landing on any one of them can navigate forward and backward without leaving the doc set.
### 14.2 IP protection
All forensic identifiers — cert fingerprints, deployment hashes, asset signatures, TLS Decrypt-Mode snapshots — live in a separate private/forensic repo with the project owner as sole collaborator. The public repo never contains identity-binding data. This separation is policy-enforced because GitHub does not support per-branch ACLs: any collaborator with pull on the public repo would otherwise see every branch including a hypothetical in-tree forensic branch. Full policy: docs/IP_PROTECTION.md. Setup procedure for the private repo: docs/PRIVATE_REPO_SETUP.md.
## 15. Production Deployment Steps
The canonical bring-up is `scripts/k8s-install.sh`, which handles k3s, VLAN setup, cert-manager, Multus CNI, manifest application, and overlays in one step. It supports all four topologies via `--mode=` (`single`, `dual`, `tri`, and the `multi-*` stages shown below):
```shell
# Single-node (one UCS, all workloads):
sudo ./scripts/k8s-install.sh --mode=single --data-iface=eth1

# Multi-node (UCS-1 personas, UCS-2 browser engine, UCS-3 synthetic-load engine, UCS-4 services):
sudo ./scripts/k8s-install.sh --mode=multi-server --role=ngfw-dut --data-iface=eth1

# … then on each agent UCS:
sudo ./scripts/k8s-install.sh --mode=multi-agent --role=playwright --data-iface=eth1
sudo ./scripts/k8s-install.sh --mode=multi-agent --role=k6 --data-iface=eth1

# … and finally on the services UCS:
sudo ./scripts/k8s-install.sh --mode=multi-apply
```
For operators who prefer to apply manifests by hand, the equivalent ordered procedure is:
```shell
# 1. Base stack: namespaces, RBAC, Dashboard, agents, OOBI infra
kubectl apply -f k8s/

# 2. Platform services: PKI, DNS, observability, test-plans
kubectl apply -k platform/

# 3. Synthetic Personas (20 webservers, generated from personas.yaml)
kubectl apply -k personas/

# 4. Cloned Personas (10 pre-provisioned slots, initially scaled to 0)
kubectl apply -k k8s/clone-personas/

# 5. DUT overlay: macvlan attachments, node tuning, NGFW trust
kubectl apply -k k8s/dut/

# 6. Patch agents into DUT mode (adds net1, ngfw-ca trust, reject-on-bad-cert)
kubectl patch deployment playwright-agent -n web-agents --patch-file k8s/dut/40-playwright-patch.yaml
kubectl patch deployment k6-agent -n web-agents --patch-file k8s/dut/50-k6-patch.yaml

# 7. Export the persona CA and import into the NGFW as trusted server CA
kubectl get secret persona-ca-bundle -n web-agents -o jsonpath='{.data.ca\.crt}' | base64 -d > persona-ca.crt
# Then import persona-ca.crt into the NGFW's trusted CA store —
# or use scripts/inject-ngfw-ca.sh for a Cisco FTD / Nexus-vendor automation.
```
After step 7 the NGFW will trust every persona server cert and the Two-Leg TLS architecture is fully wired.
## 16. Key File Reference Table
| Path | Purpose |
|---|---|
| `personas.yaml` | Source of truth for all 20 Synthetic Personas |
| `personas/_generated/` | Generated per-persona manifests — do not hand-edit |
| `k8s/20-agent-deployment.yaml` | browser-engine agents (10 base, HPA 1–300) |
| `k8s/21-k6-agent-deployment.yaml` | synthetic-load agents (1 base, HPA 0–1000) |
| `k8s/50-dashboard.yaml` | Dashboard (Next.js 15, 2 replicas) |
| `k8s/81-cloner-deployment.yaml` | Cloner pod (single replica, dual NIC) |
| `k8s/87-stakater-reloader.yaml` | Reloader for cert rotation + slot rebinding |
| `k8s/dut/20-network-attachments.yaml` | macvlan NetworkAttachmentDefinitions for the agent fleet |
| `k8s/dut/40-playwright-patch.yaml` | Patches browser-engine agents into DUT mode (VLAN 20, ngfw-ca) |
| `k8s/dut/50-k6-patch.yaml` | Patches synthetic-load agents into DUT mode (VLAN 30, ngfw-ca) |
| `k8s/dut/85-node-tuning.yaml` | Sysctls + CPU governor + THP DaemonSet |
| `k8s/clone-personas/` | 10 pre-provisioned Cloned Persona slot manifests |
| `platform/pki/` | persona-selfsigned, persona-root-ca, persona-ca-issuer |
| `platform/dns/` | CoreDNS configuration |
| `platform/observability/` | Prometheus Operator + Grafana |
| `observability/prometheus/prometheus.yml` | Base Prometheus configuration |
| `observability/prometheus/prometheus.dut.yml` | DUT-mode Prometheus: SNMP + node_exporter scrape jobs |
| `dashboard/src/lib/dut-api/` | 4 vendor adapters (Cisco FTD, Nexus, UCS CIMC Redfish, FortiGate) |
| `dashboard/src/lib/preflight/` | Pre-flight check engine (5-check catalog, blocking gate) |
| `dashboard/src/app/admin/dut-api/` | DUT API admin UI: register, test, snapshot, preflight |
| `dashboard/src/app/admin/time-sync/` | Time-sync admin UI: NTP relay status, browser-clock fallback |
| `dashboard/src/app/admin/audit/` | Audit log viewer (admin mutations, login attempts) |
| `dashboard/src/app/api/test-runs/` | Test-run lifecycle (start, preflight, report.json) |
| `platform/test-plans/catalog.yaml` | The 15-plan catalog (capacity, soak, decrypt, vendor-compare, …) |
| `scripts/k8s-install.sh` | One-shot k3s + manifests installer (single/dual/tri/multi-node) |
| `scripts/host-tuning.sh` | Sysctls + CPU governor + THP + (optional) cpuManagerPolicy=static |
| `scripts/inject-ngfw-ca.sh` | Push NGFW CA bundle into agent ConfigMap |
| `scripts/secrets-init.sh` | Bootstrap k8s Secrets (controller token, postgres, dashboard auth) |
| `docs/CLONER.md` | Cloner: architecture, behavior, troubleshooting |
| `docs/CLONER_OPERATIONS.md` | Cloner: day-2 operations and runbooks |
| `docs/DUT_API_INTEGRATION.md` | DUT API: vendor adapters, snapshot, polling |
| `docs/DUT_API_OPERATIONS.md` | DUT API ops: registration, snapshot, troubleshooting |
| `docs/PREFLIGHT_CHECKS.md` | 5-check catalog (subnet, reach, PKI, NTP, DUT API) |
| `docs/SYSLOG_CORRELATION.md` | Syslog OOBI-only correlation policy |
| `docs/SYSLOG_OPERATIONS.md` | Syslog ops: Promtail :514, Loki, Grafana correlation |
| `docs/TEST_PLANS.md` | The 15 catalog plans + plan-snapshot semantics |
| `docs/REPORTS.md` | Test Run Reports (Phase 1 shipped, Phases 2–5 roadmap) |
| `docs/MONITORING_TEST_VALIDITY.md` | Alerts that prove the test-bed itself was healthy during a run |
| `docs/TLS_DECRYPT_MODE_VERIFICATION.en.md` | Independent issuer-cert probe (decrypt ACTIVE/BYPASS, alert) |
| `docs/ACCESS_REQUEST.md` | Onboarding step 1: lab access via Cisco Access Broker |
| `docs/CLONE_FOR_INSTALL.md` | Onboarding step 2: clone repo for first install |
| `docs/RUNBOOK_FIRST_INSTALL.md` | Onboarding step 3: first-install runbook |
| `docs/AIRGAP_INSTALL.md` | Onboarding addendum: air-gapped install procedure |
| `docs/IP_PROTECTION.md` | IP protection policy (private/forensic separation) |
| `docs/PRIVATE_REPO_SETUP.md` | Setup procedure for the private companion repo |
| `docker-compose.cloner.yml` | Cloner stack for Docker-based dev (split topology) |
End of System Overview. For component deep-dives, see the per-topic documents in docs/.