TLSStress.Art — What it does¶
Read in your language: English · Português · Español
Scope status (post-Scope-Freeze 2026-05-10) — see ARCHITECTURE.md for the canonical 37 MÓDULOs + 7 Test Kinds + DOM/CPOS/PIE-PA safety architecture. ADRs 0014 and 0019–0025 cover post-Freeze additions.

Web Agent Cluster for NGFW TLS Inspection performance test — an open-source test bench that measures the actual TLS inspection capacity of Next-Generation Firewalls (NGFWs) under HTTP/2 and HTTP/3 load.
Author: André Luiz Gallon · License: PolyForm Noncommercial 1.0.0 + Appendix A · Audience: Cisco employees and certified partners
1. Why this software exists¶
When a company needs to test the TLS inspection capacity of a corporate firewall, the market offers two extremes; this project sits between them:
| Category | Examples | Limitation |
|---|---|---|
| Free tools | TRex, iPerf3, wrk, Locust | Do not generate realistic TLS traffic at scale — raw packets or simple HTTP, no JavaScript, no full TLS handshake, no HTTP/3 |
| Commercial appliances | Spirent CyberFlood, Ixia BreakingPoint, IXIA IxLoad | US$ 50K–500K per chassis; proprietary hardware; hard to automate in CI/CD |
| TLSStress.Art | This project (open-source, PolyForm Noncommercial) | Runs on commodity hardware (Ubuntu + k3s); real browser (browser engine) + synthetic load (k6); native HTTP/2 and HTTP/3; no license cost for the eligible audience |
TLSStress.Art fills this gap: it provides TLS inspection testing with production-level realism — real browsers, JavaScript, cookies, valid certificates — at no license cost and with full automation via Kubernetes and GitHub Actions.
The question the product answers: how many simultaneous TLS connections can the firewall handle before performance degrades — and what is the comparative impact between HTTP/2 and HTTP/3?
2. How it works — 30-second view¶
┌─────────────────────────┐
│ Load robots │ browser engine (real browser) + k6 (HTTP)
│ up to 300 browser + 1,000 k6 │ generate TLS traffic against the personas
└──────────┬──────────────┘
│ Leg 1: encrypted HTTPS (robots trust the firewall cert)
▼
┌─────────────────────────┐
│ FIREWALL (DUT) │ Appliance under test — decrypts, inspects,
│ — under test, measured │ re-encrypts every packet
└──────────┬──────────────┘
│ Leg 2: re-encrypted HTTPS (firewall talks to persona)
▼
┌─────────────────────────┐
│ 30 Caddy personas │ 20 Synthetic (always-on) + 10 Cloned slots
│ 20 Synthetic + 10 │ (operator picks public sites to clone)
│ Cloned slots │
└─────────────────────────┘
┌─── Telemetry collected in parallel (4 pillars) ─────┐
▼ ▼
SNMP exporter Promtail/Loki DUT API REST OpenTelemetry → Tempo
(CPU/mem/IF of (events from Nexus, (config + state (distributed tracing
Nexus + DUT) NGFW, UCS hosts) of FTD/Nexus/ opt-in in agents)
UCS/Fortinet)
Leg 1 — robots open HTTPS connections to persona addresses. The firewall intercepts and presents its own certificate (operator configures agents to trust ngfw-ca).
Leg 2 — the firewall opens a second HTTPS connection to the destination persona. Certificate is issued by cert-manager (CA persona-ca).
4 telemetry pillars collect evidence simultaneously — metrics (Prometheus + SNMP), events (syslog correlation via Loki), device state (DUT API REST), and traces (Tempo) — answering not only "what was the p99" but also "what happened on the DUT when p99 spiked".
3. The 30 personas¶
The persona fleet is split into two groups:
20 Synthetic (always-on)¶
Defined in personas.yaml, generated from Caddy templates. Each persona runs in a dedicated Kubernetes namespace, with TLS certificate, fixed IP, exclusive VLAN (101–120), and dedicated DNS entry.
Distributed across 4 archetypes:
| Archetype | Examples | How traffic is generated |
|---|---|---|
| skin | CDN, blog, portal, news, gov, edu, gallery, stream, download, docs | Synthetic HTML/CSS/JS; high-throughput static assets |
| mock | api-rest, api-graphql, chat, webhook, telemetry, ads | YAML-configurable JSON/XML; simulates microservices |
| har-replay | har-saas, har-social, har-webmail, har-media | Replays HAR (HTTP Archive) recordings of real sessions |
| real-app | shop (Saleor, full e-commerce) | Real application with database, state, dynamic JavaScript |
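As an illustration of how a Synthetic persona might be declared, here is a hypothetical sketch of one entry — the field names are assumptions, not the actual personas.yaml schema:

```
# Hypothetical sketch — field names are illustrative, not the real personas.yaml schema
- name: news
  archetype: skin
  vlan: 104                     # one exclusive VLAN per persona (101–120)
  namespace: persona-news       # dedicated Kubernetes namespace
  dns: news.persona.internal    # dedicated DNS entry
  tls:
    issuer: persona-ca          # certificate issued by cert-manager
```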
10 Cloned slots (filled by the operator)¶
VLANs 200–209. Operator picks a public site (e.g., globo.com, cnn.com), the Cloner crawls + snapshots the static content, and the replica then serves locally as a "real" persona on *.persona.internal.
The Cloner uses three distinct network interfaces: OOBI (mgmt), VLAN 40 (DHCP from the customer's ISP, to download public Internet content), and a macvlan VLAN inside the cluster (to serve cloned content to the other personas/agents).
4. The 4 telemetry pillars¶
The product does not measure latency only. To answer why a metric changed — a prerequisite for an NGFW test to be valid — we collect evidence from four sources simultaneously:
| Pillar | Source | What it captures | Endpoint |
|---|---|---|---|
| 1. Metrics (SNMP + Caddy + agents) | SNMP exporter, Caddy `/metrics`, browser-engine/synthetic-load | CPU, memory, and interface counters of switch and firewall; req/s, bytes, active connections, QUIC vs TCP on personas; latency, throughput, error rate, TLS handshake on agents | Prometheus → Grafana |
| 2. Events (syslog correlation) | Promtail (NodePort 30514) → Loki | Events from Nexus 9000, NGFW DUT, UCS hosts. OOBI-only policy (NEVER on the data plane). Ingestion enforced by two cluster-level NetworkPolicy resources | Loki → Grafana Explore + "Syslog Correlation" dashboard |
| 3. DUT state (REST API) | Polling worker in Dashboard pod | For each registered DUT: version, NTP config, SSL/TLS inspection policy, HA state, LLDP/CDP neighbors, hardware inventory. Snapshots signed with SHA-256 (forensic chain of custody) | DUT API REST adapters (FTD, Nexus, UCS, Fortinet) → Postgres `dut_api_snapshots` |
| 4. Traces (OpenTelemetry → Tempo) | Opt-in SDK in browser-engine + synthetic-load | Distributed tracing per test cycle, with spans for each TLS handshake. Covers the end-to-end path: agent → NGFW → persona | Tempo → Grafana → Dashboard |
The combination of all four enables answers like: "p99 latency spiked at 14:23 → confirmed in syslog: NGFW logged high CPU at 14:22 → confirmed in DUT API: decryption policy was changed by another operator 4 minutes earlier → trace shows handshake taking 380ms vs 95ms baseline".
5. DUT API — multi-vendor integration¶
Adapter pattern with 4 vendors shipped (Palo Alto on roadmap):
| Vendor | Adapter | Auth | Status |
|---|---|---|---|
| Cisco FTD (FDM-managed) | `cisco-ftd.ts` | OAuth2 → Bearer | ✅ Shipped |
| Cisco Nexus 9000 | `cisco-nexus.ts` | NX-API cookie (APIC-cookie) | ✅ Shipped |
| Cisco UCS C-Series CIMC | `cisco-ucs-cimc.ts` | Basic Auth (Redfish) | ✅ Shipped |
| Fortinet FortiGate | `fortinet-fortigate.ts` | API key Bearer (FortiOS REST v2) | ✅ Shipped |
| Palo Alto (PAN-OS) | — | API key (XML API) | 📋 Roadmap |
Operator workflow:
1. UI at /admin/dut-api registers a device (URL, credentials — encrypted with AES-256-GCM)
2. Polling worker in Dashboard collects snapshots every N minutes (configurable; default 5min)
3. Each snapshot becomes a row in dut_api_snapshots with JSON payload + SHA-256 of the canonical form
4. Snapshots show up in Grafana dashboards ("DUT Live State") and (in the roadmap) in Annexes B/C/D of the Test Run Report
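A minimal sketch of step 3. The product's exact canonicalization rules are not documented here, so this assumes a simple sorted-key, whitespace-free JSON canonical form; the function names are illustrative:

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch: canonicalize a snapshot payload (keys sorted, no
// whitespace) so that semantically equal payloads hash identically.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const entries = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalize(obj[k])}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

// SHA-256 over the canonical form — the value stored next to the JSON payload.
export function snapshotHash(payload: unknown): string {
  return createHash("sha256").update(canonicalize(payload)).digest("hex");
}

// Key order no longer affects the hash:
console.log(snapshotHash({ b: 1, a: 2 }) === snapshotHash({ a: 2, b: 1 })); // true
```

Hashing a canonical form (rather than the raw JSON string) is what makes the forensic chain robust: re-serializing the same device state always yields the same digest.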
Catalog of 45 features mapped per vendor — which adapter exposes NTP config, which exposes HA state, etc. — in docs/API_FEATURE_CATALOG.{md,pt-BR,es}.md.
6. Test Plans + Test Run Reports¶
Test Plan engine — 15 pre-configured plans (catalog in docs/TEST_PLANS.{md,pt-BR,es}.md):
- Stable identifiers (string ID, not hash) for cross-engagement comparisons
- Validated by Zod schema, sync'd to Postgres on boot
- Each plan declares the expected NGFW state (e.g., ngfw_state_required: decryption-on)
- Plans cover: H2/H3 baseline, max-CPS handshake, sustained throughput, decryption-on vs decryption-off, cert failures, mixed protocol, etc.
Test Run Report — Phase 1 shipped (#185):
- Print-styled HTML (/runs/{id}/report) with cover, license page in 3 languages, executive summary, plan config, placeholder annexes
- License footer pinned on every page
- SHA-256 hashes of the full report + plan-snapshot + per-annex (forensic chain)
- Parallel JSON API: /api/test-runs/{id}/report.json
Roadmap of pending phases:
- Phase 2 — Puppeteer server-side render → real PDF (currently HTML print)
- Phase 3 — Annexes B/C/D wired (Nexus/NGFW/UCS snapshots embedded in the report)
- Phase 4 — Cosign signing + Rekor/Sigstore entry (public transparency log)
- Phase 5 — N-run comparison (run A vs run B), trend rollups (weekly/monthly), replay snapshot
7. Pre-flight, time-sync, test bed validity¶
Pre-flight checks (engine + a catalog of 5 checks) — before each run, the engine validates the lab state:
- ngfw-deploy-clean — no pending deploy on the DUT
- ngfw-decrypt-state-matches-plan — decryption state aligned with the plan
- ntp-source-configured — every component has NTP defined
- ngfw-ha-state-sane — HA pair in consistent state
- snapshot-fresh — last DUT API collection < threshold
Time-sync layer — clock of the entire lab synchronized:
- Script scripts/check-time-sync.sh measures skew between components; Prometheus alerts fire above threshold
- Cloner can act as NTP relay (stratum-2) when the data center's Internet is restricted
- Browser-clock fallback documented and implemented: admin UI at /admin/time-sync sends the operator's laptop time via POST /api/time-sync/set-from-browser (with acknowledgement: NOT_FORENSIC mandatory), returns the kubectl chronyc settime command for the operator to apply manually. Audit log marks forensic_grade: false automatically.
Test-bed validity proof — observability evolution shipped in 4 phases:
1. Deployment-aware thresholds + fleet readiness (PR-1+2, #173) — alerts that know which deploy mode (single/dual/tri/multi) is active
2. Topology-aware causal correlation (PR-3, #177) — correlates metrics by the nodes/personas/agents that actually talk to each other
3. SLO + multi-window multi-burn-rate alerts + anomaly detection + Tempo (PR-4, #179)
4. Always-visible license banner in Dashboard, Grafana, Prometheus (#178)
8. Compliance, forensic, IP protection¶
Trademarked brand and official tagline:
- Name: TLSStress.Art (™ — registration in progress)
- Tagline: Web Agent Cluster for NGFW TLS Inspection performance test
License: PolyForm Noncommercial 1.0.0 + Appendix A — noncommercial use only, and Appendix A restricts the eligible audience to Cisco employees (and its subsidiaries) and pre-/post-sales engineers of Cisco-certified partners. Details in LICENSE and USAGE_POLICY.md.
License Acceptance Modal — first login on the Dashboard intercepts the operator with a modal requiring acceptance of the license + USAGE_POLICY. Acceptance writes a row in audit_log (visible at /admin/audit/license-acceptances). Privacy policy in PRIVACY_POLICY.md, audit policy in AUDIT_LOG.md.
Forensic infrastructure (IP_PROTECTION.md):
- FINGERPRINT_REGISTRY in a private branch (private/forensic) — intentional fingerprints in the code that prove authorship if a clone appears
- Asset hashes manifest signed with each release
- Tamper-check workflow runs on PRs and on main, comparing hashes against the manifest
- Roadmap: migrate FINGERPRINT_REGISTRY to a separate repo to avoid leakage via the Access Broker
TLS Decrypt Mode Verification (#180) — probe that cross-checks the issuer-cert observed on the connection against the expected decryption policy from the plan. Ensures the measurement reflects the NGFW's actual operating mode (without waiting for device logs).
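The probe's internals are not documented here, but the core cross-check reduces to comparing the issuer observed on the connection against the NGFW's resigning CA. A rough illustration (function and CN names are assumptions; in the real probe the issuer would come from the live connection, e.g. Node's `tls.TLSSocket#getPeerCertificate().issuer`):

```typescript
// Hypothetical sketch of the decrypt-mode cross-check — only the decision
// logic is modeled; fetching the live issuer cert is out of scope here.
type DecryptMode = "decryption-on" | "decryption-off";

// If the NGFW is resigning (decrypting), the leaf certificate's issuer is the
// ngfw-ca; otherwise the agent sees the original persona-ca (or a public CA).
export function observedDecryptMode(issuerCN: string, ngfwCaCN: string): DecryptMode {
  return issuerCN === ngfwCaCN ? "decryption-on" : "decryption-off";
}

// True when the plan's expected mode matches what the wire actually shows.
export function planMatchesObservation(
  expected: DecryptMode,
  issuerCN: string,
  ngfwCaCN: string,
): boolean {
  return observedDecryptMode(issuerCN, ngfwCaCN) === expected;
}

console.log(planMatchesObservation("decryption-on", "ngfw-ca", "ngfw-ca")); // true
console.log(planMatchesObservation("decryption-on", "persona-ca", "ngfw-ca")); // false
```

Because the check reads the certificate on the wire, it confirms the DUT's actual operating mode without waiting for device logs.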
9. Operator onboarding¶
Five chained docs to get an operator from zero to first run:
Access Request → Clone → Install (Runbook)
[ACCESS_REQUEST] [CLONE_FOR_INSTALL] [RUNBOOK_FIRST_INSTALL]
│
└── alternate: [AIRGAP_INSTALL]
Maintainer-only setup (one-time): [PRIVATE_REPO_SETUP]
- Access Broker (ACCESS_REQUEST.md) — issue template + GitHub Action approve/deny flow. Auto-list of Cisco-controlled domains (cisco.com, meraki.com, duo.com, webex.com, splunk.com, thousandeyes.com); manual flow with partner ID for certified partners.
- Clone (CLONE_FOR_INSTALL.md) — 4 authenticated `git clone` options (HTTPS+PAT, SSH key, GitHub CLI, signed tarball).
- Runbook First Install (RUNBOOK_FIRST_INSTALL.md) — 4-step protocol: lab → DUT → smoke → measure. Time budget: ~2 h.
- Air-gapped install (AIRGAP_INSTALL.md) — alternative for data centers without Internet (regulated / classified).
- Private repo setup (PRIVATE_REPO_SETUP.md) — maintainer-only setup (one-time, by the repo owner).
10. Tuning applied at every layer¶
So that results reflect the firewall's real limit — not test bed limitations — every component of the platform was tuned with specific parameters. Automated scripts apply the tuning reproducibly.
| Component | Parameters | Impact |
|---|---|---|
| Linux Ubuntu host (DaemonSet node-tuning) | rmem_max/wmem_max = 64 MB; TCP BBR + FQ; tcp_max_syn_backlog=65535; vm.swappiness=5; CPU governor performance; THP madvise; nf_conntrack_max=2M; tcp_orphan_retries=2 | Critical for QUIC/HTTP3: large UDP buffers prevent drops; BBR improves throughput in networks with RTT variation |
| Caddy webserver (per persona pod) | GOMAXPROCS via resourceFieldRef; GOMEMLIMIT=460MiB; GOGC=200; Guaranteed QoS (CPU 2/2, Mem 512Mi/512Mi) | Go uses all available cores without over-commit; less frequent GC reduces tail latency |
| Nexus 9000 switch (automated script) | EEE off, flow control off, MTU 9216 (jumbo frames), QoS DSCP AF41, ECMP hash with UDP port, ARP timeout 300s | Jumbo frames increase throughput; disabling EEE eliminates added latency; ECMP hashing on UDP port distributes QUIC correctly |
| browser-engine + synthetic-load agents | NODE_EXTRA_CA_CERTS / SSL_CERT_FILE pointing to persona-ca; GOMAXPROCS/GOMEMLIMIT on k6; resources right-sized; topology spread; CYCLE_CONCURRENCY=3 | Certificates trusted without warnings; UDP 443 open; maximum core occupancy without contention |
Detailed guides: docs/PERFORMANCE_TUNING_HOST.md, docs/NEXUS9K_TUNING.md, plus the artifacts k8s/85-node-tuning.yaml + scripts/nexus/0[1-3]-*.nxos.
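As a rough illustration of how the host parameters above map to sysctl keys, here is a hypothetical excerpt (key values follow the table; this is not the actual k8s/85-node-tuning.yaml):

```
# Hypothetical sketch — illustrative excerpt, not the real k8s/85-node-tuning.yaml
net.core.rmem_max: "67108864"              # 64 MB receive buffer (QUIC/UDP drop prevention)
net.core.wmem_max: "67108864"              # 64 MB send buffer
net.ipv4.tcp_congestion_control: "bbr"     # BBR congestion control
net.core.default_qdisc: "fq"               # FQ qdisc (required pairing for BBR)
net.ipv4.tcp_max_syn_backlog: "65535"      # survive connection-rate (CPS) tests
vm.swappiness: "5"                          # keep agent/persona memory resident
```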
The full setup is designed to be low-friction: ./scripts/k8s-dut-up.sh up brings up the entire platform in automatic phases.
11. Network architecture¶
Ubuntu Server (k3s)
├── eth0 → OOBI network (mgmt, metrics, dashboard, control, Service VIPs)
└── eth1 → Nexus 9000 trunk (data VLANs)
├── VLAN 20 (172.16.0.0/16) → browser-engine agents (dut-pw)
├── VLAN 30 (172.17.0.0/16) → synthetic-load agents (dut-k6)
├── VLAN 40 (DHCP from ISP) → Cloner ISP egress (Internet uplink, outside our scheme)
├── VLAN 99 (192.168.90.0/24) → OOBI mgmt (dut-mgmt, SNMP, syslog, NTP)
│
│ 20 Synthetic VLANs (101–120) — one per Caddy persona:
├── VLAN 101–120 (10.1.x.0/27) → shop, news, blog, docs, gallery, stream, download,
│ edu, gov, cdn, api-rest, api-graphql, chat, webhook,
│ telemetry, ads, har-saas, har-social, har-webmail, har-media
│
│ 10 Cloned VLANs (200–209) — dynamic slots filled by the Cloner:
└── VLAN 200–209 (10.2.x.0/27) → cloned-1 .. cloned-10
│
FIREWALL (DUT) ←── inspects TLS on all VLANs
│
4 telemetry pillars collect evidence in parallel
(SNMP + Syslog + DUT API + Tempo)
Why strict separation between OOBI ↔ data plane?
- The data plane (VLANs 20, 30, 40, 101–120, 200–209) carries test traffic. Any mgmt traffic here contaminates per-cycle metrics.
- The OOBI (VLAN 99 / 192.168.90.0/24) is exclusive to mgmt — Prometheus scrape, SNMP, syslog, NTP, kubectl, dashboard. The syslog-oobi-only + syslog-deny-data-plane policies are enforced at the cluster level by two complementary NetworkPolicy resources.
- Service VIPs in OOBI (.50–.69) — Promtail :514, NTP relay :123, Dashboard :3000, Prometheus :9090, Grafana :3001, Loki :3100, SNMP-Exporter, Alertmanager, Tempo. DUT devices point to fixed IPs, not to NodePorts on a specific UCS.
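A minimal sketch of the OOBI-only idea — namespace, labels, and policy name here are assumptions; the real pair of policies lives in the cluster manifests:

```
# Hypothetical sketch of a syslog-oobi-only NetworkPolicy — selectors are illustrative
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: syslog-oobi-only
  namespace: observability
spec:
  podSelector:
    matchLabels:
      app: promtail
  policyTypes: [Ingress]
  ingress:
    - from:
        - ipBlock:
            cidr: 192.168.90.0/24   # OOBI VLAN 99 only — data-plane sources are denied
      ports:
        - protocol: UDP
          port: 514
```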
12. 4 deployment modes¶
| Mode | When to use | Guide |
|---|---|---|
| Single-node (1 UCS) | Initial evaluation; limited hardware | UBUNTU_K3S_SINGLENODE |
| Dual-node (2 UCS) | UCS-1 = agents; UCS-2 = personas + services + observability | UBUNTU_K3S_DUALNODE |
| Tri-node (3 UCS) | UCS-1 = browser engine only; UCS-2 = k6 only; UCS-3 = personas + services + obs | UBUNTU_K3S_TRINODE |
| Multi-node (4 dedicated UCS) | UCS-1 = 30 personas · UCS-2 = browser engine · UCS-3 = k6 · UCS-4 = Dashboard/Postgres/Grafana/Cloner. Maximum throughput, no contention | UBUNTU_K3S_MULTINODE |
The dual/tri/multi modes use Kustomize overlays under overlays/{dual-node,tri-node,multi-node}/ that apply a nodeSelector to each Deployment/StatefulSet. These overlays let the browser engine scale up to 300 instances and k6 up to 1,000 instances without sharing CPU/memory with the persona webservers.
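As an illustration of the overlay mechanism, a hypothetical dual-node Kustomization might pin the browser-engine Deployment to UCS-1 like this (resource and node names are assumptions, not the repo's actual overlays):

```
# Hypothetical sketch — not the actual overlays/dual-node/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - target:
      kind: Deployment
      name: browser-engine
    patch: |-
      - op: add
        path: /spec/template/spec/nodeSelector
        value:
          kubernetes.io/hostname: ucs-1   # agents pinned to UCS-1
```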
13. Operations dashboard¶
Web cockpit (Next.js) at /. Main pages:
- Overview — status of the 30 personas in real time (running, paused, error)
- Individual personas — start/pause/deprovision; configure mock-persona response routes (inline YAML via API)
- /admin/dut-api — register DUT devices, test connectivity, trigger snapshot, run pre-flight
- /admin/time-sync — admin UI for the browser-clock fallback flow (with mandatory NOT_FORENSIC acknowledgement)
- /admin/audit/license-acceptances — viewer of license acceptances (forensic audit)
- Test Plans — pick a plan, trigger a run, monitor progress
- Test Runs — history, link to HTML report (Phase 1)
- Grafana metrics — embedded per persona and per protocol (HTTP/2 vs HTTP/3 vs HTTP/3-fallback-TCP)
Summary in one sentence¶
TLSStress.Art is an open-source test bench (PolyForm Noncommercial, Cisco audience) that fills the gap between free tools without scale (TRex, iPerf) and commercial appliances costing up to US$ 500K (Spirent, Ixia) — simulating up to 30 realistic sites (20 Synthetic + 10 operator-Cloned slots) with browser-engine/synthetic-load against the NGFW under test, integrating 4 telemetry pillars (hardware SNMP + syslog correlation + DUT API REST multi-vendor for FTD/Nexus/UCS/Fortinet + OpenTelemetry tracing), with 15 pre-configured test plans, forensically-signed HTML/PDF Test Run Reports, License Acceptance Modal + audit log + tamper-check, and everything tuned from the Linux host to the Nexus switch so that the measured bottleneck is always the NGFW — never the test bed.