TLSStress.Art — What it does¶
Read in your language: English · Português · Español
Scope status (post-Scope-Freeze 2026-05-10) — see ARCHITECTURE.md for the canonical 37 MÓDULOs + 7 Test Kinds + DOM/CPOS/PIE-PA safety architecture. ADRs 0014 and 0019–0025 cover post-Freeze additions.

Web Agent Cluster for NGFW TLS Inspection performance test — an open-source test bench that measures the actual TLS inspection capacity of Next-Generation Firewalls (NGFWs) under HTTP/2 and HTTP/3 load.
Author: André Luiz Gallon · License: PolyForm Noncommercial 1.0.0 + Appendix A · Audience: Cisco employees and certified partners
1. Why this software exists¶
When a company needs to test the TLS inspection capacity of a corporate firewall, the market offers two extremes; this project sits between them:
| Category | Examples | Limitation |
|---|---|---|
| Free tools | TRex, iPerf3, wrk, Locust | Do not generate realistic TLS traffic at scale — raw packets or simple HTTP, no JavaScript, no full TLS handshake, no HTTP/3 |
| Commercial appliances | Spirent CyberFlood, Ixia BreakingPoint, IXIA IxLoad | US$ 50K–500K per chassis; proprietary hardware; hard to automate in CI/CD |
| TLSStress.Art | This project (open-source, PolyForm Noncommercial) | Runs on commodity hardware (Ubuntu + k3s); real browser (browser engine) + synthetic load (k6); native HTTP/2 and HTTP/3; no license cost for the eligible audience |
TLSStress.Art fills this gap: it provides TLS inspection testing with production-level realism — real browsers, JavaScript, cookies, valid certificates — at no license cost and with full automation via Kubernetes and GitHub Actions.
The question the product answers: how many simultaneous TLS connections can the firewall handle before performance degrades — and what is the comparative impact between HTTP/2 and HTTP/3?
2. How it works — 30-second view¶
┌─────────────────────────┐
│ Load robots │ browser engine (real browser) + k6 (HTTP)
│ up to 300 browser + 1,000 k6 │ generate TLS traffic against the personas
└──────────┬──────────────┘
│ Leg 1: encrypted HTTPS (robots trust the firewall cert)
▼
┌─────────────────────────┐
│ FIREWALL (DUT) │ Appliance under test — decrypts, inspects,
│ — under test, measured │ re-encrypts every packet
└──────────┬──────────────┘
│ Leg 2: re-encrypted HTTPS (firewall talks to persona)
▼
┌─────────────────────────┐
│ 30 Caddy personas │ 20 Synthetic (always-on) + 10 Cloned slots
│ 20 Synthetic + 10 │ (operator picks public sites to clone)
│ Cloned slots │
└─────────────────────────┘
┌─── Telemetry collected in parallel (4 pillars) ─────┐
▼ ▼
SNMP exporter Promtail/Loki DUT API REST OpenTelemetry → Tempo
(CPU/mem/IF of (events from Nexus, (config + state (distributed tracing
Nexus + DUT) NGFW, UCS hosts) of FTD/Nexus/ opt-in in agents)
UCS/Fortinet)
Leg 1 — robots open HTTPS connections to persona addresses. The firewall intercepts and presents its own certificate (operator configures agents to trust ngfw-ca).
Leg 2 — the firewall opens a second HTTPS connection to the destination persona. Certificate is issued by cert-manager (CA persona-ca).
4 telemetry pillars collect evidence simultaneously — metrics (Prometheus + SNMP), events (syslog correlation via Loki), device state (DUT API REST), and traces (Tempo) — answering not only "what was the p99" but also "what happened on the DUT when p99 spiked".
3. The 30 personas¶
The persona fleet is split into two groups:
20 Synthetic (always-on)¶
Defined in personas.yaml, generated from Caddy templates. Each persona runs in a dedicated Kubernetes namespace, with TLS certificate, fixed IP, exclusive VLAN (101–120), and dedicated DNS entry.
Distributed across 4 archetypes:
| Archetype | Examples | How traffic is generated |
|---|---|---|
| skin | CDN, blog, portal, news, gov, edu, gallery, stream, download, docs | Synthetic HTML/CSS/JS; high-throughput static assets |
| mock | api-rest, api-graphql, chat, webhook, telemetry, ads | YAML-configurable JSON/XML; simulates microservices |
| har-replay | har-saas, har-social, har-webmail, har-media | Replays HAR (HTTP Archive) recordings of real sessions |
| real-app | shop (Saleor, full e-commerce) | Real application with database, state, dynamic JavaScript |
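As an illustration of how a Synthetic persona might be declared, here is a hypothetical sketch of one entry — the field names are assumptions, not the actual personas.yaml schema:

```
# Hypothetical sketch — field names are illustrative, not the real personas.yaml schema
- name: news
  archetype: skin
  vlan: 104                     # one exclusive VLAN per persona (101–120)
  namespace: persona-news       # dedicated Kubernetes namespace
  dns: news.persona.internal    # dedicated DNS entry
  tls:
    issuer: persona-ca          # certificate issued by cert-manager
```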
10 Cloned slots (filled by the operator)¶
VLANs 200–209. Operator picks a public site (e.g., globo.com, cnn.com), the Cloner crawls + snapshots the static content, and the replica then serves locally as a "real" persona on *.persona.internal.
The Cloner uses three distinct network interfaces: OOBI (mgmt), VLAN 40 (DHCP from the customer's ISP, to download public Internet content), and a macvlan VLAN inside the cluster (to serve cloned content to the other personas/agents).
4. The 4 telemetry pillars¶
The product does not measure latency only. To answer why a metric changed — a prerequisite for an NGFW test to be valid — we collect evidence from four sources simultaneously:
| Pillar | Source | What it captures | Endpoint |
|---|---|---|---|
| 1. Metrics (SNMP + Caddy + agents) | SNMP exporter, Caddy `/metrics`, browser-engine/synthetic-load | CPU, memory, and interface counters of switch and firewall; req/s, bytes, active connections, QUIC vs TCP on personas; latency, throughput, error rate, TLS handshake on agents | Prometheus → Grafana |
| 2. Events (syslog correlation) | Promtail (NodePort 30514) → Loki | Events from Nexus 9000, NGFW DUT, UCS hosts. OOBI-only policy (NEVER on the data plane). Ingestion enforced by two cluster-level NetworkPolicy resources | Loki → Grafana Explore + "Syslog Correlation" dashboard |
| 3. DUT state (REST API) | Polling worker in Dashboard pod | For each registered DUT: version, NTP config, SSL/TLS inspection policy, HA state, LLDP/CDP neighbors, hardware inventory. Snapshots signed with SHA-256 (forensic chain of custody) | DUT API REST adapters (FTD, Nexus, UCS, Fortinet) → Postgres `dut_api_snapshots` |
| 4. Traces (OpenTelemetry → Tempo) | Opt-in SDK in browser-engine + synthetic-load | Distributed tracing per test cycle, with spans for each TLS handshake. Covers the end-to-end path: agent → NGFW → persona | Tempo → Grafana → Dashboard |
The combination of all four enables answers like: "p99 latency spiked at 14:23 → confirmed in syslog: NGFW logged high CPU at 14:22 → confirmed in DUT API: decryption policy was changed by another operator 4 minutes earlier → trace shows handshake taking 380ms vs 95ms baseline".
5. DUT API — multi-vendor integration¶
Adapter pattern with 4 vendors shipped (Palo Alto on roadmap):
| Vendor | Adapter | Auth | Status |
|---|---|---|---|
| Cisco FTD (FDM-managed) | `cisco-ftd.ts` | OAuth2 → Bearer | ✅ Shipped |
| Cisco Nexus 9000 | `cisco-nexus.ts` | NX-API cookie (APIC-cookie) | ✅ Shipped |
| Cisco UCS C-Series CIMC | `cisco-ucs-cimc.ts` | Basic Auth (Redfish) | ✅ Shipped |
| Fortinet FortiGate | `fortinet-fortigate.ts` | API key Bearer (FortiOS REST v2) | ✅ Shipped |
| Palo Alto (PAN-OS) | — | API key (XML API) | 📋 Roadmap |
Operator workflow:
1. UI at /admin/dut-api registers a device (URL, credentials — encrypted with AES-256-GCM)
2. Polling worker in Dashboard collects snapshots every N minutes (configurable; default 5min)
3. Each snapshot becomes a row in dut_api_snapshots with JSON payload + SHA-256 of the canonical form
4. Snapshots show up in Grafana dashboards ("DUT Live State") and (in the roadmap) in Annexes B/C/D of the Test Run Report
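A minimal sketch of step 3. The product's exact canonicalization rules are not documented here, so this assumes a simple sorted-key, whitespace-free JSON canonical form; the function names are illustrative:

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch: canonicalize a snapshot payload (keys sorted, no
// whitespace) so that semantically equal payloads hash identically.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const entries = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalize(obj[k])}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

// SHA-256 over the canonical form — the value stored next to the JSON payload.
export function snapshotHash(payload: unknown): string {
  return createHash("sha256").update(canonicalize(payload)).digest("hex");
}

// Key order no longer affects the hash:
console.log(snapshotHash({ b: 1, a: 2 }) === snapshotHash({ a: 2, b: 1 })); // true
```

Hashing a canonical form (rather than the raw JSON string) is what makes the forensic chain robust: re-serializing the same device state always yields the same digest.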
Catalog of 45 features mapped per vendor — which adapter exposes NTP config, which exposes HA state, etc. — in docs/API_FEATURE_CATALOG.{md,pt-BR,es}.md.
6. Test Plans + Test Run Reports¶
Test Plan engine — 15 pre-configured plans (catalog in docs/TEST_PLANS.{md,pt-BR,es}.md):
- Stable identifiers (string ID, not hash) for cross-engagement comparisons
- Validated by Zod schema, sync'd to Postgres on boot
- Each plan declares the expected NGFW state (e.g., ngfw_state_required: decryption-on)
- Plans cover: H2/H3 baseline, max-CPS handshake, sustained throughput, decryption-on vs decryption-off, cert failures, mixed protocol, etc.
Test Run Report — Phase 1 shipped (#185):
- Print-styled HTML (/runs/{id}/report) with cover, license page in 3 languages, executive summary, plan config, placeholder annexes
- License footer pinned on every page
- SHA-256 hashes of the full report + plan-snapshot + per-annex (forensic chain)
- Parallel JSON API: /api/test-runs/{id}/report.json
Roadmap of pending phases:
- Phase 2 — Puppeteer server-side render → real PDF (currently HTML print)
- Phase 3 — Annexes B/C/D wired (Nexus/NGFW/UCS snapshots embedded in the report)
- Phase 4 — Cosign signing + Rekor/Sigstore entry (public transparency log)
- Phase 5 — N-run comparison (run A vs run B), trend rollups (weekly/monthly), replay snapshot
7. Pre-flight, time-sync, test bed validity¶
Pre-flight checks (engine + a catalog of 5 checks) — before each run, the engine validates the lab state:
- ngfw-deploy-clean — no pending deploy on the DUT
- ngfw-decrypt-state-matches-plan — decryption state aligned with the plan
- ntp-source-configured — every component has NTP defined
- ngfw-ha-state-sane — HA pair in consistent state
- snapshot-fresh — last DUT API collection < threshold
Time-sync layer — clock of the entire lab synchronized:
- Script scripts/check-time-sync.sh measures skew between components; Prometheus alerts fire above threshold
- Cloner can act as NTP relay (stratum-2) when the data center's Internet is restricted
- Browser-clock fallback documented and implemented: admin UI at /admin/time-sync sends the operator's laptop time via POST /api/time-sync/set-from-browser (with acknowledgement: NOT_FORENSIC mandatory), returns the kubectl chronyc settime command for the operator to apply manually. Audit log marks forensic_grade: false automatically.
Test-bed validity proof — observability evolution shipped in 4 phases:
1. Deployment-aware thresholds + fleet readiness (PR-1+2, #173) — alerts that know which deploy mode (single/dual/tri/multi) is active
2. Topology-aware causal correlation (PR-3, #177) — correlates metrics by the nodes/personas/agents that actually talk to each other
3. SLO + multi-window multi-burn-rate alerts + anomaly detection + Tempo (PR-4, #179)
4. Always-visible license banner in Dashboard, Grafana, Prometheus (#178)
8. Compliance, forensic, IP protection¶
Trademarked brand and official tagline:
- Name: TLSStress.Art (™ — registration in progress)
- Tagline: Web Agent Cluster for NGFW TLS Inspection performance test
License: PolyForm Noncommercial 1.0.0 + Appendix A — noncommercial use only, and Appendix A restricts the eligible audience to Cisco employees (and its subsidiaries) and pre-/post-sales engineers of Cisco-certified partners. Details in LICENSE and USAGE_POLICY.md.
License Acceptance Modal — first login on the Dashboard intercepts the operator with a modal requiring acceptance of the license + USAGE_POLICY. Acceptance writes a row in audit_log (visible at /admin/audit/license-acceptances). Privacy policy in PRIVACY_POLICY.md, audit policy in AUDIT_LOG.md.
Forensic infrastructure (IP_PROTECTION.md):
- FINGERPRINT_REGISTRY in a private branch (private/forensic) — intentional fingerprints in the code that prove authorship if a clone appears
- Asset hashes manifest signed with each release
- Tamper-check workflow runs on PRs and on main, comparing hashes against the manifest
- Roadmap: migrate FINGERPRINT_REGISTRY to a separate repo to avoid leakage via the Access Broker
TLS Decrypt Mode Verification (#180) — probe that cross-checks the issuer-cert observed on the connection against the expected decryption policy from the plan. Ensures the measurement reflects the NGFW's actual operating mode (without waiting for device logs).
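The probe's internals are not documented here, but the core cross-check reduces to comparing the issuer observed on the connection against the NGFW's resigning CA. A rough illustration (function and CN names are assumptions; in the real probe the issuer would come from the live connection, e.g. Node's `tls.TLSSocket#getPeerCertificate().issuer`):

```typescript
// Hypothetical sketch of the decrypt-mode cross-check — only the decision
// logic is modeled; fetching the live issuer cert is out of scope here.
type DecryptMode = "decryption-on" | "decryption-off";

// If the NGFW is resigning (decrypting), the leaf certificate's issuer is the
// ngfw-ca; otherwise the agent sees the original persona-ca (or a public CA).
export function observedDecryptMode(issuerCN: string, ngfwCaCN: string): DecryptMode {
  return issuerCN === ngfwCaCN ? "decryption-on" : "decryption-off";
}

// True when the plan's expected mode matches what the wire actually shows.
export function planMatchesObservation(
  expected: DecryptMode,
  issuerCN: string,
  ngfwCaCN: string,
): boolean {
  return observedDecryptMode(issuerCN, ngfwCaCN) === expected;
}

console.log(planMatchesObservation("decryption-on", "ngfw-ca", "ngfw-ca")); // true
console.log(planMatchesObservation("decryption-on", "persona-ca", "ngfw-ca")); // false
```

Because the check reads the certificate on the wire, it confirms the DUT's actual operating mode without waiting for device logs.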
9. Operator onboarding¶
Five chained docs to get an operator from zero to first run:
Access Request → Clone → Install (Runbook)
[ACCESS_REQUEST] [CLONE_FOR_INSTALL] [RUNBOOK_FIRST_INSTALL]
│
└── alternate: [AIRGAP_INSTALL]
Maintainer-only setup (one-time): [PRIVATE_REPO_SETUP]
- Access Broker (ACCESS_REQUEST.md) — issue template + GitHub Action approve/deny flow. Auto-list of Cisco-controlled domains (cisco.com, meraki.com, duo.com, webex.com, splunk.com, thousandeyes.com); manual flow with partner ID for certified partners.
- Clone (CLONE_FOR_INSTALL.md) — 4 authenticated `git clone` options (HTTPS+PAT, SSH key, GitHub CLI, signed tarball).
- Runbook First Install (RUNBOOK_FIRST_INSTALL.md) — 4-step protocol: lab → DUT → smoke → measure. Time budget: ~2 h.
- Air-gapped install (AIRGAP_INSTALL.md) — alternative for data centers without Internet (regulated / classified).
- Private repo setup (PRIVATE_REPO_SETUP.md) — maintainer-only setup (one-time, by the repo owner).
10. Tuning applied at every layer¶
So that results reflect the firewall's real limit — not test bed limitations — every component of the platform was tuned with specific parameters. Automated scripts apply the tuning reproducibly.
| Component | Parameters | Impact |
|---|---|---|
| Linux Ubuntu host (DaemonSet node-tuning) | rmem_max/wmem_max = 64 MB; TCP BBR + FQ; tcp_max_syn_backlog=65535; vm.swappiness=5; CPU governor performance; THP madvise; nf_conntrack_max=2M; tcp_orphan_retries=2 | Critical for QUIC/HTTP3: large UDP buffers prevent drops; BBR improves throughput in networks with RTT variation |
| Caddy webserver (per persona pod) | GOMAXPROCS via resourceFieldRef; GOMEMLIMIT=460MiB; GOGC=200; Guaranteed QoS (CPU 2/2, Mem 512Mi/512Mi) | Go uses all available cores without over-commit; less frequent GC reduces tail latency |
| Nexus 9000 switch (automated script) | EEE off, flow control off, MTU 9216 (jumbo frames), QoS DSCP AF41, ECMP hash with UDP port, ARP timeout 300s | Jumbo frames increase throughput; disabling EEE eliminates added latency; ECMP hashing on UDP port distributes QUIC correctly |
| browser-engine + synthetic-load agents | NODE_EXTRA_CA_CERTS / SSL_CERT_FILE pointing to persona-ca; GOMAXPROCS/GOMEMLIMIT on k6; resources right-sized; topology spread; CYCLE_CONCURRENCY=3 | Certificates trusted without warnings; UDP 443 open; maximum core occupancy without contention |
Detailed guides: docs/PERFORMANCE_TUNING_HOST.md, docs/NEXUS9K_TUNING.md, plus the artifacts k8s/85-node-tuning.yaml + scripts/nexus/0[1-3]-*.nxos.
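As a rough illustration of how the host parameters above map to sysctl keys, here is a hypothetical excerpt (key values follow the table; this is not the actual k8s/85-node-tuning.yaml):

```
# Hypothetical sketch — illustrative excerpt, not the real k8s/85-node-tuning.yaml
net.core.rmem_max: "67108864"              # 64 MB receive buffer (QUIC/UDP drop prevention)
net.core.wmem_max: "67108864"              # 64 MB send buffer
net.ipv4.tcp_congestion_control: "bbr"     # BBR congestion control
net.core.default_qdisc: "fq"               # FQ qdisc (required pairing for BBR)
net.ipv4.tcp_max_syn_backlog: "65535"      # survive connection-rate (CPS) tests
vm.swappiness: "5"                          # keep agent/persona memory resident
```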
The full setup is designed to be low-friction: ./scripts/k8s-dut-up.sh up brings up the entire platform in automatic phases.
11. Network architecture¶
Ubuntu Server (k3s)
├── eth0 → OOBI network (mgmt, metrics, dashboard, control, Service VIPs)
└── eth1 → Nexus 9000 trunk (data VLANs)
├── VLAN 20 (172.16.0.0/16) → browser-engine agents (dut-pw)
├── VLAN 30 (172.17.0.0/16) → synthetic-load agents (dut-k6)
├── VLAN 40 (DHCP from ISP) → Cloner ISP egress (Internet uplink, outside our scheme)
├── VLAN 99 (192.168.90.0/24) → OOBI mgmt (dut-mgmt, SNMP, syslog, NTP)
│
│ 20 Synthetic VLANs (101–120) — one per Caddy persona:
├── VLAN 101–120 (10.1.x.0/27) → shop, news, blog, docs, gallery, stream, download,
│ edu, gov, cdn, api-rest, api-graphql, chat, webhook,
│ telemetry, ads, har-saas, har-social, har-webmail, har-media
│
│ 10 Cloned VLANs (200–209) — dynamic slots filled by the Cloner:
└── VLAN 200–209 (10.2.x.0/27) → cloned-1 .. cloned-10
│
FIREWALL (DUT) ←── inspects TLS on all VLANs
│
4 telemetry pillars collect evidence in parallel
(SNMP + Syslog + DUT API + Tempo)
Why strict separation between OOBI ↔ data plane?
- The data plane (VLANs 20, 30, 40, 101–120, 200–209) carries test traffic. Any mgmt traffic here contaminates per-cycle metrics.
- The OOBI (VLAN 99 / 192.168.90.0/24) is exclusive to mgmt — Prometheus scrape, SNMP, syslog, NTP, kubectl, dashboard. The syslog-oobi-only + syslog-deny-data-plane policies are enforced at the cluster level by two complementary NetworkPolicy resources.
- Service VIPs in OOBI (.50–.69) — Promtail :514, NTP relay :123, Dashboard :3000, Prometheus :9090, Grafana :3001, Loki :3100, SNMP-Exporter, Alertmanager, Tempo. DUT devices point to fixed IPs, not to NodePorts on a specific UCS.
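A minimal sketch of the OOBI-only idea — namespace, labels, and policy name here are assumptions; the real pair of policies lives in the cluster manifests:

```
# Hypothetical sketch of a syslog-oobi-only NetworkPolicy — selectors are illustrative
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: syslog-oobi-only
  namespace: observability
spec:
  podSelector:
    matchLabels:
      app: promtail
  policyTypes: [Ingress]
  ingress:
    - from:
        - ipBlock:
            cidr: 192.168.90.0/24   # OOBI VLAN 99 only — data-plane sources are denied
      ports:
        - protocol: UDP
          port: 514
```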
12. 4 deployment modes¶
| Mode | When to use | Guide |
|---|---|---|
| Single-node (1 UCS) | Initial evaluation; limited hardware | UBUNTU_K3S_SINGLENODE |
| Dual-node (2 UCS) | UCS-1 = agents; UCS-2 = personas + services + observability | UBUNTU_K3S_DUALNODE |
| Tri-node (3 UCS) | UCS-1 = browser engine only; UCS-2 = k6 only; UCS-3 = personas + services + obs | UBUNTU_K3S_TRINODE |
| Multi-node (4 dedicated UCS) | UCS-1 = 30 personas · UCS-2 = browser engine · UCS-3 = k6 · UCS-4 = Dashboard/Postgres/Grafana/Cloner. Maximum throughput, no contention | UBUNTU_K3S_MULTINODE |
The dual/tri/multi modes use Kustomize overlays under overlays/{dual-node,tri-node,multi-node}/ that apply a nodeSelector to each Deployment/StatefulSet. These overlays let the browser engine scale up to 300 instances and k6 up to 1,000 instances without sharing CPU/memory with the persona webservers.
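As an illustration of the overlay mechanism, a hypothetical dual-node Kustomization might pin the browser-engine Deployment to UCS-1 like this (resource and node names are assumptions, not the repo's actual overlays):

```
# Hypothetical sketch — not the actual overlays/dual-node/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - target:
      kind: Deployment
      name: browser-engine
    patch: |-
      - op: add
        path: /spec/template/spec/nodeSelector
        value:
          kubernetes.io/hostname: ucs-1   # agents pinned to UCS-1
```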
13. Operations dashboard¶
Web cockpit (Next.js) at /. Main pages:
- Overview — status of the 30 personas in real time (running, paused, error)
- Individual personas — start/pause/deprovision; configure mock-persona response routes (inline YAML via API)
- /admin/dut-api — register DUT devices, test connectivity, trigger snapshot, run pre-flight
- /admin/time-sync — admin UI for the browser-clock fallback flow (with mandatory NOT_FORENSIC acknowledgement)
- /admin/audit/license-acceptances — viewer of license acceptances (forensic audit)
- Test Plans — pick a plan, trigger a run, monitor progress
- Test Runs — history, link to HTML report (Phase 1)
- Grafana metrics — embedded per persona and per protocol (HTTP/2 vs HTTP/3 vs HTTP/3-fallback-TCP)
Summary in one sentence¶
TLSStress.Art is an open-source test bench (PolyForm Noncommercial, Cisco audience) that fills the gap between free tools without scale (TRex, iPerf) and commercial appliances costing up to US$ 500K (Spirent, Ixia) — simulating up to 30 realistic sites (20 Synthetic + 10 operator-Cloned slots) with browser-engine/synthetic-load against the NGFW under test, integrating 4 telemetry pillars (hardware SNMP + syslog correlation + DUT API REST multi-vendor for FTD/Nexus/UCS/Fortinet + OpenTelemetry tracing), with 15 pre-configured test plans, forensically-signed HTML/PDF Test Run Reports, License Acceptance Modal + audit log + tamper-check, and everything tuned from the Linux host to the Nexus switch so that the measured bottleneck is always the NGFW — never the test bed.