TLSStress.Art: System High-Level Design (HLD)
Audience: architects, operators, security reviewers, and investors who need the whole-system picture in one document. This HLD sits above the Architecture overview (component tour) and the Architecture Decision Records (per-decision rationale). Where this document summarises, those documents are authoritative on detail.
Companion deep-dives: F2 — Cell-0 (Hetzner) architecture, ZTP-prem security posture, ADR-0053 (cell-based hyperscale).
1. Purpose & scope
TLSStress.Art is a test-bed for measuring TLS-decryption performance on Next-Generation Firewalls (NGFWs) acting as the Device Under Test (DUT). The primary target is HTTP/3 (QUIC over UDP/443) — every modern NGFW claims QUIC inspection — with HTTP/2 (TCP/443) measured secondarily for comparison.
The product has two faces:
| Face | What it is | Where it runs |
|---|---|---|
| SaaS control plane | Signup, billing, token (TSU) ledger, auto-provisioning, central enrollment | AWS (App Runner + RDS) · Cloudflare (edge/DNS) · Hetzner cell-0 (K3s) |
| On-prem appliance (TBI) | The bench itself — agents, personas, dashboard, observability — installed next to the customer's NGFW | Customer datacenter / lab (K3s/RKE2 or Docker Desktop dev) |
The bench drives load from agents (Playwright + k6) through the NGFW, which decrypts, inspects, and re-signs TLS, and on to the webservers (the Personas). By comparing handshake timing and throughput with decryption ON vs BYPASS, the operator quantifies the NGFW's TLS-inspection cost.
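The ON vs BYPASS comparison reduces to simple arithmetic over two sample sets. The sketch below is illustrative only: the function name, the choice of the median as the summary statistic, and the sample values are assumptions, not the bench's actual reporting logic.

```python
from statistics import median


def inspection_overhead(decrypt_ms: list[float], bypass_ms: list[float]) -> dict[str, float]:
    """Compare handshake latency samples taken with decryption ON vs BYPASS."""
    if not decrypt_ms or not bypass_ms:
        raise ValueError("both sample sets must be non-empty")
    on = median(decrypt_ms)    # typical handshake time through the inspected path
    off = median(bypass_ms)    # typical handshake time with inspection bypassed
    if off <= 0:
        raise ValueError("bypass median must be positive")
    return {
        "decrypt_median_ms": on,
        "bypass_median_ms": off,
        "added_ms": on - off,                       # absolute cost of inspection
        "overhead_pct": (on - off) / off * 100.0,   # relative cost of inspection
    }


# Illustrative numbers only, not measurements from any device.
result = inspection_overhead([42.0, 45.0, 48.0], [30.0, 30.0, 33.0])
```

The same shape applies to throughput, with the sign reversed: there the interesting figure is how much is lost with decryption ON.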
2. System context: actors
| Actor | Role | Touchpoint |
|---|---|---|
| Customer | Buys a package on the SaaS, receives a provisioned tenant + a downloadable on-prem appliance (TBI) | app.tlsstress.art (signup/pay), TBI download (S3/F1) |
| Operator | Runs the bench against a real NGFW. The Dashboard is the ONLY operator interface: all config changes flow through the UI (never kubectl edit / curl admin/ / exec) | On-prem Dashboard (Next.js) |
| NGFW DUT | The device being measured. L3 default gateway for the test-plane VLANs; performs TLS decryption | Physical/virtual firewall on 802.1q trunks |
flowchart LR
CUST(["Customer"])
OPER(["Operator"])
subgraph SAAS["SaaS control plane (always-online)"]
APP["customer-app<br/>App Runner · Next.js<br/>signup · Stripe · TSU mint"]
CELL["cell-0 (Hetzner K3s)<br/>Provisioning Orchestrator<br/>Temporal saga · tenant RLS"]
TBI[("TBI artifacts<br/>S3 · img/qcow2/oci")]
end
subgraph PREM["On-prem appliance (customer datacenter)"]
DASH["Dashboard (Next.js)<br/>the ONLY operator UI"]
AGENTS["Agents<br/>Playwright + k6"]
PERS["Personas<br/>100 webservers (Caddy)"]
OBS["Observability<br/>Prometheus + Grafana"]
end
DUT{{"NGFW DUT<br/>TLS decrypt · inspect · re-sign"}}
CUST -->|"signup / pay"| APP
APP -->|"applyTier (mint quota)"| APP
APP -->|"enqueueOnboarding (HMAC)"| CELL
CUST -->|"download appliance"| TBI
TBI -.->|"dd to USB / KVM / boot"| PREM
OPER -->|"login · configure"| DASH
DASH --> AGENTS
DASH --> PERS
DASH --> OBS
AGENTS ==>|"TLS leg 1"| DUT
DUT ==>|"TLS leg 2"| PERS
AGENTS -.->|"TSU spend events"| DASH
DASH -.->|"usage report (hourly)"| APP
classDef saas fill:#0F1A2E,stroke:#00D8FC,color:#FCFCFC,stroke-width:2px
classDef prem fill:#f0fdf4,stroke:#16a34a,stroke-width:1px
classDef dut fill:#fef9c3,stroke:#b45309,stroke-width:2px
class APP,CELL,TBI saas
class DASH,AGENTS,PERS,OBS prem
class DUT dut
3. Container view (C4-ish)
The next level zooms into the deployable containers and their protocols. The bench's control plane stays on-prem (Dashboard ↔ agents over the OOBI bridge), while the billing/provisioning control plane lives in the cloud (customer-app ↔ cell-0). The only on-prem→cloud channel is the hourly TSU usage report.
graph TB
CUST(["Customer<br/>browser"])
OPER(["Operator<br/>browser"])
subgraph CLOUD["Cloud SaaS control plane"]
direction TB
APP["customer-app<br/>(App Runner, Next.js)"]
RDS[("RDS · UTXO ledger<br/>mint @ checkout")]
STRIPE{{"Stripe"}}
PROV["Provisioning Orchestrator<br/>(cell-0, Go + Temporal)"]
CELLDB[("Cell control DB<br/>tenants · slots (RLS)")]
end
subgraph ONPREM["On-prem appliance"]
direction TB
DASH["Dashboard<br/>(Next.js · Drizzle)<br/>license authorize() gate"]
PG[("Postgres 16<br/>runs · agents · audit_log<br/>UTXO notes · idempotency")]
PWA["Playwright agents<br/>(1..80, HPA)"]
K6A["k6 agents<br/>(1..200, HPA)"]
PERS["Personas (Caddy)<br/>100 / 20 countries"]
VYOS["MÓDULO ISP/BGP<br/>(VyOS + FRR)"]
PROM["Prometheus + Grafana"]
BOOT["bootstrap-controller<br/>(TSU spend drain)"]
end
DUT{{"NGFW DUT"}}
CUST -->|"HTTPS"| APP
STRIPE -->|"checkout.session.completed"| APP
APP -->|"applyTier mint"| RDS
APP -->|"HMAC POST /trigger"| PROV
PROV -->|"saga"| CELLDB
OPER -->|"HMAC cookie / Basic"| DASH
DASH -->|"SQL"| PG
DASH <-->|"Bearer + Idempotency-Key"| PWA
DASH <-->|"Bearer"| K6A
DASH -->|"licence note spend"| PG
PWA ==>|"TLS leg 1 · h2/h3"| DUT
K6A ==>|"TLS leg 1 · h2/h3"| DUT
DUT ==>|"TLS leg 2 (re-signed)"| VYOS --> PERS
PROM -->|"scrape /api/metrics · SNMP"| DASH
BOOT -.->|"hourly POST /api/usage/report"| APP
classDef cloud fill:#0F1A2E,stroke:#00D8FC,color:#FCFCFC,stroke-width:2px
classDef prem fill:#f0fdf4,stroke:#16a34a,stroke-width:1px
classDef dut fill:#fef9c3,stroke:#b45309,stroke-width:2px
class APP,RDS,STRIPE,PROV,CELLDB cloud
class DASH,PG,PWA,K6A,PERS,VYOS,PROM,BOOT prem
class DUT dut
4. Major components & responsibilities
4.1 On-prem appliance (the bench)
| Component | Code | Responsibility |
|---|---|---|
| Dashboard | dashboard/ | Next.js 15 App Router. The only operator UI. HMAC-cookie auth, audit_log of every mutation, SSE live stream, Prometheus /api/metrics, the license authorize() gate (dashboard/src/lib/license/). |
| Playwright agents | agent/ | Headless Chromium workers. Fresh BrowserContext per cycle, capture protocol / tcp_sockets_open / web-vitals. Per-target circuit breaker, idempotency keys. HPA min=1, max=80. |
| k6 agents | k6-agent/ | Node.js wrapper around the grafana/k6 binary. Emits p50/p95/p99, error_rate, rps, data_received. readOnlyRootFilesystem, memory-backed /tmp. HPA min=1, max=200. |
| Personas | personas/, webserver/ | 100 Caddy webservers across 20 countries (5 each). H2 + H3, TLS 1.2 min, session tickets disabled. Certs issued by persona-ca-issuer. The TLS leg-2 endpoints. |
| MÓDULO ISP / BGP | bgp-router-peer/, ospf-router-peer/, vpn-remote/ | VyOS + FRR. The NGFW's default route; routes onward to each country /24. Also control-plane stress (BGP/OSPF LSA injection). |
| Postgres 16 | k8s/40-postgres.yaml | runs, agents, k6_*, metrics_buckets, audit_log, idempotency, UTXO notes, license envelopes. Daily pg_dump CronJob. |
| Observability | observability/, grafana/ | Prometheus + Grafana, SNMP exporter for NGFW/Nexus, node_exporter. DUT dashboards (NGFW, Nexus, decrypt-mode probe). |
| DUT API engine | dashboard/src/lib/dut-api/ | Vendor adapters (Cisco FTD/Nexus/UCS-CIMC, FortiGate). Captures sanitized config + identity, AES-GCM at rest, SHA-256 snapshot hashes embedded in the Test Run Report. |
| bootstrap-controller | pkg/bootstrap-controller/ | Drains TSU spend files and POSTs them to the cloud /api/usage/report hourly. |
4.2 Cloud SaaS control plane
| Component | Code | Responsibility |
|---|---|---|
| customer-app | pkg/octopus/customer-app/ | Next.js on AWS App Runner. Signup, Stripe checkout, mints TSU quota into the RDS UTXO ledger at checkout (applyTier), and (fail-safe) enqueues onboarding to cell-0. |
| Provisioning Orchestrator | pkg/octopus/provisioning-orchestrator/ | Go + Temporal worker on cell-0. Runs the OnboardingWorkflow, a 12-activity saga with LIFO compensation. See F2 HLD. |
| Cell control DB | internal/cell/migrations/ | Postgres cell database: tenants, slots, tenant_data with Postgres RLS for per-tenant isolation (ADR-0053). |
| Admin console | pkg/octopus/admin-console/ | Internal operator/billing console (separate from the on-prem Dashboard). |
| TBI builder | build/tbi/, pkg/tbi-agent/ | Builds the on-prem appliance image (mkosi → img/qcow2/oci). Ships tbi-agent, k3s, gVisor, hardware probes. |
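To make "saga with LIFO compensation" concrete, here is a minimal sketch in plain Python. It is not the Temporal SDK and not the orchestrator's Go code. The three activity names are borrowed from the cell diagram in section 7; their order, the no-op bodies, and the simulated failure are assumptions for illustration.

```python
from typing import Callable

# (name, action, compensation); all three are hypothetical stand-ins.
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> list[str]:
    """Run steps in order; on failure, undo completed steps newest-first (LIFO)."""
    log: list[str] = []
    done: list[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"fail:{name}")
            # Compensate only what actually completed, in reverse order.
            for prev_name, _, compensate in reversed(done):
                compensate()
                log.append(f"undo:{prev_name}")
            return log
        done.append(step)
        log.append(f"ok:{name}")
    return log


def noop() -> None:
    pass


def boom() -> None:
    raise RuntimeError("slot reservation failed (simulated)")


log = run_saga([
    ("AllocateCell", noop, noop),
    ("ProvisionTenant", noop, noop),
    ("ReserveSlot", boom, noop),
])
```

Undoing in reverse order matters because later activities usually depend on resources created by earlier ones, so the tenant is removed before the cell allocation is released.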
5. The two TLS legs
The whole product exists to measure the cost of decryption between two TLS legs.
flowchart TB
A["Agent pod<br/>(Playwright / k6)<br/>VLAN 20 / 30 — TRUST/INSIDE"]
DUT{{"NGFW DUT<br/>TLS decrypt · inspect · re-sign"}}
VYOS["MÓDULO ISP (VyOS + FRR)<br/>VLAN 2001 — UNTRUST/OUTSIDE<br/>default route 200.130.0.2"]
PERS["Persona Caddy pod<br/>e.g. shop.us.persona.tlsstress.art<br/>= 198.32.10.10 · H2 + H3"]
A ==>|"TLS leg 1<br/>agent trusts NGFW CA"| DUT
DUT ==>|"TLS leg 2<br/>NGFW trusts persona-ca-issuer"| VYOS
VYOS -->|"one egress per country LAN"| PERS
note1["Leg 1 trust: agents load NODE_EXTRA_CA_CERTS /<br/>SSL_CERT_FILE = NGFW CA (10-ngfw-ca ConfigMap)"]
note2["Leg 2 trust: NGFW trusts cert-manager<br/>persona-ca-issuer (platform/pki/)"]
note1 -.-> DUT
note2 -.-> VYOS
classDef agent fill:#dbeafe,stroke:#2563eb,stroke-width:1px
classDef dut fill:#fef9c3,stroke:#b45309,stroke-width:2px
classDef rtr fill:#fff7ed,stroke:#c2410c,stroke-width:1px
classDef pers fill:#f0fdf4,stroke:#16a34a,stroke-width:1px
classDef noted fill:#f5f5f7,stroke:#999,stroke-width:1px,color:#333
class A agent
class DUT dut
class VYOS rtr
class PERS pers
class note1,note2 noted
| Leg | From → To | Trust anchor | Notes |
|---|---|---|---|
| Leg 1 | Agents → NGFW | Agents trust the NGFW CA (k8s/dut/10-ngfw-ca.yaml, injected via NODE_EXTRA_CA_CERTS / SSL_CERT_FILE) | The NGFW intercepts and re-signs with its own CA. Agents must accept it to measure the inspected path. |
| Leg 2 | NGFW → Personas | NGFW trusts the persona-ca-issuer (cert-manager PKI, platform/pki/) | Personas present certs with IP SANs (real public IP) + DNS <persona>.<country>.persona.tlsstress.art. |
The NGFW only sees the single edge link (VLAN 2001); the VyOS-ISP pod routes
onward to each country /24. The NGFW interface inventory is therefore 8
interfaces (3 INSIDE + 4 OUTSIDE + 1 MGMT), not one per persona.
6. Data planes: eth0 / net1 / VLANs
Each pod is dual-homed. eth0 carries control + observability and never leaves the host (OOBI bridge); net1 is a macvlan that bypasses iptables/NetworkPolicy and carries the actual test traffic onto the VLAN trunk. This separation is what lets the bench drive line-rate traffic through a physical NGFW while keeping orchestration on a clean control channel.
flowchart TB
subgraph POD["Every data-plane pod"]
ETH0(["eth0<br/>K8s default net"])
NET1(["net1<br/>macvlan (Multus)"])
end
subgraph CTRL["Control + observability plane (eth0)"]
DASH["Dashboard :3000"]
PG["Postgres :5432"]
PROM["Prometheus :9090"]
end
subgraph DATA["Data plane (net1 → 802.1q trunk)"]
V20["VLAN 20 · 172.16.0.0/16<br/>Playwright agents (TRUST)"]
V30["VLAN 30 · 172.17.0.0/16<br/>k6 agents (TRUST)"]
V99["VLAN 99 · 10.254.254.0/24<br/>OOBI / SNMP mgmt"]
V40["VLAN 40 · DHCP<br/>Cloner ISP (NO NGFW)"]
VCO["VLANs 101-120<br/>Personas — 1 VLAN/country<br/>real public /24"]
end
DUT{{"NGFW DUT"}}
NET2["VyOS-ISP → country LANs"]
INET(["Public Internet"])
ETH0 --> CTRL
NET1 --> V20 --> DUT
NET1 --> V30 --> DUT
NET1 --> V99
NET1 --> V40 --> INET
DUT --> NET2 --> VCO
classDef ctrl fill:#fce7f3,stroke:#db2777,stroke-width:1px
classDef data fill:#dbeafe,stroke:#2563eb,stroke-width:1px
classDef dut fill:#fef9c3,stroke:#b45309,stroke-width:2px
classDef ext fill:#f5f5f7,stroke:#000,stroke-width:1px
class DASH,PG,PROM ctrl
class V20,V30,V99,V40,VCO data
class DUT dut
class INET,NET2 ext
| Plane | Interface | Segment(s) | Purpose |
|---|---|---|---|
| Control / mgmt | eth0 | K8s default net | Dashboard API, heartbeat, Prometheus scrape — never leaves the host |
| Data (Playwright) | net1 | VLAN 20 172.16.0.0/16 | Browser-driven load through the NGFW |
| Data (k6) | net1 | VLAN 30 172.17.0.0/16 | Synthetic TLS/HTTP load through the NGFW |
| SNMP / OOBI | net1 | VLAN 99 10.254.254.0/24 (VXLAN VNI 254254) | SNMP polling + OOBI orchestration fabric |
| Personas | net1 | VLANs 101-120: one VLAN per country, real public /24 | TLS leg-2 endpoints (behind VyOS-ISP) |
| Cloner ISP | net1 | VLAN 40 (DHCP) | Direct internet egress for the website cloner — bypasses the NGFW |
Schema v4.3 / ADR-0007 (Public-Internet Realism): personas carry only country in personas.yaml; the VLAN id, public /24 prefix, and gateway are derived via lookup against platform/network/public-ip-pool.yaml. RFC1918 persona addressing (10.1.x.0/27) was the stale v4.2 state and must not be used.
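The lookup described above can be sketched as follows. The pool layout, the VLAN assigned to "us", and the "first host is the gateway" rule are assumptions made for illustration; the real schema of platform/network/public-ip-pool.yaml may differ. Only the 198.32.10.x addressing is taken from the persona example in section 5.

```python
import ipaddress

# Hypothetical in-memory stand-in for platform/network/public-ip-pool.yaml.
PUBLIC_IP_POOL = {
    "us": {"vlan": 101, "prefix": "198.32.10.0/24"},
}

RFC1918 = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def derive_network(country: str, pool: dict) -> dict:
    """Derive VLAN id, prefix, and gateway for a persona from its country alone."""
    entry = pool[country]  # raises KeyError for a country missing from the pool
    net = ipaddress.ip_network(entry["prefix"])
    # Reject the stale v4.2 style of RFC1918 persona addressing.
    if any(net.overlaps(private) for private in RFC1918):
        raise ValueError(f"{net} is RFC1918; personas need a public prefix")
    return {
        "vlan": entry["vlan"],
        "prefix": str(net),
        "gateway": str(net.network_address + 1),  # assumption: first host address
    }


us = derive_network("us", PUBLIC_IP_POOL)
```

Keeping only the country on the persona means the addressing plan lives in one file, so a prefix change does not require touching every persona entry.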
7. The cell substrate (ADR-0053)
The SaaS side is built cell-based from Wave-0 to scale from dozens to hundreds of thousands of customers without a pause-the-world migration.
| Concept | Definition |
|---|---|
| Cell | Isolation unit of up to 10 000 customers with its own OCTOPUS stack. A cell crash never affects other cells. cell_id = <region>-<n> (e.g. hel1-0). |
| Sharding key | deployment_id (ULID / dpl_*) — the unit of traffic, which is what saturates. |
| Control plane | Global services across all cells: Admin Console, Provisioning Orchestrator, token marketplace, PKI root, audit chain. |
| Data plane | Per-cell, cell-isolated customer traffic (CONNECT.Art rendezvous + STUN-coord signaling). |
Wave-1 (live today) is a single cell, hel1-0 on a Hetzner VPS, but every
schema is already cell-ready (cell_id is part of every primary key), so later
waves are incremental, never a rewrite. Per-tenant isolation uses Postgres
RLS in a shared schema (scales to 10k tenants/cell without 10k schemas): the
tenant_data table has ENABLE/FORCE ROW LEVEL SECURITY with policy
USING (deployment_id = current_setting('octopus.tenant')). The control plane
runs privileged; the data plane connects as the tenant_dataplane role
(NOSUPERUSER NOBYPASSRLS), so cross-tenant reads return zero rows.
A slot (connect-art / stun-coord) is an admission entry, not a scarce
runtime resource: the data-plane services read slots to admit/reject a
deployment_id. See F2 HLD §4.6 and
ADR-0053.
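The effect of the RLS policy can be modelled in a few lines of Python. This is a behavioural model only; real enforcement happens inside Postgres. The Row type, the sample rows, and the bypass_rls flag (standing in for a privileged control-plane role) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    deployment_id: str
    payload: str


# Hypothetical contents of the shared tenant_data table.
TENANT_DATA = [
    Row("dpl_A", "a-1"),
    Row("dpl_A", "a-2"),
    Row("dpl_B", "b-1"),
]


def select_tenant_data(rows: list[Row], current_tenant: str, bypass_rls: bool = False) -> list[Row]:
    """Model of: USING (deployment_id = current_setting('octopus.tenant'))."""
    if bypass_rls:
        # Privileged control-plane view: the policy is not applied.
        return list(rows)
    # Data-plane view (NOBYPASSRLS): only rows matching the session tenant are visible.
    return [r for r in rows if r.deployment_id == current_tenant]


own_rows = select_tenant_data(TENANT_DATA, "dpl_A")
foreign_rows = select_tenant_data(TENANT_DATA, "dpl_C")
```

A data-plane session scoped to one deployment_id simply never sees other tenants' rows, which is why a cross-tenant read returns an empty result rather than an error.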
flowchart LR
subgraph CP["Control plane (global)"]
ADMIN["Admin Console"]
ORCH["Provisioning Orchestrator"]
PKI["PKI root / audit chain"]
end
subgraph CELL["Cell hel1-0 (Wave-1)"]
direction TB
CA["CONNECT.Art"]
SC["STUN-coord"]
DB[("tenants · slots · tenant_data<br/>(RLS, shared schema)")]
end
ORCH -->|"AllocateCell · ProvisionTenant · ReserveSlot"| DB
CA -->|"read slots → admit"| DB
SC -->|"read slots → admit"| DB
ADMIN -.-> ORCH
classDef cp fill:#0F1A2E,stroke:#00D8FC,color:#FCFCFC,stroke-width:2px
classDef cell fill:#f0fdf4,stroke:#16a34a,stroke-width:1px
class ADMIN,ORCH,PKI cp
class CA,SC,DB cell
8. The token economy (TSU)
Billing is metered in TSU (TLSStress Units). The flow is deliberately split so the authoritative balance lives in the cloud while spend originates on-prem.
sequenceDiagram
autonumber
participant C as Customer
participant APP as customer-app (AWS)
participant RDS as RDS UTXO ledger
participant DASH as On-prem Dashboard
participant GATE as license authorize() gate
participant PG as On-prem Postgres (UTXO notes)
participant BOOT as bootstrap-controller
participant USAGE as /api/usage/report (cloud)
C->>APP: checkout (Stripe)
APP->>RDS: applyTier — MINT quota (authoritative)
Note over DASH,GATE: every test plan execution passes the gate
DASH->>GATE: authorize(testPlanId, estimatedTokens, features)
GATE->>PG: spendNotes() — burn UTXO note serials
GATE->>PG: sealedAppend() — WORM audit record
GATE-->>DASH: receipt {sealedHash, burnedSerials, balanceAfter}
DASH->>DASH: run executes — agents emit Tick(TSU) per finished run
BOOT->>BOOT: drain /var/lib/tlsstress/spend/*.jsonl (hourly)
BOOT->>USAGE: POST usage events (stable client_seq → dedup)
Key properties (faithful to pkg/metering/README.md, dashboard/src/lib/license/):
- Authoritative mint at checkout: the customer-app mints quota into the RDS UTXO ledger (tokens are tamper-evident notes, not a mutable balance; Patent #19, no balance column, no update race).
- On-prem spend: each module emits a Tick(TSU) per finished unit of work to /var/lib/tlsstress/spend/<module>.jsonl. A broken meter must not crash the module (falls back to stderr).
- License gate chokepoint: authorize() in dashboard/src/lib/license/gate.ts is the single chokepoint every test run passes through: it spends UTXO notes, appends to the sealed audit hash-chain, and returns a signed receipt. Wave-1 is always-allow (developer envelope); enforcement (signature verify, feature gating, concurrency caps, expiry+grace) flips on under the same signature in v5.3+.
- Hourly reconciliation: the bootstrap-controller drains spend files and POSTs them to the cloud /api/usage/report; idempotency is the controller's job (stable client_seq per event → cloud rejects duplicate on retry).
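The three mechanisms above (note burning, the sealed hash-chain, and client_seq dedup) can be sketched together in Python. Everything here is a simplified assumption: the real gate is TypeScript in dashboard/src/lib/license/, receipts are signed, and this sketch issues no change notes and uses a made-up genesis hash and record layout.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting hash for the audit chain


class Ledger:
    def __init__(self, notes: dict[str, int]) -> None:
        self.unspent = dict(notes)   # note serial -> TSU value; no balance column
        self.chain: list[dict] = []  # append-only audit records
        self.head = GENESIS

    def spend_notes(self, amount: int) -> list[str]:
        """Burn whole notes until `amount` is covered; return the burned serials."""
        burned, covered = [], 0
        for serial in sorted(self.unspent):
            if covered >= amount:
                break
            covered += self.unspent[serial]
            burned.append(serial)
        if covered < amount:
            raise ValueError("insufficient TSU")
        for serial in burned:
            del self.unspent[serial]
        return burned

    def sealed_append(self, record: dict) -> str:
        """Chain each record to the previous hash so edits are detectable."""
        body = json.dumps(record, sort_keys=True)
        self.head = hashlib.sha256((self.head + body).encode()).hexdigest()
        self.chain.append({"hash": self.head, "record": record})
        return self.head


def authorize(ledger: Ledger, test_plan_id: str, estimated_tokens: int) -> dict:
    burned = ledger.spend_notes(estimated_tokens)
    sealed = ledger.sealed_append({"plan": test_plan_id, "burned": burned})
    return {"sealedHash": sealed, "burnedSerials": burned,
            "balanceAfter": sum(ledger.unspent.values())}


def verify_chain(chain: list[dict]) -> bool:
    head = GENESIS
    for entry in chain:
        body = json.dumps(entry["record"], sort_keys=True)
        head = hashlib.sha256((head + body).encode()).hexdigest()
        if head != entry["hash"]:
            return False
    return True


def ingest(seen: set[str], events: list[dict]) -> int:
    """Cloud-side dedup: accept each client_seq once, ignore retries."""
    accepted = 0
    for event in events:
        if event["client_seq"] not in seen:
            seen.add(event["client_seq"])
            accepted += 1
    return accepted


ledger = Ledger({"n1": 10, "n2": 10, "n3": 10})
receipt = authorize(ledger, "plan-1", 20)
```

Because the balance is derived from the set of unspent notes, there is no single counter to update concurrently, and replaying an hourly report with the same client_seq values changes nothing on the cloud side.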
9. Security posture (high level)
The complete 12-layer Zero-Trust-on-Premises (ZTP-prem) posture, which defends the bench against an insider operator with kubectl + root inside the customer's own org, is in SECURITY_ZTP_PREM.md. This section is the operational baseline that applies regardless of ZTP-prem layer-flip state.
| Layer | Control |
|---|---|
| Containers | Non-root, dropped Linux capabilities, RuntimeDefault seccomp, read-only root filesystem where possible. |
| Network | Egress NetworkPolicy default-deny; RFC1918 + link-local blocked from the agent fleet. Personas behind VyOS-ISP; OOBI lateral movement blocked (only the Infra Stack VIP 10.254.254.100 may initiate to MÓDULO OOBI IPs). |
| Auth | Dashboard HMAC-signed HttpOnly cookie (constant-time compare, per-IP rate limit); legacy Authorization: Basic for programmatic callers. The cloud trigger is HMAC-SHA256 with constant-time verify. |
| Supply chain | gitleaks + trivy per PR, CodeQL weekly; images Cosign-keyless signed via GitHub OIDC; SPDX SBOM attached per release tag. |
| Pre-flight gate | 5-check catalog before every run (subnet conflict, NGFW reachability + decrypt mode, persona PKI freshness, NTP skew, DUT API auth). Bypass is logged in audit_log + printed on the report cover. |
| Data at rest | DUT API credentials AES-GCM encrypted; sanitized snapshots SHA-256 hashed into the Test Run Report (Annex B/C/D). |
| Tenant isolation (cloud) | Postgres RLS with the tenant_dataplane role (NOBYPASSRLS); cross-tenant read = 0 rows (proven live). |
| Auditability | audit_log records every config mutation (actor, source IP, before/after JSON); ZTP-prem sealed audit hash-chain (WORM) records every license/admission/egress decision. |
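The Auth row's "HMAC-SHA256 with constant-time verify" corresponds to a well-known pattern, sketched here with Python's standard library. The function names, the hex encoding of the signature, and the sample secret and body are assumptions; how the real trigger carries the signature is not specified in this document.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Return the hex HMAC-SHA256 of the request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Recompute the MAC and compare without leaking timing information."""
    expected = sign(secret, body)
    # compare_digest takes time independent of where the inputs first differ.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


# Hypothetical values for illustration only.
secret = b"example-shared-secret"
body = b'{"deployment_id": "dpl_example"}'
signature = sign(secret, body)
```

A plain `==` comparison can return early on the first mismatching byte, which is the timing side channel the constant-time compare is meant to close.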
10. Deployment topology (ADR-0011)
Topology is configured via platform/topology.yaml along three independent
axes, so a single-node lab and a multi-node prod fabric share one codebase.
| Axis | Values | Controls |
|---|---|---|
| deployment_nodes | single · dual · tri · multi | Number of UCS hosts + role distribution |
| l2_fabric | nexus · none | External L2 switch (or its absence) |
| dut_type | cisco-ftd · cisco-secure-router | DUT vendor; gates apply/verify scripts |
Single-node without an external switch is first-class (l2_fabric: none):
UCS NICs cable directly into the NGFW, Nexus tuning is skipped, but Linux-bridge
BPDU attestation still runs. The OOBI (eth0) plane is mandatory on every host in
every mode. Kustomize overlays (overlays/dual-node|tri-node|multi-node/) pin
each workload tier to its dedicated host via nodeSelector patches.
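Because the three axes are independent, validating a topology amounts to checking each key against its allowed set. The sketch below assumes the parsed platform/topology.yaml is a flat mapping with exactly these keys, and that single-node deployments use no overlay; both are assumptions, as are the function names.

```python
from typing import Optional

# Allowed values per axis, taken from the table above.
ALLOWED = {
    "deployment_nodes": {"single", "dual", "tri", "multi"},
    "l2_fabric": {"nexus", "none"},
    "dut_type": {"cisco-ftd", "cisco-secure-router"},
}


def validate_topology(cfg: dict) -> list[str]:
    """Return one error message per axis that is missing or out of range."""
    errors = []
    for axis, allowed in ALLOWED.items():
        value = cfg.get(axis)
        if value not in allowed:
            errors.append(f"{axis}={value!r} not in {sorted(allowed)}")
    return errors


def overlay_for(cfg: dict) -> Optional[str]:
    """Map deployment_nodes to its Kustomize overlay directory."""
    nodes = cfg["deployment_nodes"]
    # Assumption: single-node uses the base manifests with no overlay.
    return None if nodes == "single" else f"overlays/{nodes}-node/"


lab = {"deployment_nodes": "single", "l2_fabric": "none", "dut_type": "cisco-ftd"}
prod = {"deployment_nodes": "dual", "l2_fabric": "nexus", "dut_type": "cisco-ftd"}
```

Any combination of valid values passes, which reflects the point of the design: a single-node lab with no switch and a multi-node fabric differ only in configuration.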
11. Cross-references
| Topic | Document |
|---|---|
| Component tour, sequence diagrams, DUT topology | docs/ARCHITECTURE.md |
| F2 auto-provisioning + cell-0 deep-dive | docs/F2_HETZNER_CELL0_ARCHITECTURE.md |
| ZTP-prem 12-layer security | docs/SECURITY_ZTP_PREM.md |
| Cell-based hyperscale rationale | ADR-0053 |
| Decision records index | ADR/README.md |
| Project conventions & rules | CLAUDE.md |