TLSStress.Art — System High-Level Design (HLD)

Audience: architects, operators, security reviewers, and investors who need the whole-system picture in one document. This HLD sits above the Architecture overview (component tour) and the Architecture Decision Records (per-decision rationale). Where this document summarises, those documents are authoritative on detail.

Companion deep-dives: F2 — Cell-0 (Hetzner) architecture, ZTP-prem security posture, ADR-0053 (cell-based hyperscale).


1. Purpose & scope

TLSStress.Art is a test-bed for measuring TLS-decryption performance on Next-Generation Firewalls (NGFWs) acting as the Device Under Test (DUT). The primary target is HTTP/3 (QUIC over UDP/443) — every modern NGFW claims QUIC inspection — with HTTP/2 (TCP/443) measured secondarily for comparison.

The product has two faces:

| Face | What it is | Where it runs |
| --- | --- | --- |
| SaaS control plane | Signup, billing, token (TSU) ledger, auto-provisioning, central enrollment | AWS (App Runner + RDS) · Cloudflare (edge/DNS) · Hetzner cell-0 (K3s) |
| On-prem appliance (TBI) | The bench itself (agents, personas, dashboard, observability), installed next to the customer's NGFW | Customer datacenter / lab (K3s/RKE2 or Docker Desktop dev) |

The bench drives traffic from the agents (Playwright + k6) through the NGFW, which decrypts, inspects, and re-signs TLS, and on to the webservers (the Personas). By comparing handshake timing and throughput with decryption ON versus BYPASS, the operator quantifies the NGFW's TLS-inspection cost.
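
As a minimal sketch of the ON-versus-BYPASS comparison, assume handshake timings (in milliseconds) have already been collected for both modes. The inspection cost can then be expressed as the added latency at a chosen percentile. The function names and sample values are illustrative and are not taken from the product code.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def inspection_cost(bypass_ms, decrypt_ms, pct=50):
    """Compare handshake timings with decryption in BYPASS vs ON."""
    base = percentile(bypass_ms, pct)
    inspected = percentile(decrypt_ms, pct)
    return {
        "bypass_ms": base,
        "decrypt_ms": inspected,
        "added_ms": inspected - base,
        "overhead_pct": round(100 * (inspected - base) / base, 1),
    }


# Illustrative samples only, not measured data.
bypass = [10.0, 12.0, 11.0, 13.0, 12.0]
decrypt = [18.0, 21.0, 19.0, 24.0, 20.0]
cost = inspection_cost(bypass, decrypt)
print(cost)
```

The same function can be called with `pct=95` or `pct=99` to compare tail latency rather than the median.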


2. System context — actors

| Actor | Role | Touchpoint |
| --- | --- | --- |
| Customer | Buys a package on the SaaS, receives a provisioned tenant + a downloadable on-prem appliance (TBI) | app.tlsstress.art (signup/pay), TBI download (S3/F1) |
| Operator | Runs the bench against a real NGFW. The Dashboard is the ONLY operator interface: all config changes flow through the UI (never kubectl edit / curl admin/ / exec) | On-prem Dashboard (Next.js) |
| NGFW DUT | The device being measured. L3 default gateway for the test-plane VLANs; performs TLS decryption | Physical/virtual firewall on 802.1q trunks |

flowchart LR
    CUST(["Customer"])
    OPER(["Operator"])

    subgraph SAAS["SaaS control plane (always-online)"]
        APP["customer-app<br/>App Runner · Next.js<br/>signup · Stripe · TSU mint"]
        CELL["cell-0 (Hetzner K3s)<br/>Provisioning Orchestrator<br/>Temporal saga · tenant RLS"]
        TBI[("TBI artifacts<br/>S3 · img/qcow2/oci")]
    end

    subgraph PREM["On-prem appliance (customer datacenter)"]
        DASH["Dashboard (Next.js)<br/>the ONLY operator UI"]
        AGENTS["Agents<br/>Playwright + k6"]
        PERS["Personas<br/>100 webservers (Caddy)"]
        OBS["Observability<br/>Prometheus + Grafana"]
    end

    DUT{{"NGFW DUT<br/>TLS decrypt · inspect · re-sign"}}

    CUST -->|"signup / pay"| APP
    APP -->|"applyTier (mint quota)"| APP
    APP -->|"enqueueOnboarding (HMAC)"| CELL
    CUST -->|"download appliance"| TBI
    TBI -.->|"dd to USB / KVM / boot"| PREM

    OPER -->|"login · configure"| DASH
    DASH --> AGENTS
    DASH --> PERS
    DASH --> OBS
    AGENTS ==>|"TLS leg 1"| DUT
    DUT ==>|"TLS leg 2"| PERS
    AGENTS -.->|"TSU spend events"| DASH
    DASH -.->|"usage report (hourly)"| APP

    classDef saas fill:#0F1A2E,stroke:#00D8FC,color:#FCFCFC,stroke-width:2px
    classDef prem fill:#f0fdf4,stroke:#16a34a,stroke-width:1px
    classDef dut  fill:#fef9c3,stroke:#b45309,stroke-width:2px
    class APP,CELL,TBI saas
    class DASH,AGENTS,PERS,OBS prem
    class DUT dut

3. Container view (C4-ish)

The next level zooms into the deployable containers and their protocols. Test orchestration stays on-prem (Dashboard ↔ agents over the OOBI bridge), while the billing/provisioning control plane lives in the cloud (customer-app ↔ cell-0). The only on-prem→cloud channel is the hourly TSU usage report.

graph TB
  CUST(["Customer<br/>browser"])
  OPER(["Operator<br/>browser"])

  subgraph CLOUD["Cloud SaaS control plane"]
    direction TB
    APP["customer-app<br/>(App Runner, Next.js)"]
    RDS[("RDS · UTXO ledger<br/>mint @ checkout")]
    STRIPE{{"Stripe"}}
    PROV["Provisioning Orchestrator<br/>(cell-0, Go + Temporal)"]
    CELLDB[("Cell control DB<br/>tenants · slots (RLS)")]
  end

  subgraph ONPREM["On-prem appliance"]
    direction TB
    DASH["Dashboard<br/>(Next.js · Drizzle)<br/>license authorize() gate"]
    PG[("Postgres 16<br/>runs · agents · audit_log<br/>UTXO notes · idempotency")]
    PWA["Playwright agents<br/>(1..80, HPA)"]
    K6A["k6 agents<br/>(1..200, HPA)"]
    PERS["Personas (Caddy)<br/>100 / 20 countries"]
    VYOS["MÓDULO ISP/BGP<br/>(VyOS + FRR)"]
    PROM["Prometheus + Grafana"]
    BOOT["bootstrap-controller<br/>(TSU spend drain)"]
  end

  DUT{{"NGFW DUT"}}

  CUST -->|"HTTPS"| APP
  STRIPE -->|"checkout.session.completed"| APP
  APP -->|"applyTier mint"| RDS
  APP -->|"HMAC POST /trigger"| PROV
  PROV -->|"saga"| CELLDB

  OPER -->|"HMAC cookie / Basic"| DASH
  DASH -->|"SQL"| PG
  DASH <-->|"Bearer + Idempotency-Key"| PWA
  DASH <-->|"Bearer"| K6A
  DASH -->|"licence note spend"| PG
  PWA ==>|"TLS leg 1 · h2/h3"| DUT
  K6A ==>|"TLS leg 1 · h2/h3"| DUT
  DUT ==>|"TLS leg 2 (re-signed)"| VYOS --> PERS
  PROM -->|"scrape /api/metrics · SNMP"| DASH
  BOOT -.->|"hourly POST /api/usage/report"| APP

  classDef cloud fill:#0F1A2E,stroke:#00D8FC,color:#FCFCFC,stroke-width:2px
  classDef prem  fill:#f0fdf4,stroke:#16a34a,stroke-width:1px
  classDef dut   fill:#fef9c3,stroke:#b45309,stroke-width:2px
  class APP,RDS,STRIPE,PROV,CELLDB cloud
  class DASH,PG,PWA,K6A,PERS,VYOS,PROM,BOOT prem
  class DUT dut

4. Major components & responsibilities

4.1 On-prem appliance (the bench)

| Component | Code | Responsibility |
| --- | --- | --- |
| Dashboard | dashboard/ | Next.js 15 App Router. The only operator UI. HMAC-cookie auth, audit_log of every mutation, SSE live stream, Prometheus /api/metrics, the license authorize() gate (dashboard/src/lib/license/). |
| Playwright agents | agent/ | Headless Chromium workers. Fresh BrowserContext per cycle, capture protocol / tcp_sockets_open / web-vitals. Per-target circuit breaker, idempotency keys. HPA min=1, max=80. |
| k6 agents | k6-agent/ | Node.js wrapper around the grafana/k6 binary. Emits p50/p95/p99, error_rate, rps, data_received. readOnlyRootFilesystem, memory-backed /tmp. HPA min=1, max=200. |
| Personas | personas/, webserver/ | 100 Caddy webservers across 20 countries (5 each). H2 + H3, TLS 1.2 min, session tickets disabled. Certs issued by persona-ca-issuer. The TLS leg-2 endpoints. |
| MÓDULO ISP / BGP | bgp-router-peer/, ospf-router-peer/, vpn-remote/ | VyOS + FRR. The NGFW's default route; routes onward to each country /24. Also control-plane stress (BGP/OSPF LSA injection). |
| Postgres 16 | k8s/40-postgres.yaml | runs, agents, k6_*, metrics_buckets, audit_log, idempotency, UTXO notes, license envelopes. Daily pg_dump CronJob. |
| Observability | observability/, grafana/ | Prometheus + Grafana, SNMP exporter for NGFW/Nexus, node_exporter. DUT dashboards (NGFW, Nexus, decrypt-mode probe). |
| DUT API engine | dashboard/src/lib/dut-api/ | Vendor adapters (Cisco FTD/Nexus/UCS-CIMC, FortiGate). Captures sanitized config + identity, AES-GCM at rest, SHA-256 snapshot hashes embedded in the Test Run Report. |
| bootstrap-controller | pkg/bootstrap-controller/ | Drains TSU spend files and POSTs them to the cloud /api/usage/report hourly. |

4.2 Cloud SaaS control plane

| Component | Code | Responsibility |
| --- | --- | --- |
| customer-app | pkg/octopus/customer-app/ | Next.js on AWS App Runner. Signup, Stripe checkout, mints TSU quota into the RDS UTXO ledger at checkout (applyTier), and (fail-safe) enqueues onboarding to cell-0. |
| Provisioning Orchestrator | pkg/octopus/provisioning-orchestrator/ | Go + Temporal worker on cell-0. Runs the OnboardingWorkflow, a 12-activity saga with LIFO compensation. See F2 HLD. |
| Cell control DB | internal/cell/migrations/ | Postgres cell database: tenants, slots, tenant_data with Postgres RLS for per-tenant isolation (ADR-0053). |
| Admin console | pkg/octopus/admin-console/ | Internal operator/billing console (separate from the on-prem Dashboard). |
| TBI builder | build/tbi/, pkg/tbi-agent/ | Builds the on-prem appliance image (mkosi → img/qcow2/oci). Ships tbi-agent, k3s, gVisor, hardware probes. |

5. The two TLS legs

The whole product exists to measure the cost of decryption between two TLS legs.

flowchart TB
    A["Agent pod<br/>(Playwright / k6)<br/>VLAN 20 / 30 — TRUST/INSIDE"]
    DUT{{"NGFW DUT<br/>TLS decrypt · inspect · re-sign"}}
    VYOS["MÓDULO ISP (VyOS + FRR)<br/>VLAN 2001 — UNTRUST/OUTSIDE<br/>default route 200.130.0.2"]
    PERS["Persona Caddy pod<br/>e.g. shop.us.persona.tlsstress.art<br/>= 198.32.10.10 · H2 + H3"]

    A ==>|"TLS leg 1<br/>agent trusts NGFW CA"| DUT
    DUT ==>|"TLS leg 2<br/>NGFW trusts persona-ca-issuer"| VYOS
    VYOS -->|"one egress per country LAN"| PERS

    note1["Leg 1 trust: agents load NODE_EXTRA_CA_CERTS /<br/>SSL_CERT_FILE = NGFW CA (10-ngfw-ca ConfigMap)"]
    note2["Leg 2 trust: NGFW trusts cert-manager<br/>persona-ca-issuer (platform/pki/)"]
    note1 -.-> DUT
    note2 -.-> VYOS

    classDef agent fill:#dbeafe,stroke:#2563eb,stroke-width:1px
    classDef dut   fill:#fef9c3,stroke:#b45309,stroke-width:2px
    classDef rtr   fill:#fff7ed,stroke:#c2410c,stroke-width:1px
    classDef pers  fill:#f0fdf4,stroke:#16a34a,stroke-width:1px
    classDef noted fill:#f5f5f7,stroke:#999,stroke-width:1px,color:#333
    class A agent
    class DUT dut
    class VYOS rtr
    class PERS pers
    class note1,note2 noted

| Leg | From → To | Trust anchor | Notes |
| --- | --- | --- | --- |
| Leg 1 | Agents → NGFW | Agents trust the NGFW CA (k8s/dut/10-ngfw-ca.yaml, injected via NODE_EXTRA_CA_CERTS / SSL_CERT_FILE) | The NGFW intercepts and re-signs with its own CA. Agents must accept it to measure the inspected path. |
| Leg 2 | NGFW → Personas | NGFW trusts the persona-ca-issuer (cert-manager PKI, platform/pki/) | Personas present certs with IP SANs (real public IP) + DNS <persona>.<country>.persona.tlsstress.art. |

The NGFW only sees the single edge link (VLAN 2001); the VyOS-ISP pod routes onward to each country /24. The NGFW interface inventory is therefore 8 interfaces (3 INSIDE + 4 OUTSIDE + 1 MGMT), not one per persona.


6. Data planes — eth0 / net1 / VLANs

Each pod is dual-homed. eth0 carries control + observability and never leaves the host (OOBI bridge); net1 is a macvlan that bypasses iptables/NetworkPolicy and carries the actual test traffic onto the VLAN trunk. This separation is what lets the bench drive line-rate traffic through a physical NGFW while keeping orchestration on a clean control channel.

flowchart TB
    subgraph POD["Every data-plane pod"]
        ETH0(["eth0<br/>K8s default net"])
        NET1(["net1<br/>macvlan (Multus)"])
    end

    subgraph CTRL["Control + observability plane (eth0)"]
        DASH["Dashboard :3000"]
        PG["Postgres :5432"]
        PROM["Prometheus :9090"]
    end

    subgraph DATA["Data plane (net1 → 802.1q trunk)"]
        V20["VLAN 20 · 172.16.0.0/16<br/>Playwright agents (TRUST)"]
        V30["VLAN 30 · 172.17.0.0/16<br/>k6 agents (TRUST)"]
        V99["VLAN 99 · 10.254.254.0/24<br/>OOBI / SNMP mgmt"]
        V40["VLAN 40 · DHCP<br/>Cloner ISP (NO NGFW)"]
        VCO["VLANs 101-120<br/>Personas — 1 VLAN/country<br/>real public /24"]
    end

    DUT{{"NGFW DUT"}}
    NET2["VyOS-ISP → country LANs"]
    INET(["Public Internet"])

    ETH0 --> CTRL
    NET1 --> V20 --> DUT
    NET1 --> V30 --> DUT
    NET1 --> V99
    NET1 --> V40 --> INET
    DUT --> NET2 --> VCO

    classDef ctrl fill:#fce7f3,stroke:#db2777,stroke-width:1px
    classDef data fill:#dbeafe,stroke:#2563eb,stroke-width:1px
    classDef dut  fill:#fef9c3,stroke:#b45309,stroke-width:2px
    classDef ext  fill:#f5f5f7,stroke:#000,stroke-width:1px
    class DASH,PG,PROM ctrl
    class V20,V30,V99,V40,VCO data
    class DUT dut
    class INET,NET2 ext

| Plane | Interface | Segment(s) | Purpose |
| --- | --- | --- | --- |
| Control / mgmt | eth0 | K8s default net | Dashboard API, heartbeat, Prometheus scrape; never leaves the host |
| Data (Playwright) | net1 | VLAN 20 172.16.0.0/16 | Browser-driven load through the NGFW |
| Data (k6) | net1 | VLAN 30 172.17.0.0/16 | Synthetic TLS/HTTP load through the NGFW |
| SNMP / OOBI | net1 | VLAN 99 10.254.254.0/24 (VXLAN VNI 254254) | SNMP polling + OOBI orchestration fabric |
| Personas | net1 | VLANs 101-120 (one VLAN per country, real public /24) | TLS leg-2 endpoints (behind VyOS-ISP) |
| Cloner ISP | net1 | VLAN 40 (DHCP) | Direct internet egress for the website cloner; bypasses the NGFW |

Schema v4.3 / ADR-0007 (Public-Internet Realism): personas carry only country in personas.yaml; the VLAN id, public /24 prefix, and gateway are derived via lookup against platform/network/public-ip-pool.yaml. RFC1918 persona addressing (10.1.x.0/27) was the stale v4.2 state and must not be used.
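
The lookup described above can be sketched as follows. This is an illustration under stated assumptions, not the project's loader: the pool structure, the VLAN number for `us`, and the choice of the first host address as gateway are all hypothetical. Only the VLAN range 101-120, the /24 size, and the ban on RFC1918 addressing come from this document.

```python
import ipaddress

# Hypothetical, already-parsed excerpt of platform/network/public-ip-pool.yaml.
# The key names and the VLAN assignment are assumptions for illustration.
PUBLIC_IP_POOL = {
    "us": {"vlan": 101, "prefix": "198.32.10.0/24"},
}


def derive_network(country, pool):
    """Derive VLAN id, public /24 prefix and gateway from a persona's country."""
    entry = pool[country]  # KeyError if the country has no pool entry
    net = ipaddress.ip_network(entry["prefix"])
    if net.is_private:
        raise ValueError(f"{net} is private; personas need public addressing")
    if net.prefixlen != 24:
        raise ValueError(f"{net} is not a /24")
    if not 101 <= entry["vlan"] <= 120:
        raise ValueError(f"VLAN {entry['vlan']} is outside 101-120")
    # Assumption: the gateway is the first address after the network address.
    gateway = net.network_address + 1
    return {"vlan": entry["vlan"], "prefix": str(net), "gateway": str(gateway)}


print(derive_network("us", PUBLIC_IP_POOL))
```

Keeping only `country` on the persona and deriving the rest means a stale RFC1918 prefix such as `10.1.0.0/27` fails fast at lookup time instead of reaching the cluster.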


7. The cell substrate (ADR-0053)

The SaaS side is built cell-based from Wave-0 to scale from dozens to hundreds of thousands of customers without a pause-the-world migration.

| Concept | Definition |
| --- | --- |
| Cell | Isolation unit of up to 10 000 customers with its own OCTOPUS stack. A cell crash never affects other cells. cell_id = <region>-<n> (e.g. hel1-0). |
| Sharding key | deployment_id (ULID / dpl_*), the unit of traffic, which is what saturates. |
| Control plane | Global services across all cells: Admin Console, Provisioning Orchestrator, token marketplace, PKI root, audit chain. |
| Data plane | Per-cell, cell-isolated customer traffic (CONNECT.Art rendezvous + STUN-coord signaling). |

Wave-1 (live today) is a single cell — hel1-0 on a Hetzner VPS — but every schema is already cell-ready (cell_id column on every primary key) so later waves are incremental, never a rewrite. Per-tenant isolation uses Postgres RLS in a shared schema (scales to 10k tenants/cell without 10k schemas): the tenant_data table has ENABLE/FORCE ROW LEVEL SECURITY with policy USING (deployment_id = current_setting('octopus.tenant')). The control-plane runs privileged; the data-plane connects as the tenant_dataplane role (NOSUPERUSER NOBYPASSRLS), so cross-tenant reads return zero rows.

A slot (connect-art / stun-coord) is an admission entry, not a scarce runtime resource: the data-plane services read slots to admit/reject a deployment_id. See F2 HLD §4.6 and ADR-0053.
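
To make the "admission entry" idea concrete, here is a minimal in-memory sketch. In the real system the slots live in the cell control DB and are read by the data-plane services; the `Slot` type, the `admit` function, and the sample deployment id below are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    """One admission entry: a deployment may use a service in a cell."""
    cell_id: str        # e.g. "hel1-0"
    deployment_id: str  # sharding key, dpl_* style
    service: str        # "connect-art" or "stun-coord"


def admit(slots, cell_id, service, deployment_id):
    """Admit a deployment only if a matching slot has been reserved."""
    return Slot(cell_id, deployment_id, service) in slots


# Illustrative reservation; the id is made up.
reserved = {Slot("hel1-0", "dpl_EXAMPLE", "connect-art")}
print(admit(reserved, "hel1-0", "connect-art", "dpl_EXAMPLE"))
print(admit(reserved, "hel1-0", "stun-coord", "dpl_EXAMPLE"))
```

Because admission is a lookup rather than a claim on a runtime resource, rejecting a deployment is simply the absence of a row.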

flowchart LR
    subgraph CP["Control plane (global)"]
        ADMIN["Admin Console"]
        ORCH["Provisioning Orchestrator"]
        PKI["PKI root / audit chain"]
    end
    subgraph CELL["Cell hel1-0 (Wave-1)"]
        direction TB
        CA["CONNECT.Art"]
        SC["STUN-coord"]
        DB[("tenants · slots · tenant_data<br/>(RLS, shared schema)")]
    end
    ORCH -->|"AllocateCell · ProvisionTenant · ReserveSlot"| DB
    CA -->|"read slots → admit"| DB
    SC -->|"read slots → admit"| DB
    ADMIN -.-> ORCH

    classDef cp   fill:#0F1A2E,stroke:#00D8FC,color:#FCFCFC,stroke-width:2px
    classDef cell fill:#f0fdf4,stroke:#16a34a,stroke-width:1px
    class ADMIN,ORCH,PKI cp
    class CA,SC,DB cell

8. The token economy (TSU)

Billing is metered in TSU (TLSStress Units). The flow is deliberately split so the authoritative balance lives in the cloud while spend originates on-prem.

sequenceDiagram
    autonumber
    participant C as Customer
    participant APP as customer-app (AWS)
    participant RDS as RDS UTXO ledger
    participant DASH as On-prem Dashboard
    participant GATE as license authorize() gate
    participant PG as On-prem Postgres (UTXO notes)
    participant BOOT as bootstrap-controller
    participant USAGE as /api/usage/report (cloud)

    C->>APP: checkout (Stripe)
    APP->>RDS: applyTier — MINT quota (authoritative)
    Note over DASH,GATE: every test plan execution passes the gate
    DASH->>GATE: authorize(testPlanId, estimatedTokens, features)
    GATE->>PG: spendNotes() — burn UTXO note serials
    GATE->>PG: sealedAppend() — WORM audit record
    GATE-->>DASH: receipt {sealedHash, burnedSerials, balanceAfter}
    DASH->>DASH: run executes — agents emit Tick(TSU) per finished run
    BOOT->>BOOT: drain /var/lib/tlsstress/spend/*.jsonl (hourly)
    BOOT->>USAGE: POST usage events (stable client_seq → dedup)

Key properties (faithful to pkg/metering/README.md, dashboard/src/lib/license/):

  • Authoritative mint at checkout: the customer-app mints quota into the RDS UTXO ledger (tokens are tamper-evident notes, not a mutable balance; Patent #19, no balance column, no update race).
  • On-prem spend: each module emits a Tick(TSU) per finished unit of work to /var/lib/tlsstress/spend/<module>.jsonl. A broken meter must not crash the module (it falls back to stderr).
  • License gate chokepoint: authorize() in dashboard/src/lib/license/gate.ts is the single chokepoint every test run passes through. It spends UTXO notes, appends to the sealed audit hash-chain, and returns a signed receipt. Wave-1 is always-allow (developer envelope); enforcement (signature verify, feature gating, concurrency caps, expiry+grace) flips on behind the same authorize() signature in v5.3+.
  • Hourly reconciliation: the bootstrap-controller drains spend files and POSTs them to the cloud /api/usage/report; idempotency is the controller's job (a stable client_seq per event lets the cloud reject duplicates on retry).
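
The gate's three steps (spend notes, append to the hash-chain, return a receipt) can be sketched in a few lines. This is an in-memory Python illustration, not the code in gate.ts. It burns whole notes without issuing change, rejects when notes run out (which Wave-1's always-allow mode does not do), omits the receipt signature, and invents the record layout and the `Ledger` type.

```python
import hashlib
import json
from dataclasses import dataclass, field

GENESIS = "0" * 64  # assumed anchor for the first chain entry


@dataclass
class Ledger:
    unspent: dict                              # note serial -> TSU value
    chain: list = field(default_factory=list)  # sealed audit records

    def spend_notes(self, amount):
        """Burn whole notes until `amount` is covered; return their serials."""
        burned, covered = [], 0
        for serial in sorted(self.unspent):
            if covered >= amount:
                break
            covered += self.unspent[serial]
            burned.append(serial)
        if covered < amount:
            raise PermissionError("insufficient TSU notes")
        for serial in burned:
            del self.unspent[serial]
        return burned

    def sealed_append(self, record):
        """Append a record whose hash covers the previous entry's hash."""
        prev = self.chain[-1]["hash"] if self.chain else GENESIS
        body = json.dumps(record, sort_keys=True)
        digest = hashlib.sha256((prev + body).encode()).hexdigest()
        self.chain.append({"prev": prev, "record": record, "hash": digest})
        return digest


def authorize(ledger, test_plan_id, estimated_tokens):
    burned = ledger.spend_notes(estimated_tokens)
    sealed = ledger.sealed_append(
        {"testPlanId": test_plan_id, "burnedSerials": burned})
    # The balance is derived from the remaining notes, never stored.
    return {"sealedHash": sealed, "burnedSerials": burned,
            "balanceAfter": sum(ledger.unspent.values())}


def verify_chain(chain):
    """Recompute every link; any edited record breaks the chain."""
    prev = GENESIS
    for entry in chain:
        body = json.dumps(entry["record"], sort_keys=True)
        digest = hashlib.sha256((prev + body).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != digest:
            return False
        prev = entry["hash"]
    return True


ledger = Ledger({"n1": 10, "n2": 10, "n3": 10})
receipt = authorize(ledger, "plan-1", 15)
print(receipt["burnedSerials"], receipt["balanceAfter"])
```

Deriving `balanceAfter` from the surviving notes mirrors the "no balance column" property: there is no mutable counter to race on, only notes that either exist or have been burned.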

9. Security posture (high level)

The complete 12-layer Zero-Trust-on-Premises (ZTP-prem) posture — which defends the bench against an insider operator with kubectl + root inside the customer's own org — is in SECURITY_ZTP_PREM.md. This section is the operational baseline that applies regardless of ZTP-prem layer-flip state.

| Layer | Control |
| --- | --- |
| Containers | Non-root, dropped Linux capabilities, RuntimeDefault seccomp, read-only root filesystem where possible. |
| Network | Egress NetworkPolicy default-deny; RFC1918 + link-local blocked from the agent fleet. Personas behind VyOS-ISP; OOBI lateral movement blocked (only the Infra Stack VIP 10.254.254.100 may initiate to MÓDULO OOBI IPs). |
| Auth | Dashboard HMAC-signed HttpOnly cookie (constant-time compare, per-IP rate limit); legacy Authorization: Basic for programmatic callers. The cloud trigger is HMAC-SHA256 with constant-time verify. |
| Supply chain | gitleaks + trivy per PR, CodeQL weekly; images Cosign-keyless signed via GitHub OIDC; SPDX SBOM attached per release tag. |
| Pre-flight gate | 5-check catalog before every run (subnet conflict, NGFW reachability + decrypt mode, persona PKI freshness, NTP skew, DUT API auth). Bypass is logged in audit_log + printed on the report cover. |
| Data at rest | DUT API credentials AES-GCM encrypted; sanitized snapshots SHA-256 hashed into the Test Run Report (Annex B/C/D). |
| Tenant isolation (cloud) | Postgres RLS with the tenant_dataplane role (NOBYPASSRLS); cross-tenant read = 0 rows (proven live). |
| Auditability | audit_log records every config mutation (actor, source IP, before/after JSON); ZTP-prem sealed audit hash-chain (WORM) records every license/admission/egress decision. |
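
The "HMAC-SHA256 with constant-time verify" control in the Auth row follows a standard pattern, sketched here with Python's standard library. The function names, the hex encoding of the signature, and the sample secret and body are assumptions; the real trigger's header names and payload format are not specified in this document.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Return the hex HMAC-SHA256 of the request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Check a signature without leaking timing information."""
    expected = sign(secret, body)
    # compare_digest takes time independent of where the inputs differ.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


# Illustrative values only.
secret = b"shared-secret"
body = b'{"deployment_id": "dpl_EXAMPLE"}'
signature = sign(secret, body)
print(verify(secret, body, signature))
print(verify(secret, body + b" ", signature))
```

A plain `==` comparison can return early on the first differing byte, which is the timing side channel that `hmac.compare_digest` is designed to avoid.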

10. Deployment topology (ADR-0011)

Topology is configured via platform/topology.yaml along three independent axes, so a single-node lab and a multi-node prod fabric share one codebase.

| Axis | Values | Controls |
| --- | --- | --- |
| deployment_nodes | single · dual · tri · multi | Number of UCS hosts + role distribution |
| l2_fabric | nexus · none | External L2 switch (or its absence) |
| dut_type | cisco-ftd · cisco-secure-router | DUT vendor; gates the apply/verify scripts |

Single-node without an external switch is first-class (l2_fabric: none): UCS NICs cable directly into the NGFW, Nexus tuning is skipped, but Linux-bridge BPDU attestation still runs. The OOBI (eth0) plane is mandatory on every host in every mode. Kustomize overlays (overlays/dual-node|tri-node|multi-node/) pin each workload tier to its dedicated host via nodeSelector patches.
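
A short sketch of how the three axes could be validated and turned into deployment decisions. It assumes platform/topology.yaml has already been parsed into a dict with the axis names as keys. The `plan_topology` function and the keys of its result are hypothetical, and treating the single-node case as "no overlay" is an assumption.

```python
# Allowed values per axis, as listed in the table above.
ALLOWED = {
    "deployment_nodes": {"single", "dual", "tri", "multi"},
    "l2_fabric": {"nexus", "none"},
    "dut_type": {"cisco-ftd", "cisco-secure-router"},
}


def plan_topology(cfg):
    """Validate the three axes and derive what the deployment should do."""
    for axis, allowed in ALLOWED.items():
        value = cfg.get(axis)
        if value not in allowed:
            raise ValueError(f"{axis}={value!r} not in {sorted(allowed)}")
    nodes = cfg["deployment_nodes"]
    return {
        # Assumption: single-node uses the base manifests with no overlay.
        "overlay": None if nodes == "single" else f"overlays/{nodes}-node/",
        "tune_nexus": cfg["l2_fabric"] == "nexus",  # skipped when no switch
        "bpdu_attestation": True,                   # runs in every mode
        "oobi_required": True,                      # mandatory on every host
        "dut_scripts": cfg["dut_type"],             # selects apply/verify
    }


lab = plan_topology({"deployment_nodes": "single",
                     "l2_fabric": "none",
                     "dut_type": "cisco-ftd"})
print(lab)
```

Because the axes are independent, any combination of valid values yields a plan, which is how a single-node lab and a multi-node fabric can share one codebase.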


11. Cross-references

| Topic | Document |
| --- | --- |
| Component tour, sequence diagrams, DUT topology | docs/ARCHITECTURE.md |
| F2 auto-provisioning + cell-0 deep-dive | docs/F2_HETZNER_CELL0_ARCHITECTURE.md |
| ZTP-prem 12-layer security | docs/SECURITY_ZTP_PREM.md |
| Cell-based hyperscale rationale | ADR-0053 |
| Decision records index | ADR/README.md |
| Project conventions & rules | CLAUDE.md |