ADR 0007 — Public-Internet Realism: public IPs, multi-CA, ISP router, NAT/decrypt matrix

Status: Accepted
Date: 2026-05-07
Deciders: André Luiz Gallon
Targets: v4.2.0 (foundation) + v4.3.0 (scale-out + quadrant matrix)

Context

The current test-bed places the NGFW as the L3 gateway of every persona VLAN (10.1.x.0/27 for the 20 Synthetic Personas, 10.2.x.0/27 for the 10 Cloned Persona slots). This is not how production NGFWs are deployed. In production:

The NGFW is an inspection hop, not an access router. Customer servers do not hang off NGFW sub-interfaces; they live behind a separate L3 device (the ISP/carrier router or a customer-edge switch).
The link between customer-edge NGFW and the ISP carrier is a public IP /30 assigned by the carrier; the NGFW default-routes to the carrier next-hop.
Web servers reachable from corporate clients have public IP addresses geo-classified to their actual hosting country; clients accessing the internet do not see RFC1918 destinations.
TLS certificates are issued by multiple public CAs (Let's Encrypt, DigiCert, Sectigo, GlobalSign, GeoTrust, ...) — never by a single internal CA across all destinations.
The NGFW typically performs source PAT (NAT Overload) on the egress link, mapping thousands of internal source IPs to a single public IP.

The current arrangement undermines the defensibility of the Test Run Report in customer engagements: a critical reviewer can argue that the measured TLS decryption capacity reflects an NGFW operating as an L3 access switch, not as a perimeter inspection device. The ARCHITECTURE.md diagram itself shows the NGFW terminating dozens of dot1q sub-interfaces — work that real NGFWs do not do, inflating CPU costs vs. the production scenario.

A second gap, surfaced during planning: production NGFW capacity is shaped by two largely independent CPU costs — TLS crypto and NAT translation. Customers frequently misattribute throughput limits to decrypt when the real bottleneck is NAT, or vice-versa. A test that runs only the production configuration cannot decompose these two costs.

This ADR records the decisions for the Public-Internet Realism transformation that addresses both gaps.

Decision

Network topology

Third-party public IPs are used for the delivery LANs, an exposure accepted only under a rigorous air-gap. The lab is fully isolated; before each test run every prefix is checked for WHOIS country match, a clean DNSBL listing, and GeoIP consistency.
NGFW outside ↔ ISP Router link is a public /30 (default edge_country=BR, RNP 200.130.0.0/30 candidate). The NGFW outside interface is <edge>/30 .1; the ISP Router's NGFW-facing (inside) interface is .2.
NGFW default route: 0.0.0.0/0 → <edge>/30 .2 (the ISP Router).
ISP Router is a VyOS pod (4-layer air-gap: physical isolation, default → null0 blackhole, NetworkPolicy default-deny, zero external BGP peers). Routing is fully static — no BGP, no OSPF.
VyOS return-path routes: 172.16.0.0/16 → <edge>/30 .1 and 172.17.0.0/16 → <edge>/30 .1 always present, allowing No-NAT scenarios to work without changing VyOS config between runs.
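The addressing and static routes in this list can be derived mechanically from the edge prefix. The following is a minimal sketch, assuming the candidate prefix 200.130.0.0/30 named above; the function name edge_link_plan and the dictionary keys are illustrative and not part of the platform.

```python
import ipaddress


def edge_link_plan(edge_prefix: str) -> dict:
    """Derive point-to-point addresses and static routes from the edge /30."""
    link = ipaddress.ip_network(edge_prefix)
    if link.prefixlen != 30:
        raise ValueError("edge link must be a /30")
    # A /30 has exactly two usable hosts: .1 (NGFW side) and .2 (ISP Router side).
    ngfw, isp = list(link.hosts())
    return {
        "ngfw_addr": str(ngfw),
        "isp_addr": str(isp),
        # NGFW default route points at the ISP Router.
        "ngfw_default_route": ("0.0.0.0/0", str(isp)),
        # VyOS return-path routes for the two agent subnets point back at the NGFW.
        "vyos_return_routes": [
            ("172.16.0.0/16", str(ngfw)),
            ("172.17.0.0/16", str(ngfw)),
        ],
    }


plan = edge_link_plan("200.130.0.0/30")
```

Because the return-path routes depend only on the edge prefix, the same plan holds for both NAT modes, which is why the VyOS configuration does not need to change between runs.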

Persona stack

100 personas total, distributed as 5 per country across 20 countries chosen by a combined GDP + population score: US, CN, IN, DE, JP, GB, FR, IT, BR, CA, RU, MX, AU, ES, KR, ID, NL, SA, TR, NG.
Single unified persona stack. The legacy distinction between "Synthetic" and "Cloned" tiers is retired. Both are managed under the same personas/ source-of-truth, the same persona-{name} namespace pattern, and the same Helm chart. Cloned becomes the 5th archetype (cloned) alongside skin, mock, har-replay, real-app.
Archetype distribution across the 100 personas:
- skin × 30 (Caddy file_server)
- mock × 15 (Caddy + Go mock-engine sidecar)
- har-replay × 10 (Caddy + Go HAR replay sidecar)
- real-app × 5 (Saleor / Ghost / similar; heavyweight)
- cloned × 40 (Caddy file_server over Cloner-managed PVC)
VLAN IDs are logical-only. 20 country LANs use VLAN IDs 101–120 as labels in manifests; no dot1q tagging propagates to the Nexus 9000. The persona stack lives entirely intra-cluster.
Multus macvlan in mode: private for each country LAN. The 5 pods of a country share the macvlan parent but are not L2-reachable to each other directly — gateway is the VyOS pod's country-attachment NIC.
Co-location enforced. VyOS pod and all 100 persona pods are pinned to the same UCS node (role=ngfw-dut) via nodeSelector + podAffinity. Required because Multus macvlan is node-local; cross-node L2 between VyOS attachments and persona pods is not provided.
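The counts in this section (20 countries, 5 personas each, five archetypes summing to 100, VLAN labels 101–120) can be cross-checked with a short script. This is a sketch under stated assumptions: mapping VLAN labels to countries in list order, assigning archetypes to countries sequentially, and the persona-{country}-{slot} names are all placeholders, since the ADR does not define them.

```python
from collections import Counter

COUNTRIES = ["US", "CN", "IN", "DE", "JP", "GB", "FR", "IT", "BR", "CA",
             "RU", "MX", "AU", "ES", "KR", "ID", "NL", "SA", "TR", "NG"]
ARCHETYPES = {"skin": 30, "mock": 15, "har-replay": 10, "real-app": 5, "cloned": 40}
PER_COUNTRY = 5
FIRST_VLAN = 101  # VLAN IDs are labels only; nothing is tagged on the wire


def build_inventory() -> list[dict]:
    """Expand the archetype counts into one record per persona."""
    pool = [name for name, count in ARCHETYPES.items() for _ in range(count)]
    if len(pool) != PER_COUNTRY * len(COUNTRIES):
        raise ValueError("archetype counts must sum to the persona total")
    inventory = []
    for idx, country in enumerate(COUNTRIES):
        for slot in range(PER_COUNTRY):
            inventory.append({
                # Hypothetical naming; the ADR only fixes the persona-{name} pattern.
                "namespace": f"persona-{country.lower()}-{slot + 1}",
                "country": country,
                "vlan_label": FIRST_VLAN + idx,  # assumed: list order maps to 101..120
                "archetype": pool[idx * PER_COUNTRY + slot],  # placeholder assignment
            })
    return inventory


inventory = build_inventory()
archetype_counts = Counter(p["archetype"] for p in inventory)
```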

Agent stack

Agent subnets preserved: 172.16.0.0/16 (Playwright, VLAN 20) and 172.17.0.0/16 (K6, VLAN 30). The NGFW inside interface continues as their gateway.
Agent default route changed to NGFW. The Multus NADs dut-pw and dut-k6 no longer inject the specific routes 10.1.0.0/16 and 10.2.0.0/16; they inject 0.0.0.0/0 → <NGFW gateway>, and the pod annotation declares the NAD as the default route. Cluster service reachability is preserved via kube-proxy DNAT (intercepts before FIB lookup).
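For illustration, the pod annotation that marks a NAD as the default route can be generated as below. This sketch relies on the "default-route" key of the Multus network-selection annotation; the gateway address 172.16.0.1 and the helper name default_route_annotation are assumptions, as the ADR only refers to <NGFW gateway>.

```python
import json


def default_route_annotation(nad_name: str, gateway: str) -> dict:
    """Build the pod annotation that makes one Multus attachment the default route."""
    selection = [{"name": nad_name, "default-route": [gateway]}]
    # Multus reads the selection as a JSON string stored under this annotation key.
    return {"k8s.v1.cni.cncf.io/networks": json.dumps(selection)}


# Assumed gateway address for the Playwright agent subnet, for illustration only.
pw_annotation = default_route_annotation("dut-pw", "172.16.0.1")
```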

NAT and decrypt matrix

Two NGFW NAT modes are exercised, configurable per test plan:
- nat_mode: pat (NGFW source-PAT with NAT Overload tuning — xlate-timeout, tcp-halfclose-timer, nat-protocol-table-size per vendor)
- nat_mode: disabled (pure routed; VyOS return-path routes make this work without config drift)
Two decrypt modes are exercised, configurable per test plan: decrypt_required: on and decrypt_required: off.
2×2 quadrant matrix of canonical capacity tests:
- Q1 — no NAT, no decrypt (theoretical max throughput)
- Q2 — no NAT, decrypt (isolates crypto cost)
- Q3 — NAT, no decrypt (isolates NAT cost)
- Q4 — NAT + decrypt (production scenario)
A MATRIX orchestrator plan runs Q1→Q2→Q3→Q4 sequentially and emits a single consolidated report with quadrant decomposition.
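The quadrant decomposition can be sketched as follows. The ADR does not give the formulas, so the definitions here are assumptions: each overhead is the throughput lost relative to Q1, in percent, and synergy is the part of the combined Q4 loss that the two isolated losses do not explain. The throughput figures are hypothetical.

```python
from itertools import product

# Quadrant order Q1..Q4 as listed above: NAT mode varies slowest.
QUADRANTS = {
    f"Q{i}": {"nat_mode": nat, "decrypt_required": dec}
    for i, (nat, dec) in enumerate(product(("disabled", "pat"), ("off", "on")), start=1)
}


def decompose(throughput: dict) -> dict:
    """Split the Q1-to-Q4 throughput loss into NAT, decrypt, and interaction terms."""
    q1, q2, q3, q4 = (throughput[q] for q in ("Q1", "Q2", "Q3", "Q4"))
    if q1 <= 0:
        raise ValueError("Q1 throughput must be positive")
    decrypt_pct = 100.0 * (q1 - q2) / q1   # Q2 isolates crypto cost
    nat_pct = 100.0 * (q1 - q3) / q1       # Q3 isolates NAT cost
    combined_pct = 100.0 * (q1 - q4) / q1  # Q4 is the production scenario
    return {
        "nat_overhead_pct": nat_pct,
        "decrypt_overhead_pct": decrypt_pct,
        "combined_overhead_pct": combined_pct,
        # Negative: the costs overlap; positive: they compound.
        "synergy_pct": combined_pct - (nat_pct + decrypt_pct),
    }


# Hypothetical throughput in Gbps, for illustration only.
sample = decompose({"Q1": 10.0, "Q2": 6.0, "Q3": 9.0, "Q4": 5.2})
```

A synergy term near zero would suggest the two costs are additive, which is the "largely independent" case described in the Context section.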

PKI

Five sub-CAs simulating the global public-CA market — mock-letsencrypt-ca, mock-digicert-ca, mock-sectigo-ca, mock-globalsign-ca, mock-geotrust-ca — all chained off a single internal persona-internal-root-CA.
Three-level chain (Root → Intermediate → Leaf) matching real-world PKI structure.
Algorithm diversity per CA (LE-mock=ECDSA P-256, DigiCert-mock=RSA-2048, Sectigo-mock=RSA-2048, GlobalSign-mock=ECDSA P-384, GeoTrust-mock=RSA-2048).
Validity mix (LE-mock = 90 days, others = 1 year) — cert-manager rotates on schedule.
DV + OV cert types (real-app personas get OV, others DV).
Weighted-random allocation — the CA that signs each persona is picked by weighted random matching real CT-log market share (LE ~55%, DigiCert ~20%, Sectigo ~10%, GlobalSign ~5%, GeoTrust ~10%). Allocation is independent of country — production sites in any country use any CA.
NGFW imports a single root CA (persona-internal-root-CA) — all chains validate from that one root.
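The weighted-random allocation can be sketched with the standard library. The weights are the approximate market shares listed above; the fixed seed (used here so that repeated runs give the same assignment) and the persona names are assumptions, not something the ADR specifies.

```python
import random

# Approximate CT-log market shares from the list above, in percent.
CA_WEIGHTS = {
    "mock-letsencrypt-ca": 55,
    "mock-digicert-ca": 20,
    "mock-sectigo-ca": 10,
    "mock-globalsign-ca": 5,
    "mock-geotrust-ca": 10,
}


def allocate_cas(persona_names: list[str], seed: int = 7) -> dict[str, str]:
    """Pick a signing sub-CA per persona by weighted random choice.

    Country is deliberately not an input: allocation is independent of it.
    """
    rng = random.Random(seed)  # assumed: seeded for reproducible allocations
    issuers = list(CA_WEIGHTS)
    weights = list(CA_WEIGHTS.values())
    picks = rng.choices(issuers, weights=weights, k=len(persona_names))
    return dict(zip(persona_names, picks))


# Hypothetical persona names, for illustration only.
names = [f"persona-{i:03d}" for i in range(100)]
allocation = allocate_cas(names)
```

With only 100 draws the realized split will deviate from the target shares, so a report on CA distribution should show observed counts rather than the configured weights.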

Operational source-of-truth

platform/network/public-ip-pool.yaml is the canonical source for all public-IP allocations: the /30 link, the 20 country /24s, fallback prefixes per country, validation timestamps, and DNS metadata. All generators (DNS records, NetworkPolicy egress allow-lists, NAD configs, cert SANs, VyOS routes) derive from this one file.
Best-effort prefix selection with mandatory pre-deploy validation. Initial pool is populated from research-and-education networks (NRENs) and stable hosting providers per country, marked with confidence (H/M/L). A pre-flight script (tools/validate-ip-pool.sh) re-runs on every deploy and on every test-run start: WHOIS country match, Spamhaus ZEN clean, GeoIP cross-check (≥3 sources), BGP advertisement status, PTR sample. Failure of any check on the primary aborts and prompts fallback selection — fallbacks pre-registered in the YAML.
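The abort-and-prompt behaviour of the pre-flight can be sketched as below. This is not the logic of tools/validate-ip-pool.sh, which the ADR does not show; the check identifiers, the PoolEntry and PreflightResult types, and the injected run_check callable are all hypothetical. The prefixes are RFC 5737 documentation ranges used as placeholders.

```python
from dataclasses import dataclass, field
from typing import Callable

# Names mirror the checks listed above; the identifiers themselves are illustrative.
CHECKS = ("whois_country", "spamhaus_zen", "geoip_consensus", "bgp_status", "ptr_sample")


@dataclass
class PoolEntry:
    country: str
    primary: str
    fallbacks: list[str] = field(default_factory=list)


@dataclass
class PreflightResult:
    country: str
    ok: bool
    failed_checks: list[str]
    fallback_candidates: list[str]


def preflight(entry: PoolEntry, run_check: Callable[[str, str], bool]) -> PreflightResult:
    """Validate the primary prefix; on any failure, abort and offer the fallbacks.

    No fallback is selected automatically: the caller is prompted to choose.
    """
    failed = [name for name in CHECKS if not run_check(name, entry.primary)]
    candidates = list(entry.fallbacks) if failed else []
    return PreflightResult(entry.country, not failed, failed, candidates)


def stub_check(name: str, prefix: str) -> bool:
    # Stand-in for real lookups: pretend the primary is listed on Spamhaus ZEN.
    return not (name == "spamhaus_zen" and prefix == "192.0.2.0/24")


result = preflight(PoolEntry("BR", "192.0.2.0/24", ["198.51.100.0/24"]), stub_check)
```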

Reporting

Test Run Report gains four new annexes:
- Annex F — Geo + CA distribution (per-country byte/flow split, CA chain distribution)
- Annex G — Air-gap attestation (4-layer evidence: physical, VyOS blackhole, NetworkPolicy, BGP attestation outputs)
- Annex H — NAT engine performance (only when nat_mode=pat: pool utilization, exhaustion events, translation timeout rate, CPU breakdown by engine)
- Annex I — Capacity quadrant decomposition (only on MATRIX plan runs: 2×2 chart, NAT-overhead %, decrypt-overhead %, synergy %, CPU breakdown side-by-side)
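The conditions attached to Annexes H and I can be expressed as a small selection function. Two points are assumptions rather than statements from the ADR: Annexes F and G are treated as unconditional, and Annex H is included whenever any quadrant of the run used nat_mode=pat. The function name is hypothetical.

```python
from typing import Iterable


def annexes_for_run(nat_modes_used: Iterable[str], matrix_plan: bool) -> list[str]:
    """Return the new annexes a Test Run Report would carry for one run."""
    modes = set(nat_modes_used)
    unknown = modes - {"pat", "disabled"}
    if unknown:
        raise ValueError(f"unknown nat_mode values: {sorted(unknown)}")
    annexes = ["F", "G"]  # assumed unconditional
    if "pat" in modes:
        annexes.append("H")  # NAT engine performance
    if matrix_plan:
        annexes.append("I")  # capacity quadrant decomposition
    return annexes


single_run = annexes_for_run(["disabled"], matrix_plan=False)
matrix_run = annexes_for_run(["disabled", "pat"], matrix_plan=True)
```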

Release split

The transformation is split into v4.2.0 and v4.3.0:
- v4.2.0 — Public-Internet Realism foundation (~60h): multi-CA infrastructure, public-IP migration, CoreDNS .lab zone, VyOS pod, Annexes F/G, NAT modes (Q1–Q4 individual plans, no orchestrator), pre-flight validation, default-route inversion.
- v4.3.0 — Persona scale-out + quadrant matrix (~49h): stack unification (Cloned as 5th archetype), persona scale 30→100, VLAN expansion 101–120, MATRIX plan orchestrator, Annex H/I, NAT engine monitoring (Grafana + Prometheus rules), expanded test catalog (-MATRIX-, -WITH-NAT, -NO-NAT variants).

Consequences

Defensibility (positive)

Test results no longer rest on an unrealistic NGFW-as-access-router topology. Customer audit narrative becomes: "Tested under production-like carrier-edge handoff with public IPs, geo-distributed across 20 countries, served by 5 distinct CA chains matching real-world public-CA market share." Hard to refute.
The MATRIX/Annex I quadrant decomposition is a capability that no competitor (Spirent CyberFlood, Ixia BreakingPoint) packages out of the box. It directly answers customer questions like "How much of my NGFW budget is spent on NAT vs decrypt?".
Air-gap attestation in Annex G turns a defensive necessity (we use third-party prefixes in lab) into a proactive forensic artifact (we prove no leak occurred).

Operational cost

Persona node (UCS with role=ngfw-dut) accumulates higher resource load: ~28 GB RAM and ~16 cores estimated for 100 personas + VyOS at peak. Existing single-node and small-cluster deployments need to be re-sized. Multi-node deployments are unaffected if UCS-1 is sized for this footprint.
Operations now require maintaining the public-IP pool as living data: the pre-flight check runs on every test-run start and may flag a prefix for swap if its WHOIS/BGP/reputation state changes.
DUT API templates expand significantly per vendor (FTD, FortiGate, Palo Alto, etc.) to handle NAT mode toggles, decrypt mode toggles, and the resulting matrix combinations. Each vendor adapter must support 4 configurations (matrix quadrants).

Architectural debt accepted

Co-location requirement. The VyOS-as-pod design with Multus macvlan is node-local. Cross-node persona deployments are not supported in this architecture. Future ADR may revisit this if a CNI upgrade (flannel → Cilium) is undertaken — Cilium ClusterMesh would unblock cross-node L2 — but that's a separate decision (potentially v5.0).
No dynamic routing. BGP/OSPF are intentionally excluded. This keeps the test-bed simple and reproducible but means we cannot exercise NGFW behavior under route convergence events or BGP redistribution. A future ADR could add a routing-protocol dimension if customer demand emerges.
Use of third-party public IPs accepted. Path B (chosen over Path A, "Cisco IT-allocated pool", and Path C, "go fully air-gapped with TEST-NET") is the explicit owner decision. The strict 4-layer air-gap and pre-flight validation are the controls that make this acceptable.

Out of scope (deferred)

NetFlow / IPFIX flow ledger (Annex J) is deferred to v4.4.0 as a dedicated release. The infrastructure for it (DUT API flow-export templates, goflow2 collector, Postgres hypertable) does not block v4.2 or v4.3.
Source-IP anonymization in published Annexes (for customer engagements where lab IPs may inadvertently match production data) is deferred to v4.4 alongside flow ledger.
GitHub Merge Queue is unrelated to this transformation and is blocked by the repo's plan tier (private repo on GitHub Pro account does not include Merge Queue). Not addressed by this ADR.

References

ADR 0001 — TLS 1.3 + cert validation (strengthened by multi-CA realism here)
ADR 0003 — Playwright + Chromium agent (agents preserved)
ADR 0006 — K6 load-test fleet (K6 agents preserved)
Memory: project_roadmap.md contains the active roadmap with v4.2/4.3/4.4 milestones.