NAT Testing Modes¶
Status: Foundation in v4.3 — ADR 0007 §16 (matrix) + §22 (orchestrator) + §26 (Annex H + I).
Scope status (post-Scope-Freeze 2026-05-10) — NAT testing modes integrate with the SDWAN/CoR-N.Art module (modules/sdwan-cor-art.md) for DIA (Direct Internet Access) scenarios. The legacy term "VPN-REMOTE" refers to the same data-plane leg as the SDWAN/CoR-N.Art module's local mode. Real cloud endpoint testing is available via the Cloud Endpoint Service (ADR 0023).
Why test both NAT modes¶
Production NGFW capacity is shaped by two largely independent CPU costs: TLS crypto and NAT translation. Customers frequently misattribute throughput limits to decrypt when the real bottleneck is NAT, or vice versa. A test that runs only the production configuration cannot decompose these two costs.
This test-bed exercises the NGFW under two NAT modes:
- nat_mode: pat — source PAT (NAT overload). The NGFW maintains a translation table and rewrites the source IP/port for every flow. Production-realistic; CPU- and memory-heavy.
- nat_mode: disabled — pure routed forwarding. The NGFW only inspects and forwards; no translation, no xlate table. Theoretical max throughput for the device under decrypt load.
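The state cost of the two modes can be sketched in a few lines of Python. This is an illustrative model only — the class and names below are hypothetical, not the DUT API or any vendor's NAT implementation — but it shows why pat mode consumes an xlate entry per new flow while disabled mode allocates nothing:

```python
# Illustrative model of source-PAT state cost; not any real NGFW implementation.
class PatTable:
    def __init__(self, public_ip: str, pool_size: int):
        self.public_ip = public_ip
        self.pool_size = pool_size   # finite xlate pool
        self.xlate = {}              # (src_ip, src_port) -> allocated public port
        self.next_port = 1024

    def translate(self, src_ip: str, src_port: int):
        key = (src_ip, src_port)
        if key not in self.xlate:
            if len(self.xlate) >= self.pool_size:
                # The failure mode the runbook alerts on: pool exhaustion.
                raise RuntimeError("pool exhaustion: translation dropped")
            self.xlate[key] = self.next_port
            self.next_port += 1
        # Every packet leaves with a rewritten source — the per-packet CPU cost.
        return (self.public_ip, self.xlate[key])

pat = PatTable("203.0.113.1", pool_size=2)
print(pat.translate("10.0.0.1", 40000))  # new flow: allocates an xlate entry
print(pat.translate("10.0.0.1", 40000))  # same flow: reuses the entry, no new state
```

In nat_mode: disabled there is no equivalent structure at all — the forwarding path touches no per-flow translation state, which is why it serves as the reference baseline.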
Each test plan declares which mode it requires. The DUT API (dashboard/src/lib/dut-api/) applies the corresponding vendor-specific config before the run starts and reverts at the end.
What changes inside the NGFW¶
| Aspect | nat_mode: pat | nat_mode: disabled |
|---|---|---|
| Translation table writes | Yes — every new flow consumes one xlate entry | None |
| Source/dest rewrites | Source IP + port rewrite per packet | Pass-through |
| Conntrack inflation | High — each NAT'd flow takes one conntrack entry | Low — inspection state only |
| CPU profile (typical) | NAT engine 15–25%, crypto 40–55%, inspection 15–20% | NAT engine 0%, crypto 50–60%, inspection 25–30% |
| Throughput penalty | 15–30% vs nat_mode: disabled baseline | Reference baseline |
| Failure mode | Pool exhaustion under stress (test results invalidated) | Connection limit / inspection backlog |
When to choose each mode¶
| Scenario | Recommended mode |
|---|---|
| Customer commissioning a new NGFW for capacity planning | Both — run the MATRIX plan (CAP-FIND-KNEE-MATRIX-2H) for full quadrant decomposition |
| Quick sanity check (NGFW config validity, decrypt working) | nat_mode: disabled (Q1 baseline) — fastest, fewest variables |
| Full production-mirror test | nat_mode: pat (Q4) |
| Isolating NAT engine vs crypto engine cost | nat_mode: disabled + decrypt: on (Q2) AND nat_mode: pat + decrypt: off (Q3); compare deltas |
| Testing NAT pool sizing decision | nat_mode: pat with reduced pool — verify exhaustion triggers expected alerts |
The 2×2 quadrant matrix¶
The MATRIX plan (CAP-FIND-KNEE-MATRIX-2H) orchestrates four 30-minute sub-runs sequentially:
```
             WITHOUT NAT                  WITH NAT
        ┌─────────────────────────┬─────────────────────────┐
WITHOUT │ Q1  Baseline            │ Q3  NAT-only            │
DECRYPT │     raw forward         │     isolates NAT cost   │
        ├─────────────────────────┼─────────────────────────┤
WITH    │ Q2  Decrypt-only        │ Q4  Production          │
DECRYPT │     isolates crypto cost│     NAT + decrypt       │
        └─────────────────────────┴─────────────────────────┘
```
The result is a single Test Run Report containing Annex I — Capacity Quadrant Decomposition with delta arithmetic (NAT alone, decrypt alone, combined, synergy bonus) and customer-facing optimization guidance.
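The delta arithmetic in Annex I is plain subtraction over the four quadrant throughputs. A minimal sketch with made-up numbers (illustrative only, not results from any run):

```python
# Hypothetical throughputs (Gbit/s) from the four quadrant runs; numbers are
# invented for illustration, not real measurements.
q1_baseline = 10.0   # no NAT, no decrypt (reference)
q2_decrypt  = 6.0    # decrypt only
q3_nat      = 8.0    # NAT only
q4_prod     = 4.5    # NAT + decrypt (production)

decrypt_cost = q1_baseline - q2_decrypt              # decrypt alone: 4.0
nat_cost     = q1_baseline - q3_nat                  # NAT alone: 2.0
combined     = q1_baseline - q4_prod                 # both together: 5.5
synergy      = combined - (decrypt_cost + nat_cost)  # interaction term: -0.5

print(f"NAT: {nat_cost}, decrypt: {decrypt_cost}, "
      f"combined: {combined}, synergy: {synergy}")
```

Synergy near zero indicates NAT and decrypt draw on largely independent CPU paths; a large positive synergy would mean the combination costs more than the sum of the individual penalties.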
See dashboard/templates/annex-i-quadrant-decomposition.md.tmpl for the full annex layout.
Vendor-specific tuning (nat_pool_overload: tuned)¶
When a plan declares nat_pool_overload: tuned, the DUT API applies these tunings per vendor before the run starts. Templates live in dashboard/src/lib/dut-api/<vendor>/templates/nat_pat.tmpl.
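As a rough sketch of how per-vendor template selection might work: the directory layout below matches the dashboard/src/lib/dut-api/&lt;vendor&gt;/templates/ path from this doc, but the function, the vendor directory names, and the tuned-template filename are illustrative assumptions, not the real DUT API code.

```python
# Hypothetical per-vendor NAT template dispatch; vendor keys and the
# "nat_pat_tuned.tmpl" filename are assumptions for illustration.
from pathlib import Path

VENDORS = {"cisco-ftd", "fortinet", "panos"}  # assumed vendor directory names

def nat_template_path(vendor: str, nat_pool_overload: str = "default") -> Path:
    if vendor not in VENDORS:
        raise ValueError(f"unknown vendor: {vendor}")
    # Plans declaring nat_pool_overload: tuned get a tuned variant of the template.
    name = "nat_pat_tuned.tmpl" if nat_pool_overload == "tuned" else "nat_pat.tmpl"
    return Path("dashboard/src/lib/dut-api") / vendor / "templates" / name

print(nat_template_path("fortinet", "tuned").as_posix())
```

Whatever the real mechanism, the chosen template's identity is pinned in the report (Annex H §H.1, vendor_template_id + vendor_template_sha256), so the selection is auditable after the fact.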
Cisco FTD (cdFMC / FMC)¶
| Setting | Tuned value | Why |
|---|---|---|
| xlate-timeout | 180s (vs default 3600s) | Returns translation entries to the pool faster on stress runs |
| tcp-halfclose-timer | 15s (vs default 60s) | Releases pending TCP RST/FIN flows quickly |
| nat-protocol-table-size | Per-platform max (e.g., 4M on FTD 4115) | Maximizes pool capacity |
| nat-rule-prioritization | Enabled | Avoids pool fragmentation under heavy churn |
Fortinet FortiGate¶
| Setting | Tuned value | Why |
|---|---|---|
| tcp-timewait-timer | 30 | Releases TIME_WAIT sockets quickly |
| tcp-halfclose-timer | 15 | Releases pending TCP closes quickly |
| port-block-allocation | enable | Reduces pool exhaustion at high CPS |
| central-snat-policy | One wide-net rule (vs many specific rules) | Reduces lookup overhead |
Palo Alto Networks (PAN-OS)¶
| Setting | Tuned value | Why |
|---|---|---|
| dynamic-ip-and-port translation timeout | 180s | Aggressive recycling |
| TCP half-closed timeout | 15s | Same rationale as FTD |
| Oversubscription | Enabled, rate 4 | Allows 4× more concurrent NAT translations per public IP |
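Back-of-envelope arithmetic shows what an oversubscription rate of 4 buys. The public-IP count and the usable-port range below are assumptions chosen for the example, not measured values or PAN-OS internals:

```python
# Rough PAT pool capacity arithmetic; IP count and port range are illustrative.
public_ips = 2
usable_ports_per_ip = 65535 - 1024   # approximate ephemeral port range per IP
oversubscription_rate = 4            # the tuned value from the table above

base_capacity = public_ips * usable_ports_per_ip         # translations without tuning
tuned_capacity = base_capacity * oversubscription_rate   # 4x more concurrent entries
print(base_capacity, tuned_capacity)
```

The multiplier only helps concurrency; it does not change per-packet rewrite cost, so CPU profiles in Annex H §H.3 are unaffected by this setting.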
The actual values used in any specific run are recorded in Annex H — NAT Engine Performance, sub-section H.1 (vendor_template_id + vendor_template_sha256), so reviewers can verify the tuning was applied as documented.
Reading the alerts (runbook anchors)¶
The Prometheus alerts in k8s/dut/96-nat-engine-rules.yaml carry runbook_url annotations pointing at this document. Each anchor below is the runbook for the corresponding alert.
Pool utilization warning¶
Triggered by NATPoolUtilizationHigh when pool utilization stays above 80% for a sustained 30s.
What it means: NAT pool is approaching exhaustion under current load.
Operator response:
1. Check the NAT engine dashboard (Grafana → dut-nat-engine) for trajectory — is utilization climbing or stable?
2. If climbing and run is long-running (soak / 8h): consider stopping the run and re-running with smaller agent fleet.
3. If stable around 80–90% throughout the test: capacity claim is valid but the customer should know this NGFW model would exhaust at slightly higher load.
Tuning option: lower xlate-timeout to 60s (Cisco FTD) or tcp-timewait-timer to 15s (Fortinet) — reclaims pool entries faster.
Pool exhaustion¶
Triggered by NATPoolExhaustion when any exhaustion event occurs in last 1m.
What it means: NGFW silently dropped at least one translation request because the pool was full. Test results are invalidated — the test bed is measuring failure, not capacity.
Operator response:
1. Immediately stop the run if it is still active.
2. Annex H of the report will mark this run as invalidated, with the exhaustion event count.
3. Increase the NAT pool size in the DUT API NAT template (vendor-specific).
4. Re-run with the corrected configuration.
Root cause guidance: 1000 synthetic-load agents × ~10 connections/sec × 30 min ≈ 18M flows; if your pool has < 4M entries with xlate-timeout: 3600s, no entry ages out within the run and exhaustion is mathematically inevitable. The nat_pool_overload: tuned template addresses this — verify it was applied (Annex H §H.1).
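The root-cause arithmetic above can be checked directly (values taken from this runbook; the tuned steady-state estimate assumes flow arrival is roughly uniform):

```python
# Reproduces the exhaustion arithmetic from the root-cause guidance above.
agents = 1000          # synthetic-load agents
conns_per_sec = 10     # approximate, per agent
run_seconds = 30 * 60

total_flows = agents * conns_per_sec * run_seconds  # flows created over the run
pool_entries = 4_000_000                            # e.g. FTD 4115 table max
default_timeout = 3600                              # default xlate-timeout (s)
tuned_timeout = 180                                 # nat_pool_overload: tuned value

# With the default timeout, nothing ages out within a 30-minute run, so every
# flow still holds its entry at the end: 18M flows vs a 4M pool.
exhaustion_default = default_timeout >= run_seconds and total_flows > pool_entries

# With the tuned timeout, steady-state occupancy is bounded by
# new-flow rate x timeout, which fits comfortably in the pool.
steady_state_tuned = agents * conns_per_sec * tuned_timeout

print(total_flows, exhaustion_default, steady_state_tuned)
```

This is why the tuned template's 180s xlate-timeout alone is usually enough to keep a 4M-entry pool out of exhaustion at this load profile.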
Translation timeout spike¶
Triggered by NATTranslationTimeoutSpike when translation timeout rate > 100/sec sustained 1m.
What it means: Many flows are being aged out by the xlate-timeout rather than naturally closed (RST/FIN). Indicates short-lived flows holding xlate entries past their natural lifetime.
Operator response:
1. This alert does not invalidate the run; its severity is informational.
2. Tune NAT template for next run:
- Cisco FTD: lower xlate-timeout to 60s (matches synthetic-load engine burst pattern)
- Fortinet: lower tcp-halfclose-timer to 15s (releases pending closes faster)
3. Re-run; expect measurably higher CPS without retuning agents.
How to interpret Annex H + I in the report¶
Annex H — NAT Engine Performance (rendered for nat_mode: pat runs)¶
- §H.1 Configuration applied — the NAT template + tuning values used. Verify these match what you intended (and what the customer's production NGFW uses).
- §H.2 Runtime metrics — pool peak utilization is the headline number. Anything above 80% is yellow flag; exhaustion events > 0 is red flag.
- §H.3 Capacity correlation — CPU breakdown reveals where the engine is constrained. NAT slice >25% means NAT is a meaningful bottleneck; <10% means NAT is essentially free for this NGFW.
Annex I — Capacity Quadrant Decomposition (rendered for type: matrix runs)¶
- §I.2 Decomposition table — the four numbers (NAT alone, decrypt alone, combined, synergy) are the customer's main answer. Synergy near zero means NAT and decrypt are orthogonal CPU paths.
- §I.3 Where the CPU goes — per-quadrant CPU breakdown. This is the slide the customer's network architect will screenshot.
- §I.4 Customer guidance — ranked optimization paths. Use this verbatim in the customer-facing briefing.
Plan reference¶
| Plan identifier | Mode | Duration | Use case |
|---|---|---|---|
| CAP-FIND-KNEE-30M-Q1 | nat: disabled, decrypt: off | 30 min | Baseline; first run of any engagement |
| CAP-FIND-KNEE-30M-Q2 | nat: disabled, decrypt: on | 30 min | Decrypt cost in isolation |
| CAP-FIND-KNEE-30M-Q3 | nat: pat, decrypt: off | 30 min | NAT cost in isolation |
| CAP-FIND-KNEE-30M-Q4 | nat: pat, decrypt: on | 30 min | Production scenario |
| CAP-FIND-KNEE-MATRIX-2H | All four sequentially | 2 hours | Full quadrant decomposition |
See platform/test-plans/catalog.yaml for the canonical definitions.
Related¶
- docs/REPORTS.md — Test Run Reports overview
- docs/TEST_PLANS.md — Test plan catalog reference
- docs/MONITORING_TEST_VALIDITY.md — Alerts that prove the test bed itself was healthy
- dashboard/templates/annex-h-nat-engine.md.tmpl — Annex H template
- dashboard/templates/annex-i-quadrant-decomposition.md.tmpl — Annex I template
- observability/grafana/dashboards/dut-nat-engine.json — live dashboard during runs
- k8s/dut/96-nat-engine-rules.yaml — Prometheus alerts referenced above