NAT Testing Modes¶
Status: Foundation in v4.3 — ADR 0007 §16 (matrix) + §22 (orchestrator) + §26 (Annex H + I).
Scope status (post-Scope-Freeze 2026-05-10) — NAT testing modes integrate with the SDWAN/CoR-N.Art module (modules/sdwan-cor-art.md) for DIA (Direct Internet Access) scenarios. The legacy term "VPN-REMOTE" refers to the same data-plane leg as the SDWAN/CoR-N.Art module's local mode. Real cloud endpoint testing is available via the Cloud Endpoint Service (ADR 0023).
Why test both NAT modes¶
Production NGFW capacity is shaped by two largely independent CPU costs: TLS crypto and NAT translation. Customers frequently misattribute throughput limits to decrypt when the real bottleneck is NAT, or vice versa. A test that runs only the production configuration cannot decompose these two costs.
This test-bed exercises the NGFW under two NAT modes:
- nat_mode: pat — source PAT (NAT overload). The NGFW maintains a translation table and rewrites the source IP/port for every flow. Production-realistic; CPU- and memory-heavy.
- nat_mode: disabled — pure routed forwarding. The NGFW only inspects and forwards; no translation, no xlate table. Theoretical max throughput for the device under decrypt load.
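The state cost of the two modes can be sketched in a few lines of Python. This is an illustrative model only — the class and names below are hypothetical, not the DUT API or any vendor's NAT implementation — but it shows why pat mode consumes an xlate entry per new flow while disabled mode allocates nothing:

```python
# Illustrative model of source-PAT state cost; not any real NGFW implementation.
class PatTable:
    def __init__(self, public_ip: str, pool_size: int):
        self.public_ip = public_ip
        self.pool_size = pool_size   # finite xlate pool
        self.xlate = {}              # (src_ip, src_port) -> allocated public port
        self.next_port = 1024

    def translate(self, src_ip: str, src_port: int):
        key = (src_ip, src_port)
        if key not in self.xlate:
            if len(self.xlate) >= self.pool_size:
                # The failure mode the runbook alerts on: pool exhaustion.
                raise RuntimeError("pool exhaustion: translation dropped")
            self.xlate[key] = self.next_port
            self.next_port += 1
        # Every packet leaves with a rewritten source — the per-packet CPU cost.
        return (self.public_ip, self.xlate[key])

pat = PatTable("203.0.113.1", pool_size=2)
print(pat.translate("10.0.0.1", 40000))  # new flow: allocates an xlate entry
print(pat.translate("10.0.0.1", 40000))  # same flow: reuses the entry, no new state
```

In nat_mode: disabled there is no equivalent structure at all — the forwarding path touches no per-flow translation state, which is why it serves as the reference baseline.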
Each test plan declares which mode it requires. The DUT API (dashboard/src/lib/dut-api/) applies the corresponding vendor-specific config before the run starts and reverts at the end.
What changes inside the NGFW¶
| Aspect | nat_mode: pat | nat_mode: disabled |
|---|---|---|
| Translation table writes | Yes — every new flow consumes one xlate entry | None |
| Source/dest rewrites | Source IP + port rewrite per packet | Pass-through |
| Conntrack inflation | High — each NAT'd flow takes one conntrack entry | Low — inspection state only |
| CPU profile (typical) | NAT engine 15–25%, crypto 40–55%, inspection 15–20% | NAT engine 0%, crypto 50–60%, inspection 25–30% |
| Throughput penalty | 15–30% vs nat_mode: disabled baseline | Reference baseline |
| Failure mode | Pool exhaustion under stress (test results invalidated) | Connection limit / inspection backlog |
When to choose each mode¶
| Scenario | Recommended mode |
|---|---|
| Customer commissioning a new NGFW for capacity planning | Both — run the MATRIX plan (CAP-FIND-KNEE-MATRIX-2H) for full quadrant decomposition |
| Quick sanity check (NGFW config validity, decrypt working) | nat_mode: disabled (Q1 baseline) — fastest, fewest variables |
| Full production-mirror test | nat_mode: pat (Q4) |
| Isolating NAT engine vs crypto engine cost | nat_mode: disabled + decrypt: on (Q2) AND nat_mode: pat + decrypt: off (Q3); compare deltas |
| Testing NAT pool sizing decision | nat_mode: pat with reduced pool — verify exhaustion triggers expected alerts |
The 2×2 quadrant matrix¶
The MATRIX plan (CAP-FIND-KNEE-MATRIX-2H) orchestrates four 30-minute sub-runs sequentially:
```
             WITHOUT NAT                  WITH NAT
        ┌─────────────────────────┬─────────────────────────┐
WITHOUT │ Q1  Baseline            │ Q3  NAT-only            │
DECRYPT │     raw forward         │     isolates NAT cost   │
        ├─────────────────────────┼─────────────────────────┤
WITH    │ Q2  Decrypt-only        │ Q4  Production          │
DECRYPT │     isolates crypto cost│     NAT + decrypt       │
        └─────────────────────────┴─────────────────────────┘
```
The result is a single Test Run Report containing Annex I — Capacity Quadrant Decomposition with delta arithmetic (NAT alone, decrypt alone, combined, synergy bonus) and customer-facing optimization guidance.
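The delta arithmetic in Annex I is plain subtraction over the four quadrant throughputs. A minimal sketch with made-up numbers (illustrative only, not results from any run):

```python
# Hypothetical throughputs (Gbit/s) from the four quadrant runs; numbers are
# invented for illustration, not real measurements.
q1_baseline = 10.0   # no NAT, no decrypt (reference)
q2_decrypt  = 6.0    # decrypt only
q3_nat      = 8.0    # NAT only
q4_prod     = 4.5    # NAT + decrypt (production)

decrypt_cost = q1_baseline - q2_decrypt              # decrypt alone: 4.0
nat_cost     = q1_baseline - q3_nat                  # NAT alone: 2.0
combined     = q1_baseline - q4_prod                 # both together: 5.5
synergy      = combined - (decrypt_cost + nat_cost)  # interaction term: -0.5

print(f"NAT: {nat_cost}, decrypt: {decrypt_cost}, "
      f"combined: {combined}, synergy: {synergy}")
```

Synergy near zero indicates NAT and decrypt draw on largely independent CPU paths; a large positive synergy would mean the combination costs more than the sum of the individual penalties.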
See dashboard/templates/annex-i-quadrant-decomposition.md.tmpl for the full annex layout.
Vendor-specific tuning (nat_pool_overload: tuned)¶
When a plan declares nat_pool_overload: tuned, the DUT API applies these tunings per vendor before the run starts. Templates live in dashboard/src/lib/dut-api/<vendor>/templates/nat_pat.tmpl.
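As a rough sketch of how per-vendor template selection might work: the directory layout below matches the dashboard/src/lib/dut-api/&lt;vendor&gt;/templates/ path from this doc, but the function, the vendor directory names, and the tuned-template filename are illustrative assumptions, not the real DUT API code.

```python
# Hypothetical per-vendor NAT template dispatch; vendor keys and the
# "nat_pat_tuned.tmpl" filename are assumptions for illustration.
from pathlib import Path

VENDORS = {"cisco-ftd", "fortinet", "panos"}  # assumed vendor directory names

def nat_template_path(vendor: str, nat_pool_overload: str = "default") -> Path:
    if vendor not in VENDORS:
        raise ValueError(f"unknown vendor: {vendor}")
    # Plans declaring nat_pool_overload: tuned get a tuned variant of the template.
    name = "nat_pat_tuned.tmpl" if nat_pool_overload == "tuned" else "nat_pat.tmpl"
    return Path("dashboard/src/lib/dut-api") / vendor / "templates" / name

print(nat_template_path("fortinet", "tuned").as_posix())
```

Whatever the real mechanism, the chosen template's identity is pinned in the report (Annex H §H.1, vendor_template_id + vendor_template_sha256), so the selection is auditable after the fact.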
Cisco FTD (cdFMC / FMC)¶
| Setting | Tuned value | Why |
|---|---|---|
| xlate-timeout | 180s (vs default 3600s) | Returns translation entries to the pool faster on stress runs |
| tcp-halfclose-timer | 15s (vs default 60s) | Releases pending TCP RST/FIN flows quickly |
| nat-protocol-table-size | Per-platform max (e.g., 4M on FTD 4115) | Maximizes pool capacity |
| nat-rule-prioritization | Enabled | Avoids pool fragmentation under heavy churn |
Fortinet FortiGate¶
| Setting | Tuned value | Why |
|---|---|---|
| tcp-timewait-timer | 30 | Releases TIME_WAIT sockets quickly |
| tcp-halfclose-timer | 15 | Releases pending TCP closes quickly |
| port-block-allocation | enable | Reduces pool exhaustion at high CPS |
| central-snat-policy | One wide-net rule (vs many specific rules) | Reduces lookup overhead |
Palo Alto Networks (PAN-OS)¶
| Setting | Tuned value | Why |
|---|---|---|
| dynamic-ip-and-port translation timeout | 180s | Aggressive recycling |
| TCP half-closed timeout | 15s | Same rationale as FTD |
| Oversubscription | Enabled, rate 4 | Allows 4× more concurrent NAT translations per public IP |
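Back-of-envelope arithmetic shows what an oversubscription rate of 4 buys. The public-IP count and the usable-port range below are assumptions chosen for the example, not measured values or PAN-OS internals:

```python
# Rough PAT pool capacity arithmetic; IP count and port range are illustrative.
public_ips = 2
usable_ports_per_ip = 65535 - 1024   # approximate ephemeral port range per IP
oversubscription_rate = 4            # the tuned value from the table above

base_capacity = public_ips * usable_ports_per_ip         # translations without tuning
tuned_capacity = base_capacity * oversubscription_rate   # 4x more concurrent entries
print(base_capacity, tuned_capacity)
```

The multiplier only helps concurrency; it does not change per-packet rewrite cost, so CPU profiles in Annex H §H.3 are unaffected by this setting.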
The actual values used in any specific run are recorded in Annex H — NAT Engine Performance, sub-section H.1 (vendor_template_id + vendor_template_sha256), so reviewers can verify the tuning was applied as documented.
Reading the alerts (runbook anchors)¶
The Prometheus alerts in k8s/dut/96-nat-engine-rules.yaml carry runbook_url annotations pointing at this document. Each anchor below is the runbook for the corresponding alert.
Pool utilization warning¶
Triggered by NATPoolUtilizationHigh when pool utilization stays above 80% for a sustained 30s.
What it means: NAT pool is approaching exhaustion under current load.
Operator response:
1. Check the NAT engine dashboard (Grafana → dut-nat-engine) for trajectory — is utilization climbing or stable?
2. If climbing and run is long-running (soak / 8h): consider stopping the run and re-running with smaller agent fleet.
3. If stable around 80–90% throughout the test: capacity claim is valid but the customer should know this NGFW model would exhaust at slightly higher load.
Tuning option: lower xlate-timeout to 60s (Cisco FTD) or tcp-timewait-timer to 15s (Fortinet) — reclaims pool entries faster.
Pool exhaustion¶
Triggered by NATPoolExhaustion when any exhaustion event occurs in last 1m.
What it means: NGFW silently dropped at least one translation request because the pool was full. Test results are invalidated — the test bed is measuring failure, not capacity.
Operator response:
1. Immediately stop the run if it is still active.
2. Annex H of the report will mark this run as invalidated, with the exhaustion event count.
3. Increase the NAT pool size in the DUT API NAT template (vendor-specific).
4. Re-run with the corrected configuration.
Root cause guidance: 1000 synthetic-load agents × ~10 connections/sec × 30 min ≈ 18M flows; if your pool has < 4M entries with xlate-timeout: 3600s, no entry ages out within the run and exhaustion is mathematically inevitable. The nat_pool_overload: tuned template addresses this — verify it was applied (Annex H §H.1).
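The root-cause arithmetic above can be checked directly (values taken from this runbook; the tuned steady-state estimate assumes flow arrival is roughly uniform):

```python
# Reproduces the exhaustion arithmetic from the root-cause guidance above.
agents = 1000          # synthetic-load agents
conns_per_sec = 10     # approximate, per agent
run_seconds = 30 * 60

total_flows = agents * conns_per_sec * run_seconds  # flows created over the run
pool_entries = 4_000_000                            # e.g. FTD 4115 table max
default_timeout = 3600                              # default xlate-timeout (s)
tuned_timeout = 180                                 # nat_pool_overload: tuned value

# With the default timeout, nothing ages out within a 30-minute run, so every
# flow still holds its entry at the end: 18M flows vs a 4M pool.
exhaustion_default = default_timeout >= run_seconds and total_flows > pool_entries

# With the tuned timeout, steady-state occupancy is bounded by
# new-flow rate x timeout, which fits comfortably in the pool.
steady_state_tuned = agents * conns_per_sec * tuned_timeout

print(total_flows, exhaustion_default, steady_state_tuned)
```

This is why the tuned template's 180s xlate-timeout alone is usually enough to keep a 4M-entry pool out of exhaustion at this load profile.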
Translation timeout spike¶
Triggered by NATTranslationTimeoutSpike when translation timeout rate > 100/sec sustained 1m.
What it means: Many flows are being aged out by the xlate-timeout rather than naturally closed (RST/FIN). Indicates short-lived flows holding xlate entries past their natural lifetime.
Operator response:
1. This alert does not invalidate the run; its severity is informational.
2. Tune NAT template for next run:
- Cisco FTD: lower xlate-timeout to 60s (matches synthetic-load engine burst pattern)
- Fortinet: lower tcp-halfclose-timer to 15s (releases pending closes faster)
3. Re-run; expect measurably higher CPS without retuning agents.
How to interpret Annex H + I in the report¶
Annex H — NAT Engine Performance (rendered for nat_mode: pat runs)¶
- §H.1 Configuration applied — the NAT template + tuning values used. Verify these match what you intended (and what the customer's production NGFW uses).
- §H.2 Runtime metrics — pool peak utilization is the headline number. Anything above 80% is yellow flag; exhaustion events > 0 is red flag.
- §H.3 Capacity correlation — CPU breakdown reveals where the engine is constrained. NAT slice >25% means NAT is a meaningful bottleneck; <10% means NAT is essentially free for this NGFW.
Annex I — Capacity Quadrant Decomposition (rendered for type: matrix runs)¶
- §I.2 Decomposition table — the four numbers (NAT alone, decrypt alone, combined, synergy) are the customer's main answer. Synergy near zero means NAT and decrypt are orthogonal CPU paths.
- §I.3 Where the CPU goes — per-quadrant CPU breakdown. This is the slide the customer's network architect will screenshot.
- §I.4 Customer guidance — ranked optimization paths. Use this verbatim in the customer-facing briefing.
Plan reference¶
| Plan identifier | Mode | Duration | Use case |
|---|---|---|---|
| CAP-FIND-KNEE-30M-Q1 | nat: disabled, decrypt: off | 30 min | Baseline; first run of any engagement |
| CAP-FIND-KNEE-30M-Q2 | nat: disabled, decrypt: on | 30 min | Decrypt cost in isolation |
| CAP-FIND-KNEE-30M-Q3 | nat: pat, decrypt: off | 30 min | NAT cost in isolation |
| CAP-FIND-KNEE-30M-Q4 | nat: pat, decrypt: on | 30 min | Production scenario |
| CAP-FIND-KNEE-MATRIX-2H | All four sequentially | 2 hours | Full quadrant decomposition |
See platform/test-plans/catalog.yaml for the canonical definitions.
Related¶
- docs/REPORTS.md — Test Run Reports overview
- docs/TEST_PLANS.md — Test plan catalog reference
- docs/MONITORING_TEST_VALIDITY.md — Alerts that prove the test bed itself was healthy
- dashboard/templates/annex-h-nat-engine.md.tmpl — Annex H template
- dashboard/templates/annex-i-quadrant-decomposition.md.tmpl — Annex I template
- observability/grafana/dashboards/dut-nat-engine.json — live dashboard during runs
- k8s/dut/96-nat-engine-rules.yaml — Prometheus alerts referenced above