NAT Testing Modes

Status: Foundation in v4.3 — ADR 0007 §16 (matrix) + §22 (orchestrator) + §26 (Annex H + I).

Scope status (post-Scope-Freeze 2026-05-10) — NAT testing modes integrate with the SDWAN/CoR-N.Art module (modules/sdwan-cor-art.md) for DIA (Direct Internet Access) scenarios. The legacy term "VPN-REMOTE" refers to the same data-plane leg as the SDWAN/CoR-N.Art module's local mode. Real cloud-endpoint testing is available via the Cloud Endpoint Service (ADR 0023).

Why test both NAT modes

Production NGFW capacity is shaped by two largely independent CPU costs: TLS crypto and NAT translation. Customers frequently misattribute throughput limits to decrypt when the real bottleneck is NAT, or vice versa. A test that runs only the production configuration cannot decompose these two costs.

This test-bed exercises the NGFW under two NAT modes:

  • nat_mode: pat — source-PAT (NAT Overload). NGFW maintains a translation table and rewrites source IP/port for every flow. Production-realistic; CPU- and memory-heavy.
  • nat_mode: disabled — pure routed forwarding. NGFW only inspects and forwards; no translation, no xlate table. Theoretical max throughput for the device under decrypt load.

Each test plan declares which mode it requires. The DUT API (dashboard/src/lib/dut-api/) applies the corresponding vendor-specific config before the run starts and reverts at the end.
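
The apply/revert contract can be sketched as a context manager. This is an illustrative sketch only — the real DUT API lives in dashboard/src/lib/dut-api/ and its names, template paths, and vendor keys will differ:

```python
from contextlib import contextmanager

# Hypothetical vendor/mode -> template mapping; illustrative names only.
VENDOR_TEMPLATES = {
    ("cisco-ftd", "pat"): "nat_pat.tmpl",
    ("cisco-ftd", "disabled"): "nat_disabled.tmpl",
}

@contextmanager
def nat_mode(vendor: str, mode: str):
    """Apply the vendor-specific NAT config before the run, revert after."""
    if mode not in ("pat", "disabled"):
        raise ValueError(f"unknown nat_mode: {mode}")
    template = VENDOR_TEMPLATES[(vendor, mode)]
    state = f"applied {template} on {vendor}"
    try:
        yield state      # the test run executes inside this block
    finally:
        # Revert to the pre-run config even if the run raised an error,
        # matching the "reverts at the end" guarantee described above.
        pass

with nat_mode("cisco-ftd", "pat") as state:
    print(state)
```

The try/finally shape matters: the revert must run even when a sub-run fails, otherwise the next plan starts against a dirty NGFW config.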

What changes inside the NGFW

| Aspect | nat_mode: pat | nat_mode: disabled |
|---|---|---|
| Translation table writes | Yes — every new flow consumes one xlate entry | None |
| Source/dest rewrites | Source IP + port rewritten per packet | Pass-through |
| Conntrack inflation | High — each NAT'd flow takes one conntrack entry | Low — inspection state only |
| CPU profile (typical) | NAT engine 15–25%, crypto 40–55%, inspection 15–20% | NAT engine 0%, crypto 50–60%, inspection 25–30% |
| Throughput penalty | 15–30% vs nat_mode: disabled baseline | Reference baseline |
| Failure mode | Pool exhaustion under stress (test results invalidated) | Connection limit / inspection backlog |

When to choose each mode

| Scenario | Recommended mode |
|---|---|
| Customer commissioning a new NGFW for capacity planning | Both — run the MATRIX plan (CAP-FIND-KNEE-MATRIX-2H) for full quadrant decomposition |
| Quick sanity check (NGFW config validity, decrypt working) | nat_mode: disabled (Q1 baseline) — fastest, fewest variables |
| Full production-mirror test | nat_mode: pat (Q4) |
| Isolating NAT engine vs crypto engine cost | nat_mode: disabled + decrypt: on (Q2) AND nat_mode: pat + decrypt: off (Q3); compare deltas |
| Testing a NAT pool sizing decision | nat_mode: pat with a reduced pool — verify exhaustion triggers the expected alerts |

The 2×2 quadrant matrix

The MATRIX plan (CAP-FIND-KNEE-MATRIX-2H) orchestrates four 30-minute sub-runs sequentially:

                         WITHOUT NAT             WITH NAT
                  ┌───────────────────────┬───────────────────────┐
   WITHOUT        │ Q1 Baseline           │ Q3 NAT-only           │
   DECRYPT        │ raw forward           │ isolates NAT cost     │
                  ├───────────────────────┼───────────────────────┤
   WITH           │ Q2 Decrypt-only       │ Q4 Production         │
   DECRYPT        │ isolates crypto cost  │ NAT + decrypt         │
                  └───────────────────────┴───────────────────────┘

The result is a single Test Run Report containing Annex I — Capacity Quadrant Decomposition with delta arithmetic (NAT alone, decrypt alone, combined, synergy bonus) and customer-facing optimization guidance.
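
The Annex I delta arithmetic reduces to four subtractions over the per-quadrant throughput numbers. A minimal sketch — the input values below are illustrative, not figures from a real report, and the function name is an assumption:

```python
def decompose(q1: float, q2: float, q3: float, q4: float) -> dict:
    """Decompose quadrant throughputs (same unit, e.g. Gbps) into costs.

    q1: baseline (no NAT, no decrypt)    q3: NAT only
    q2: decrypt only                     q4: NAT + decrypt (production)
    """
    nat_cost = q1 - q3           # throughput given up to NAT alone
    decrypt_cost = q1 - q2       # throughput given up to decrypt alone
    combined_cost = q1 - q4      # cost of both features together
    # Synergy: cost beyond the sum of the individual costs.
    # Near zero means NAT and decrypt are orthogonal CPU paths.
    synergy = combined_cost - (nat_cost + decrypt_cost)
    return {"nat": nat_cost, "decrypt": decrypt_cost,
            "combined": combined_cost, "synergy": synergy}

# Illustrative numbers only:
print(decompose(q1=10.0, q2=6.0, q3=8.0, q4=4.5))
```

A clearly non-zero synergy term is the signal worth investigating: it means running NAT and decrypt together costs more (or less) than their isolated costs predict.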

See dashboard/templates/annex-i-quadrant-decomposition.md.tmpl for the full annex layout.

Vendor-specific tuning (per nat_pool_overload: tuned)

When a plan declares nat_pool_overload: tuned, the DUT API applies these tunings per vendor before the run starts. Templates live in dashboard/src/lib/dut-api/<vendor>/templates/nat_pat.tmpl.

Cisco FTD (cdFMC / FMC)

| Setting | Tuned value | Why |
|---|---|---|
| xlate-timeout | 180s (vs default 3600s) | Returns translation entries to the pool faster on stress runs |
| tcp-halfclose-timer | 15s (vs default 60s) | Releases pending TCP-RST/FIN flows quickly |
| nat-protocol-table-size | per-platform max (e.g., 4M on FTD 4115) | Maximizes pool capacity |
| nat-rule-prioritization | enabled | Avoids pool fragmentation under heavy churn |

Fortinet FortiGate

| Setting | Tuned value | Why |
|---|---|---|
| tcp-timewait-timer | 30 | Releases TIME_WAIT sockets quickly |
| tcp-halfclose-timer | 15 | Releases pending TCP closes quickly |
| port-block-allocation | enable | Reduces pool exhaustion at high CPS |
| central-snat-policy | one wide-net rule (vs many specific) | Reduces lookup overhead |

Palo Alto Networks (PAN-OS)

| Setting | Tuned value | Why |
|---|---|---|
| dynamic-ip-and-port translation timeout | 180s | Aggressive recycling |
| tcp half-closed timeout | 15s | Same rationale as FTD |
| oversubscription | enabled with rate 4 | Allows 4× more concurrent NAT translations per public IP |

The actual values used in any specific run are recorded in Annex H — NAT Engine Performance, sub-section H.1 (vendor_template_id + vendor_template_sha256), so reviewers can verify the tuning was applied as documented.

Reading the alerts (runbook anchors)

The Prometheus alerts in k8s/dut/96-nat-engine-rules.yaml carry runbook_url annotations pointing at this document. Each anchor below is the runbook for the corresponding alert.

Pool utilization warning

Triggered by NATPoolUtilizationHigh when pool > 80% sustained 30s.

What it means: NAT pool is approaching exhaustion under current load.

Operator response:

1. Check the NAT engine dashboard (Grafana → dut-nat-engine) for trajectory — is utilization climbing or stable?
2. If climbing and the run is long-running (soak / 8h): consider stopping the run and re-running with a smaller agent fleet.
3. If stable around 80–90% throughout the test: the capacity claim is valid, but the customer should know this NGFW model would exhaust at a slightly higher load.

Tuning option: lower xlate-timeout to 60s (Cisco FTD) or tcp-timewait-timer to 15s (Fortinet) — reclaims pool entries faster.

Pool exhaustion

Triggered by NATPoolExhaustion when any exhaustion event occurs in last 1m.

What it means: NGFW silently dropped at least one translation request because the pool was full. Test results are invalidated — the test bed is measuring failure, not capacity.

Operator response:

1. Immediately stop the run if it is still active.
2. Annex H of the report will mark this run as invalidated, with the exhaustion event count.
3. Increase the NAT pool size in the DUT API NAT template (vendor-specific).
4. Re-run with the corrected configuration.

Root cause guidance: at 1000 synthetic-load agents × ~10 connections/sec × 30 min = ~18M flows; if your pool is < 4M entries with xlate-timeout: 3600s, exhaustion is mathematically inevitable. The nat_pool_overload: tuned template addresses this — verify it was applied (Annex H §H.1).
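
The arithmetic above can be checked directly. The peak-occupancy estimate below (arrival rate × effective entry lifetime) is a worst-case simplification that assumes no flow closes cleanly with RST/FIN, so every entry lives until the timeout:

```python
agents = 1000
conns_per_sec = 10            # per agent, approximate
run_seconds = 30 * 60
pool_size = 4_000_000         # e.g. FTD 4115 per-platform max

flow_rate = agents * conns_per_sec        # 10,000 new flows/sec
total_flows = flow_rate * run_seconds     # 18M flows over the run

def peak_xlate_entries(xlate_timeout_s: int) -> int:
    # Occupancy plateaus at rate x timeout, capped by the run length
    # (entries created at t=0 cannot outlive the run itself).
    return flow_rate * min(xlate_timeout_s, run_seconds)

print(total_flows)                              # 18,000,000 flows
print(peak_xlate_entries(3600) > pool_size)     # default 3600s timeout
print(peak_xlate_entries(180) > pool_size)      # tuned 180s: 1.8M entries
```

With the default 3600s timeout the pool would need all 18M entries live at once; the tuned 180s timeout caps steady-state occupancy at 1.8M, comfortably inside a 4M pool.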

Translation timeout spike

Triggered by NATTranslationTimeoutSpike when translation timeout rate > 100/sec sustained 1m.

What it means: Many flows are being aged out by the xlate-timeout rather than closed naturally (RST/FIN). Short-lived flows are ending without a clean close, so their xlate entries linger past the flow's actual lifetime until the timeout reclaims them.

Operator response:

1. Doesn't invalidate the run; informational severity.
2. Tune the NAT template for the next run:
  • Cisco FTD: lower xlate-timeout to 60s (matches the synthetic-load engine burst pattern)
  • Fortinet: lower tcp-halfclose-timer to 15s (releases pending closes faster)
3. Re-run; expect measurably higher CPS without retuning agents.

How to interpret Annex H + I in the report

Annex H — NAT Engine Performance (rendered for nat_mode: pat runs)

  • §H.1 Configuration applied — the NAT template + tuning values used. Verify these match what you intended (and what the customer's production NGFW uses).
  • §H.2 Runtime metrics — pool peak utilization is the headline number. Anything above 80% is yellow flag; exhaustion events > 0 is red flag.
  • §H.3 Capacity correlation — CPU breakdown reveals where the engine is constrained. NAT slice >25% means NAT is a meaningful bottleneck; <10% means NAT is essentially free for this NGFW.
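
The §H.2 yellow/red thresholds above can be written as a small helper. The function name and return strings are illustrative; only the 80% utilization and exhaustion-count thresholds come from the annex definition:

```python
def pool_flag(peak_utilization_pct: float, exhaustion_events: int) -> str:
    """Classify Annex H §H.2 headline numbers per the thresholds above."""
    if exhaustion_events > 0:
        return "red"      # run invalidated: measuring failure, not capacity
    if peak_utilization_pct > 80.0:
        return "yellow"   # valid claim, but the model is near its pool ceiling
    return "green"

print(pool_flag(peak_utilization_pct=86.0, exhaustion_events=0))
```

Note the ordering: any exhaustion event outranks utilization, since even a single silent drop invalidates the capacity claim regardless of the peak percentage.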

Annex I — Capacity Quadrant Decomposition (rendered for type: matrix runs)

  • §I.2 Decomposition table — the four numbers (NAT alone, decrypt alone, combined, synergy) are the customer's main answer. Synergy near zero means NAT and decrypt are orthogonal CPU paths.
  • §I.3 Where the CPU goes — per-quadrant CPU breakdown. This is the slide the customer's network architect will screenshot.
  • §I.4 Customer guidance — ranked optimization paths. Use this verbatim in the customer-facing briefing.

Plan reference

| Plan identifier | Mode | Duration | Use case |
|---|---|---|---|
| CAP-FIND-KNEE-30M-Q1 | nat: disabled, decrypt: off | 30 min | Baseline; first run of any engagement |
| CAP-FIND-KNEE-30M-Q2 | nat: disabled, decrypt: on | 30 min | Decrypt cost in isolation |
| CAP-FIND-KNEE-30M-Q3 | nat: pat, decrypt: off | 30 min | NAT cost in isolation |
| CAP-FIND-KNEE-30M-Q4 | nat: pat, decrypt: on | 30 min | Production scenario |
| CAP-FIND-KNEE-MATRIX-2H | All four sequentially | 2 hours | Full quadrant decomposition |

See platform/test-plans/catalog.yaml for the canonical definitions.