
ADR 0008 — Branch Office Simulation: free-form bandwidth shaping at the WAN boundary

  • Status: Accepted
  • Date: 2026-05-08
  • Deciders: TLSStress.Art project
  • Targets: v4.4.0

Context

The current TLSStress.Art test bench characterizes the NGFW under "datacenter-grade" conditions: the simulated ISP link between the NGFW outside interface and the VyOS-ISP router (the 200.130.0.0/30 edge link) is unconstrained. Whatever throughput the NGFW + crypto cores can sustain, the bench measures.

This is NOT how production NGFWs are deployed at branch offices. A typical enterprise branch has:

  • 100 Mbps cable / fiber ISP link (most common for SMBs)
  • 50 Mbps DSL (legacy / rural)
  • 200 Mbps dedicated MPLS (mid-market enterprise)
  • 1 Gbps fiber (large branch)
  • Often asymmetric (e.g. 100/30 Mbps down/up on cable, 1G/100M on prosumer fiber)

In production, the NGFW often never approaches its peak throughput because the WAN link is the bottleneck. The interesting failure modes shift:

Datacenter scenario                      Branch office scenario
───────────────────────────────────────  ───────────────────────────────────────
Bulk crypto throughput dominates         Per-session latency dominates
Aggregate Gbps matters                   Connection count matters
Buffers run dry → drops                  Buffers fill → queueing
TLS bulk record decryption is the cost   TLS handshake CPU is the cost
NAT pool sized by total flows            NAT pool sized by sustained connections

The NGFW behaves DIFFERENTLY in these two regimes. An NGFW that benchmarks beautifully at 10 Gbps in the lab can still misbehave at 100 Mbps in a branch: buffer bloat under tc shaping, a cipher fast-path that is warmed but never fully utilized, NAT pool churn under low data volume but high connection count.

Customers ask "how does my NGFW behave at MY branch's link speed?" — and we don't have a clean answer today.

Decision

Add Branch Office Simulation as a new test capability in v4.4. Operator-controllable bandwidth shaping is applied at the VyOS-ISP router (the simulated carrier edge), bidirectional and asymmetric, with no other modifications to the test bench.

Locked-in design choices

  • Bandwidth input: free-form integer + unit (Mbps or Gbps). No fixed presets; the operator types the exact value (e.g. 100 Mbps, 1 Gbps). Range: 1 Mbps ≤ value ≤ 100 Gbps.
  • Asymmetry: YES, separate down_mbps and up_mbps inputs. Real ISPs are almost always asymmetric; cable/DSL/fiber-to-the-home links typically run near a 5:1 ratio. Symmetric is the special case (down == up).
  • Latency injection: NO, bandwidth shaping ONLY. Real ISP latency comes from the physical media + routing, which our lab cabling doesn't reproduce realistically. Adding fake latency would compound an artifact, not make the simulation more faithful. We measure what the NGFW does under bandwidth pressure alone.
  • Shaping location: VyOS-ISP eth1 (the WAN /30 toward the NGFW). This is the simulated carrier edge, the same place real ISPs apply their shaping.
  • Shaping technology: VyOS native QoS (set qos policy shaper). Wraps Linux tc with HTB (egress) and IFB (ingress); VyOS abstracts the kernel complexity into a declarative interface and avoids hand-rolled tc commands that age poorly.
  • AQM: NONE (HTB classful + bfifo default queue). Adding fq_codel or cake would introduce an Active Queue Management artifact that masks the NGFW's actual buffer behavior. We want the operator to see the NGFW's buffer behavior, not VyOS's queue smarts.
  • Matrix relationship: orthogonal AXIS, NOT a 5th quadrant. Q1-Q4 isolate NGFW VARIABLES (NAT, decrypt); bandwidth is an ENVIRONMENT variable, not an NGFW variable. The two compose: Quadrant × Environment Profile. The operator picks both at run launch.
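As a rough sketch of what the PR-B configuration could look like (illustrative only: VyOS QoS syntax differs between releases, and the 100/30 Mbps rates are example values, not locked-in defaults):

```text
# Illustrative VyOS 1.4-style commands; exact syntax varies by release.
configure

# Downstream cap toward the NGFW: plain HTB shaper with a drop-tail
# (FIFO) default queue and no AQM, per the decision above.
set qos policy shaper BO-DOWN bandwidth 100mbit
set qos policy shaper BO-DOWN default bandwidth 100%
set qos policy shaper BO-DOWN default queue-type drop-tail

# Upstream cap (asymmetric case): same shape, lower rate.
set qos policy shaper BO-UP bandwidth 30mbit
set qos policy shaper BO-UP default bandwidth 100%
set qos policy shaper BO-UP default queue-type drop-tail

# Attach both directions on the WAN /30 interface; the ingress side
# is realized via an IFB redirect under the hood.
set qos interface eth1 egress BO-DOWN
set qos interface eth1 ingress BO-UP

commit
```

The actual commands land in scripts/bo-shape.sh in PR-B; this sketch only shows the shape of the declarative interface the ADR relies on.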

What "Branch Office Simulation" enables (and doesn't)

ENABLES:

  • Real-world ISP scenario reproduction (100 Mbps asymmetric branch, 500 Mbps regional HQ, 1 Gbps datacenter, etc.)
  • Per-session NGFW behavior under bandwidth contention
  • TLS handshake CPU saturation testing (handshakes/sec is bandwidth-independent; the bench reveals when the NGFW handshake engine bottleneck arrives REGARDLESS of bulk throughput)
  • NAT pool churn measurement under sustained low-volume connections
  • Latency P99 distribution under queueing pressure (not at wire rate)

DOES NOT:

  • Inject artificial latency (RTT spikes from physical distance, jitter from radio media; those need explicit latency injection, a future v4.x item if customer demand emerges)
  • Simulate ISP-side packet loss (also a future addition)
  • Shape per-application or per-flow (single global shaper per direction)
  • Replace the existing Q1-Q4 quadrant matrix (it composes with it)

Decomposition: how this fits the test matrix

                  Quadrant (NGFW config)          ×   Environment Profile
                  ──────────────────────────────────────────────────────────
                  Q1 baseline (no NAT, no DPI)    ×   Datacenter (no shaping)   ← today
                  Q4 production (NAT + decrypt)   ×   Datacenter (no shaping)   ← today
                  Q4 production (NAT + decrypt)   ×   Branch-100M (asymmetric)  ← NEW
                  Q4 production (NAT + decrypt)   ×   Remote-DSL (50/10M)       ← NEW
                  Q4 production (NAT + decrypt)   ×   Regional-1G (symmetric)   ← NEW
                                                      ↑
                                                      NEW dimension

UI on the test plan launcher gains an "Environment" picker:

Quadrant:    [Q4 production ▾]
Environment: ○ Datacenter (no shaping)
             ● Branch Office Simulation
                 Down: [_____] [Mbps▾]
                 Up:   [_____] [Mbps▾]

Implementation summary (across the 4 v4.4 PRs)

  1. PR-A (this) — methodology ADR + 3-language operator runbook (docs/BRANCH_OFFICE.md + .pt-BR.md + .es.md).

  2. PR-B — VyOS QoS configuration:
     • scripts/bo-shape.sh (kubectl-exec'd into the VyOS pod, applies set qos policy shaper for both directions, idempotent)
     • scripts/bo-unshape.sh (clean removal)
     • scripts/bo-verify.sh (queries the actually configured rate vs the requested one, emits JSON consumed by Annex J)

  3. PR-C — test plan engine + dashboard:
     • Test plan schema gains an optional environment.bandwidth field ({down_mbps, up_mbps} or null)
     • Dashboard launcher adds the Environment radio + numeric inputs
     • Test-run engine calls scripts/bo-shape.sh before agents start and scripts/bo-unshape.sh at run cleanup
     • Existing test plans remain backward compatible (no environment block = datacenter, no shaping)

  4. PR-D — Annex J template:
     • New section in the run report attesting:
         ◦ Requested vs actually configured rate (in case tc rounded)
         ◦ Pre-test ping/iperf3 baseline (proves the shaper is in place)
         ◦ Per-direction byte counters at run end (proves the shaper was enforced)
         ◦ Per-direction drop counters (shows how often the cap was hit)
     • Verifier script wired into preflight (similar to airgap-l2-verify)
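For illustration, a PR-C test plan opting into shaping might carry an environment block like the following (the down_mbps/up_mbps field names and the null fallback come from this ADR; the surrounding keys are hypothetical until PR-C fixes the schema):

```json
{
  "quadrant": "Q4",
  "environment": {
    "bandwidth": { "down_mbps": 100, "up_mbps": 30 }
  }
}
```

Omitting the environment block (or setting bandwidth to null) preserves today's unshaped datacenter behaviour.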

Consequences

Positive

  • First test bench in the market that combines NAT × Decrypt decomposition WITH realistic WAN bandwidth conditions on the same NGFW interface. IXIA/Spirent test these separately.
  • Customer-relevant: every branch deployment SE asks this question; today they get vendor catalog numbers.
  • Reproducible: free-form input + Annex J attesting actual rate means a customer can re-run with their real WAN bandwidth and get comparable numbers.
  • Composable: doesn't break Q1-Q4. Operator picks both axes.

Negative

  • tc (Linux Traffic Control) introduces its own queueing, which could mask NGFW behavior. Mitigation: HTB classful + bfifo default queue (no AQM smarts), and Annex J reports per-direction drop counters so the operator can see if tc is dropping vs the NGFW.
  • VyOS QoS is per-interface, not per-quadrant. A test campaign comparing "Q4 + Branch-100M" vs "Q4 + Datacenter" requires re-running with shaping on/off — can't be done in a single sweep.
  • Asymmetric ingress shaping requires ifb (Intermediate Functional Block) — VyOS handles this transparently but Annex J should disclose it for forensic completeness.
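For that forensic disclosure, the kernel-level plumbing behind VyOS ingress shaping is roughly the following (a sketch only; VyOS generates and owns the real commands, and the device names and 30 Mbit / 375 kB figures are illustrative):

```text
# Requires root; shown here only to document what VyOS does underneath.
modprobe ifb numifbs=1
ip link set dev ifb0 up

# tc can only queue on egress, so ingress traffic on eth1 is redirected
# through the IFB device, where it can be shaped as if it were egress.
tc qdisc add dev eth1 handle ffff: ingress
tc filter add dev eth1 parent ffff: matchall \
  action mirred egress redirect dev ifb0

# Plain HTB + bfifo (no AQM, per this ADR): a 30 Mbit cap with a
# byte-sized FIFO (375000 bytes is roughly 100 ms of queue at 30 Mbit/s).
tc qdisc add dev ifb0 root handle 1: htb default 10
tc class add dev ifb0 parent 1: classid 1:10 htb rate 30mbit
tc qdisc add dev ifb0 parent 1:10 bfifo limit 375000
```

Annex J only needs to disclose that this redirect exists, not reproduce the commands; the sketch is for reviewers of PR-B.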

Neutral

  • Bandwidth shaping is independent of the IPsec/SD-WAN work (v4.6/v4.8) — the SD-WAN tunnel will share the same shaped WAN link, which is exactly the realistic scenario for SD-WAN over rate-limited DIA.

Operator-facing rules

  1. Default behaviour when no Environment is selected: NO shaping (current bench behaviour preserved). Test plans without an environment.bandwidth block run unconstrained.

  2. Both directional inputs are mandatory when shaping is enabled: the UI requires both down_mbps and up_mbps (equal values for the symmetric case are fine, but the operator must enter both intentionally).

  3. Pre-test verification runs automatically: scripts/bo-verify.sh confirms VyOS QoS is configured AND measures actual capacity via a 5-second iperf3 baseline before launching the test agents. If the measured capacity deviates from the configured rate by more than 10%, the run is blocked (production gating) or flagged with a warning (observational dev runs).

  4. Run report Annex J discloses both the REQUESTED rate (operator input) and the MEASURED rate (verifier output). If they diverge, the report makes that visible — never silently rounded.
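The 10% gate in rule 3 could be sketched as follows (hypothetical: scripts/bo-verify.sh lands in PR-B and may implement it differently; in the real script the measured value would come from iperf3's JSON output):

```shell
#!/bin/sh
# Hypothetical sketch of the tolerance gate in scripts/bo-verify.sh.
# Both arguments are plain Mbps integers.
within_tolerance() {
  requested=$1
  measured=$2
  # Signed deviation in percent, integer arithmetic.
  dev=$(( (measured - requested) * 100 / requested ))
  # Absolute value via parameter expansion, compared against the 10% gate.
  [ "${dev#-}" -le 10 ]
}

within_tolerance 100 93 && echo "OK: measured rate within 10% of requested"
within_tolerance 100 85 || echo "BLOCK: shaper off by more than 10%"
```

Whether a failed check blocks or merely warns is decided by the run mode, per rule 3.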

Alternatives considered

Alternative A — Hardware shaper (e.g. Cisco N9K QoS)

Rejected. Forces dependency on Nexus QoS configuration, which differs across NX-OS versions and is harder to automate. VyOS QoS is software on a Linux pod — fully reproducible and portable across labs.

Alternative B — Inject latency alongside bandwidth (tc qdisc netem)

Rejected for v4.4. Latency is a separate concern with its own methodology questions (constant vs variable, per-flow vs per-link, modeling RTT vs jitter). Bundling latency into BO would make BO do too much. Future ADR may add a "Latency Profile" axis if customer demand emerges.

Alternative C — Per-quadrant shaping (different bandwidth per Q1-Q4)

Rejected. The whole point of Q1-Q4 is to isolate NGFW variables with the environment held constant. Allowing per-quadrant shaping would make decomposition impossible (you couldn't say "Q4 - Q1 = production overhead" if Q1 and Q4 ran at different bandwidths).

Alternative D — AQM-friendly defaults (fq_codel or cake)

Rejected. AQM smooths the queueing artifact and gives nicer-looking latency curves. But the customer's NGFW under their REAL ISP link is NOT going to have AQM in the carrier router — the goal is to expose NGFW behavior, not to make the test bench look good. Plain HTB + bfifo is the methodologically correct choice.

Implementation references

  • scripts/bo-shape.sh — applies VyOS QoS shaping (PR-B)
  • scripts/bo-unshape.sh — removes shaping (PR-B)
  • scripts/bo-verify.sh — verifies actual vs requested rate (PR-B)
  • k8s/dut/45-vyos-isp-router.yaml — VyOS pod (no change in PR-B; the ConfigMap stays L3-only as documented in PR-C of v4.3.1; shaping is applied at runtime via kubectl exec, not baked into the manifest)
  • dashboard/src/components/TestPlanLauncher.tsx — Environment picker UI (PR-C)
  • dashboard/src/lib/test-plans/schema.ts — environment.bandwidth schema field (PR-C)
  • dashboard/templates/annex-j-branch-office.md.tmpl — new annex (PR-D)
  • docs/BRANCH_OFFICE.md (en/pt-BR/es) — operator runbook (this PR)

References

  • ADR 0007 (Public-Internet Realism) — establishes the VyOS-ISP topology this ADR builds on
  • ADR 0009 (L2 BPDU isolation) — v4.3.1 prerequisite (BPDU filter must be enforced before BO touches VyOS interface state)
  • VyOS QoS handbook — set qos policy shaper reference
  • Linux tc-htb(8) — kernel-level mechanics under VyOS abstraction
  • RFC 2544 (Benchmarking Methodology for Network Interconnect Devices) — baseline measurement methodology for the verifier
  • RFC 6349 (Framework for TCP Throughput Testing) — TCP-specific measurement methodology, applied by the verifier's iperf3 baseline