Linux host tuning — REQUIRED for the persona stacks and the agent fleets¶
Scope status (post-Scope-Freeze 2026-05-10) — See ARCHITECTURE.md for the canonical 37 MÓDULOs + 7 Test Kinds + DOM/CPOS/PIE-PA safety architecture. ADRs 0014, 0019-0025 cover post-Freeze additions.
Target hardware: Ubuntu 22.04 LTS or newer, x86_64 or arm64. Goal: extract maximum HTTP/2 + HTTP/3 throughput from the persona Caddy pool and keep the browser-engine + synthetic-load agent fleets stable at hundreds of replicas. Tuning spans three layers: container-level (docker-compose.webserver.yml), in-cluster runtime (the k8s/dut/85-node-tuning.yaml DaemonSet), and host-level (this doc + the automation script).
Why this matters — for both webservers AND agents¶
| Layer | Symptoms without tuning |
|---|---|
| Synthetic Personas + Cloned Personas (Caddy) | QUIC capped at ~30 Mbps per replica (kernel UDP buffer overflow), TLS handshake p99 jitter (~1 ms extra after every keepalive idle), CPU contention with kubelet/system processes during crypto bursts |
| browser-engine fleet (300 replicas) | DNS resolution stalls (default ephemeral port range exhaustion), TIME_WAIT zombies hold sockets for a minute each, blocking new connections |
| synthetic-load fleet (1000 replicas) | Random "TCP connect timeout" errors as the 262 K-default conntrack table overflows under burst-close load |
| Cloner + NFS server | NFS over OOBI starves at default UDP buffers when the cloner writes a large mirror; periodic stalls visible to slot pods |
The same sysctls help all four workload classes. Only cpuManagerPolicy: static is workload-specific — apply on persona-hosting nodes (Guaranteed QoS), skip on agent-hosting nodes (Burstable QoS, HPA-scaled).
Quick start — scripts/host-tuning.sh¶
The one-shot easy button. Idempotent (re-run safe), supports apply / status / remove.
# Persona-hosting nodes (role=ngfw-dut for synthetic + slots,
# role=infra for Cloner + NFS):
sudo scripts/host-tuning.sh apply --enable-cpu-pinning
# Agent-hosting nodes (role=playwright, role=k6):
sudo scripts/host-tuning.sh apply # NO --enable-cpu-pinning
# Single-node — every workload on the only host:
sudo scripts/host-tuning.sh apply --enable-cpu-pinning
# Verify on every node (coloured ✓ / △ / ✗ report)
sudo scripts/host-tuning.sh status
# Roll back
sudo scripts/host-tuning.sh remove --enable-cpu-pinning
What it does:
- Writes /etc/sysctl.d/99-ai-forse-perf.conf with every value the in-cluster node-tuning DaemonSet writes — making the values survive host reboots before the DaemonSet pod even starts.
- Writes /etc/modules-load.d/ai-forse.conf so nf_conntrack + tls load at boot.
- Installs cpu-perf.service — a systemd unit that pins every CPU to the performance governor at boot, before any workload schedules.
- Sets transparent huge pages to always with defer+madvise defrag.
- (With --enable-cpu-pinning) patches the kubelet config to enable cpuManagerPolicy: static + topologyManagerPolicy: single-numa-node, clears cpu_manager_state, and restarts the kubelet. Vanilla kubeadm and k3s are auto-detected.
Legacy mode (Docker Compose only — no longer the canonical path):
sudo scripts/tune-ubuntu-host.sh
scripts/stack-up.sh restart-webserver
The legacy script (scripts/tune-ubuntu-host.sh) covers only the webserver Docker stack — host-tuning.sh supersedes it for Kubernetes deploys.
Why each knob matters¶
UDP buffers — critical for HTTP/3¶
| Knob | Default Ubuntu | We set | Why |
|---|---|---|---|
| net.core.rmem_max | 208 KiB | 7.5 MB | Caddy's quic-go asks for a 7 MiB receive buffer. Default = packet drops under load. Boot log without this fix: "failed to sufficiently increase receive buffer size (was: 1024 kiB, wanted: 7168 kiB, got: 2048 kiB)." |
| net.core.wmem_max | 208 KiB | 7.5 MB | Same on the send side |
| net.core.rmem_default / wmem_default | 208 KiB | 2.5 MB | Per-socket starting size; raised so even short-lived connections benefit |
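A quick sanity check once the values are in: these two sysctls are exactly what quic-go probes when it tries to grow its socket buffer (the expected output assumes the 7.5 MB value above).
sysctl net.core.rmem_max net.core.wmem_max
# net.core.rmem_max = 7500000
# net.core.wmem_max = 7500000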
TCP backlog & keep-alive — HTTP/2 and HTTP/1.1¶
| Knob | Default | We set | Why |
|---|---|---|---|
| net.core.somaxconn | 4 096 | 65 535 | Listener accept queue. Bursts of new HTTP/2 connections silently drop when this fills |
| net.ipv4.tcp_max_syn_backlog | 1 024 | 65 535 | Half-open connection backlog (SYN-RECEIVED state) |
| net.ipv4.tcp_synack_retries | 5 | 3 | How many times the kernel retries the SYN-ACK. The default of 5 means ~60 s of waiting per dead client |
| net.ipv4.tcp_tw_reuse | 0 | 1 | Allow reusing TIME_WAIT sockets for new outbound connections — useful when agents hammer the same webserver |
| net.ipv4.tcp_fin_timeout | 60 | 15 | Faster teardown of orphaned FIN-WAIT-2 sockets |
| net.ipv4.tcp_keepalive_time | 7 200 | 60 | Detect dead peers in 60 s, not 2 h |
| net.ipv4.tcp_keepalive_probes | 9 | 3 | Three probes is enough |
| net.ipv4.tcp_keepalive_intvl | 75 | 10 | 10 s between probes |
| net.ipv4.tcp_fastopen | 1 | 3 | TFO active for both client and server roles |
| net.core.netdev_max_backlog | 1 000 | 16 384 | Per-CPU NIC packet queue. 1 000 saturates at ~50 k pps on a single core |
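If you suspect the accept queue is still overflowing, the kernel's cumulative counters will say so; a check with standard tooling (counter wording varies slightly by kernel version):
# Listen-queue overflows and dropped SYNs since boot:
netstat -s | grep -i listen
# Per-socket view: on LISTEN sockets, ss shows the current backlog
# (Recv-Q) against the accept-queue limit (Send-Q):
ss -ltn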
Congestion control — BBR¶
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
BBR (Bottleneck Bandwidth and RTT) is dramatically better than the default cubic for high-throughput long-fat-network scenarios — typical for a webserver serving the public internet. Pair it with the fq qdisc for pacing (kernels before 4.13 required it; newer kernels can pace internally, but fq remains the recommended companion). On Ubuntu 22.04+ both are available out of the box; on older releases run:
sudo modprobe tcp_bbr
echo tcp_bbr | sudo tee -a /etc/modules-load.d/bbr.conf
The verification step in tune-ubuntu-host.sh warns if BBR didn't activate.
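A manual check of what the running kernel offers (bbr only shows up once the module is loaded):
sysctl net.ipv4.tcp_available_congestion_control
# net.ipv4.tcp_available_congestion_control = reno cubic bbr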
File descriptors — system-wide ceiling¶
| Knob | Default Ubuntu 24.04 | We set | Why |
|---|---|---|---|
| fs.file-max | ~9.2 M | 26.2 M | At 30 webservers (20 Synthetic + 10 Cloned slots) × 1 048 576 fd/container = ~32 M theoretical. Default is tight |
| fs.nr_open | ~1 M | 26.2 M | Per-process ceiling — without this, the container-level ulimit nofile=1048576 can't actually be granted |
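To watch actual descriptor pressure against that ceiling, the kernel exposes a three-field counter (allocated, unused-but-allocated, max):
cat /proc/sys/fs/file-nr
# 18464   0   26214400    <- example output on a tuned host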
Conntrack — Docker bridge networking¶
Docker bridge networks (used by ai_forse_oobi and ai_forse_prod) route through the host's iptables, which means every active connection sits in the conntrack table.
| Knob | Default | We set | Why |
|---|---|---|---|
| net.netfilter.nf_conntrack_max | 65 536 | 1 048 576 | Default fills under 50 k+ active conns → kernel logs nf_conntrack: table full, dropping packet |
| net.netfilter.nf_conntrack_buckets | 16 384 | 262 144 | Hash-table buckets. Should be ~max/4 for good distribution |
| net.netfilter.nf_conntrack_tcp_timeout_established | 432 000 (5 d) | 1 200 (20 min) | Default keeps dead connections in the table for 5 days |
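Live occupancy versus the ceiling is a one-line check; worth alerting on well before count approaches max:
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max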
NIC ring buffers — tuned per device¶
The script reads each interface's max ring buffer size via ethtool -g and sets it. On a real NIC this is typically 4 096 RX + 4 096 TX (e.g. Intel ixgbe / Mellanox mlx5). On virtual NICs (QEMU virtio-net, vmxnet3, AWS ENA) the max is whatever the hypervisor exposes — the script falls back to a no-op if rejected.
Higher ring buffers = more headroom under packet bursts before the kernel starts dropping at the NIC level.
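The manual equivalent of what the script does, for a single interface (eth0 is a placeholder; substitute your uplink):
# Show the hardware maximums alongside the current settings:
ethtool -g eth0
# Raise to a typical physical-NIC maximum; a virtual NIC may reject
# this, which is harmless (the script treats a rejection as a no-op):
sudo ethtool -G eth0 rx 4096 tx 4096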
What we did NOT tune (yet)¶
These deliver real gains but require choices that depend on your specific hardware, so they're not in the auto-script:
| Topic | Why it's not auto-applied |
|---|---|
| IRQ affinity / RPS / RFS | Optimal CPU-to-IRQ mapping depends on core count, NUMA layout, and NIC queues. A wrong choice can hurt instead of help. See /proc/interrupts + manual /proc/irq/N/smp_affinity |
| NUMA pinning (numactl --cpunodebind) | Only useful on multi-socket servers (rare in cloud VMs) |
| Hugepages | Caddy's working set is small; little benefit |
| Disabling Spectre/Meltdown mitigations | Not recommended — recovering the usual 5-10 % mitigation overhead isn't worth the security trade-off |
| MTU 9000 (jumbo frames) | Requires every hop in the path to support it. Internal networks only |
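If you do decide to explore IRQ affinity by hand, look before touching anything (again, eth0 is a placeholder):
# Which cores currently service the NIC's queues:
grep eth0 /proc/interrupts
# Affinity mask for one IRQ (N comes from the first column above):
cat /proc/irq/N/smp_affinity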
If you need to push past what this doc gives you, the bottleneck almost certainly moves from kernel network stack to CPU — at which point you should profile Caddy with perf top -p $(pgrep caddy) and consider Layer B (custom Caddy build with GOAMD64=v3).
Container-level tuning (Layer A — opt-in)¶
The base docker-compose.webserver.yml ships without the kernel sysctls because Docker on macOS / Windows (LinuxKit VM) rejects writes to several /proc/sys/net/core/* files even from the container's own network namespace, with errors like:
OCI runtime create failed: open sysctl net.core.wmem_max:
reopen fd 8: permission denied
To opt in on a native Linux host (Ubuntu, Debian, RHEL, …), layer the tuning override:
WEBSERVER_LINUX_TUNING=1 scripts/stack-up.sh up
Or by hand:
docker compose -p ai_forse_webserver \
-f docker-compose.webserver.yml \
-f docker-compose.webserver.linux-tuning.yml \
up -d
Or persistently in .env:
COMPOSE_FILE=docker-compose.webserver.yml:docker-compose.webserver.linux-tuning.yml
The docker-compose.webserver.linux-tuning.yml override layers in:
ulimits:
nofile:
soft: 1048576
hard: 1048576
sysctls:
net.core.rmem_max: '7500000'
net.core.wmem_max: '7500000'
net.core.rmem_default: '2500000'
net.core.wmem_default: '2500000'
net.core.somaxconn: '65535'
net.ipv4.tcp_fastopen: '3'
These work without root on the host, and without privileged mode in the container — Docker scopes them to the container's network/process namespace.
The host-side tuning script (scripts/tune-ubuntu-host.sh) and the 99-ai-forse-webserver.conf sysctl drop-in do not depend on Layer A — they tune the kernel of the host itself, which benefits every container regardless of opt-in status. Recommended on every Ubuntu production host.
Verification¶
After applying, the quic-go boot warning below should no longer appear:
{"level":"info","msg":"failed to sufficiently increase receive buffer size
(was: 1024 kiB, wanted: 7168 kiB, got: 2048 kiB)..."}
Inside a running webserver:
docker exec ai_forse_webserver-webserver-1 sh -c '
cat /proc/sys/net/core/rmem_max
cat /proc/sys/net/core/somaxconn
cat /proc/sys/net/ipv4/tcp_fastopen
'
# Expected:
# 7500000
# 65535
# 3
On the host:
sysctl net.ipv4.tcp_congestion_control net.core.default_qdisc
# net.ipv4.tcp_congestion_control = bbr
# net.core.default_qdisc = fq
CPU pinning — cpuManagerPolicy: static (Kubernetes mode only)¶
The persona Caddy pods request integer CPU values with request == limit, which gives them Guaranteed QoS — the prerequisite for kubelet's exclusive-core allocator. Once cpuManagerPolicy: static is enabled, every Caddy pod gets cores fully reserved to it and never shares a core with another container. Eliminates the ~10–20 % p99 latency tail caused by context switches during TLS crypto bursts.
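Before flipping the policy it's worth confirming the pods really land in the Guaranteed class, since any non-integer or mismatched request/limit silently falls back to shared cores. A plain-kubectl check (the label selector is illustrative):
kubectl get pod -n persona-blog -l app=caddy -o \
  jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.qosClass}{"\n"}{end}'
# every line should end in "Guaranteed"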
Apply on every node hosting personas¶
# /var/lib/kubelet/config.yaml on each ngfw-dut node
cat <<'EOF' | sudo tee -a /var/lib/kubelet/config.yaml
cpuManagerPolicy: static
topologyManagerPolicy: single-numa-node
reservedSystemCPUs: "0,1" # cores 0+1 reserved for kubelet + system
EOF
# Static policy state lives at /var/lib/kubelet/cpu_manager_state — must
# be cleared once per node before the new policy takes effect.
sudo rm -f /var/lib/kubelet/cpu_manager_state
sudo systemctl restart kubelet
For k3s the kubelet config flag goes in /etc/rancher/k3s/config.yaml:
kubelet-arg:
- cpu-manager-policy=static
- topology-manager-policy=single-numa-node
- reserved-cpus=0,1
Then sudo rm -f /var/lib/rancher/k3s/agent/kubelet/cpu_manager_state and sudo systemctl restart k3s (or k3s-agent).
Verify¶
# After the kubelet restart and a fresh persona pod schedule:
kubectl exec -n persona-blog deploy/caddy -- sh -c \
'cat /sys/fs/cgroup/cpuset.cpus.effective'
# Should print a small contiguous range (e.g. "4-5"), not "0-63".
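The kubelet's own record of exclusive assignments is a second checkpoint; the state file should name the static policy and carve the pod's cores out of the default set:
sudo cat /var/lib/kubelet/cpu_manager_state
# {"policyName":"static","defaultCpuSet":"0-3,6-63","entries":{...},...}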
NUMA topology hint¶
If your UCS hosts have two NUMA nodes (typical), the topologyManagerPolicy: single-numa-node setting tells the kubelet to keep the pod's CPU + memory + (PCI device) allocations on the same NUMA node. For Caddy this matters because the macvlan NIC subinterface lives on one NUMA node — a pod scheduled across NUMA nodes pays a QPI / UPI hop on every packet.
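To check whether that hop is in play at all, inspect the topology and which node the parent NIC reports (eth0 is a placeholder):
lscpu | grep -i numa
# NUMA node(s):       2
# NUMA node0 CPU(s):  0-31
# NUMA node1 CPU(s):  32-63
cat /sys/class/net/eth0/device/numa_node
# 0    (-1 means the device reports no NUMA locality)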
Persistent sysctls (across reboots)¶
The node-tuning DaemonSet (k8s/dut/85-node-tuning.yaml) applies all sysctls + kernel-module loads at pod start, but the values revert to kernel defaults on host reboot. To survive reboots without waiting for the DaemonSet to come up, drop the same values into a sysctl.d file:
sudo tee /etc/sysctl.d/99-ai-forse-perf.conf > /dev/null <<'EOF'
# UDP / QUIC
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.core.rmem_default = 26214400
net.core.wmem_default = 26214400
# TCP / HTTP/2
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864
net.ipv4.tcp_congestion_control = bbr
net.core.default_qdisc = fq
# Latency / cwnd / PMTU
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_low_latency = 1
net.ipv4.tcp_mtu_probing = 1
net.core.so_max_pacing_rate = 0
# Connection capacity + TIME_WAIT
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.core.netdev_max_backlog = 65535
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_orphan_retries = 2
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_notsent_lowat = 131072
net.ipv4.ip_local_port_range = 10000 65535
net.ipv4.tcp_max_tw_buckets = 2000000
# conntrack
net.netfilter.nf_conntrack_max = 2097152
# Per-socket non-data
net.core.optmem_max = 131072
# PIDs + FDs
kernel.pid_max = 4194304
fs.file-max = 2097152
# Memory pressure
vm.swappiness = 5
vm.dirty_ratio = 15
vm.dirty_background_ratio = 5
# NUMA / locality
kernel.numa_balancing = 0
vm.zone_reclaim_mode = 0
EOF
# Auto-load nf_conntrack + tls modules at boot.
echo 'nf_conntrack' | sudo tee /etc/modules-load.d/ai-forse.conf
echo 'tls' | sudo tee -a /etc/modules-load.d/ai-forse.conf
# CPU governor — performance forever, not just until next reboot.
sudo tee /etc/systemd/system/cpu-perf.service > /dev/null <<'EOF'
[Unit]
Description=Pin all CPUs to the performance governor
After=multi-user.target
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'for f in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > "$f" 2>/dev/null || true; done'
RemainAfterExit=true
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now cpu-perf.service
sudo sysctl --system
After this, the DaemonSet node-tuning becomes a belt-and-braces idempotent re-applier rather than the only line of defence.
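One extra line confirms the governor unit did its job, now and after the next reboot:
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort -u
# performance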
Rollback¶
sudo rm /etc/sysctl.d/99-ai-forse-webserver.conf
sudo rm /etc/sysctl.d/99-ai-forse-perf.conf
sudo rm /etc/modules-load.d/ai-forse.conf
sudo systemctl disable --now cpu-perf.service
sudo rm /etc/systemd/system/cpu-perf.service
sudo sysctl --system
# NIC ring buffers reset on next reboot, or:
sudo ethtool -G <nic> rx 256 tx 256
# Revert kubelet cpuManagerPolicy: edit /var/lib/kubelet/config.yaml,
# remove cpu_manager_state, restart kubelet.
The container-level sysctls live in the opt-in docker-compose.webserver.linux-tuning.yml override — to revert those, drop the override from your compose invocation (or from COMPOSE_FILE in .env) and restart the stack.
Benchmarking the impact¶
A future PR ships benchmarks/ with h2load recipes for HTTP/2 and HTTP/3, plus a results/ directory with before/after baselines per host class. Run the recipes once on a representative box, archive the JSON, and use it to detect performance regressions.
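Until that PR lands, a minimal h2load invocation sketches the shape those recipes will take (the URL and counts are placeholders; the flags are standard h2load options):
# 100 k requests, 100 concurrent clients, 10 streams per connection:
h2load -n 100000 -c 100 -m 10 https://persona.example.internal/
# HTTP/1.1 baseline on the same box for comparison:
h2load -n 100000 -c 100 --h1 https://persona.example.internal/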
See also¶
- k8s/dut/85-node-tuning.yaml — DaemonSet that runs every sysctl on pod-start (matches this drop-in)
- scripts/host-tuning/99-ai-forse-webserver.conf — the legacy sysctl drop-in (Docker-mode only)
- scripts/tune-ubuntu-host.sh — apply + verify wrapper
- docker-compose.webserver.yml — container-level sysctls + ulimits
- webserver/Caddyfile — application-level performance posture (TLS posture, timeouts, encode, curves)
- quic-go UDP buffer wiki — upstream guidance for the receive buffer fix
- Kubernetes CPU Manager docs — how cpuManagerPolicy: static interacts with QoS classes