Linux host tuning — REQUIRED for the persona stacks and the agent fleets¶
Scope status (post-Scope-Freeze 2026-05-10) — See ARCHITECTURE.md for the canonical 37 MÓDULOs + 7 Test Kinds + DOM/CPOS/PIE-PA safety architecture. ADRs 0014, 0019-0025 cover post-Freeze additions.
Target hardware: Ubuntu 22.04 LTS or newer, x86_64 or arm64. Goal: extract maximum HTTP/2 + HTTP/3 throughput from the persona Caddy pool and keep the browser-engine + synthetic-load agent fleets stable at hundreds of replicas. Tuning spans three layers: container-level (docker-compose.webserver.yml), in-cluster runtime (the k8s/dut/85-node-tuning.yaml DaemonSet), and host-level (this doc + the automation script).
Why this matters — for both webservers AND agents¶
| Layer | Symptoms without tuning |
|---|---|
| Synthetic Personas + Cloned Personas (Caddy) | QUIC capped at ~30 Mbps per replica (kernel UDP buffer overflow), TLS handshake p99 jitter (~1 ms extra after every keepalive idle), CPU contention with kubelet/system processes during crypto bursts |
| browser-engine fleet (300 replicas) | DNS resolution stalls (default ephemeral port range exhaustion), TIME_WAIT zombies hold sockets for a minute each, blocking new connections |
| synthetic-load fleet (1000 replicas) | Random "TCP connect timeout" errors as the 262 K-default conntrack table overflows under burst-close load |
| Cloner + NFS server | NFS over OOBI starves at default UDP buffers when the cloner writes a large mirror; periodic stalls visible to slot pods |
The same sysctls help all four workload classes. Only cpuManagerPolicy: static is workload-specific — apply on persona-hosting nodes (Guaranteed QoS), skip on agent-hosting nodes (Burstable QoS, HPA-scaled).
Quick start — scripts/host-tuning.sh¶
The one-shot easy button. Idempotent (re-run safe), supports apply / status / remove.
# Persona-hosting nodes (role=ngfw-dut for synthetic + slots,
# role=infra for Cloner + NFS):
sudo scripts/host-tuning.sh apply --enable-cpu-pinning
# Agent-hosting nodes (role=playwright, role=k6):
sudo scripts/host-tuning.sh apply # NO --enable-cpu-pinning
# Single-node — every workload on the only host:
sudo scripts/host-tuning.sh apply --enable-cpu-pinning
# Verify on every node (coloured ✓ / △ / ✗ report)
sudo scripts/host-tuning.sh status
# Roll back
sudo scripts/host-tuning.sh remove --enable-cpu-pinning
What it does:
- Writes /etc/sysctl.d/99-ai-forse-perf.conf with every value the in-cluster node-tuning DaemonSet writes — making the values survive host reboots before the DaemonSet pod even starts.
- Writes /etc/modules-load.d/ai-forse.conf so nf_conntrack + tls load at boot.
- Installs cpu-perf.service — a systemd unit that pins every CPU to the performance governor at boot, before any workload schedules.
- Sets transparent huge pages to always with defer+madvise defrag.
- (With --enable-cpu-pinning) patches the kubelet config to enable cpuManagerPolicy: static + topologyManagerPolicy: single-numa-node, clears cpu_manager_state, and restarts the kubelet. Vanilla kubeadm and k3s are auto-detected.
Legacy mode (Docker Compose only — no longer the canonical path):
sudo scripts/tune-ubuntu-host.sh
scripts/stack-up.sh restart-webserver
The legacy script (scripts/tune-ubuntu-host.sh) covers only the webserver Docker stack — host-tuning.sh supersedes it for Kubernetes deploys.
Why each knob matters¶
UDP buffers — critical for HTTP/3¶
| Knob | Default Ubuntu | We set | Why |
|---|---|---|---|
| net.core.rmem_max | 208 KiB | 7.5 MB | Caddy's quic-go asks for a 7 MiB receive buffer. Default = packet drops under load. Boot log without this fix: "failed to sufficiently increase receive buffer size (was: 1024 kiB, wanted: 7168 kiB, got: 2048 kiB)." |
| net.core.wmem_max | 208 KiB | 7.5 MB | Same on the send side |
| net.core.rmem_default / wmem_default | 208 KiB | 2.5 MB | Per-socket starting size; raised so even short-lived connections benefit |
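A quick sanity check once the values are in: these two sysctls are exactly what quic-go probes when it tries to grow its socket buffer (the expected output assumes the 7.5 MB value above).
sysctl net.core.rmem_max net.core.wmem_max
# net.core.rmem_max = 7500000
# net.core.wmem_max = 7500000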
TCP backlog & keep-alive — HTTP/2 and HTTP/1.1¶
| Knob | Default | We set | Why |
|---|---|---|---|
| net.core.somaxconn | 4 096 | 65 535 | Listener accept queue. Bursts of new HTTP/2 connections silently drop when this fills |
| net.ipv4.tcp_max_syn_backlog | 1 024 | 65 535 | Half-open connection backlog (SYN-RECEIVED state) |
| net.ipv4.tcp_synack_retries | 5 | 3 | How many times the kernel retries the SYN-ACK. The default of 5 means ~60 s of waiting per dead client |
| net.ipv4.tcp_tw_reuse | 0 | 1 | Allow reusing TIME_WAIT sockets for new outbound connections — useful when agents hammer the same webserver |
| net.ipv4.tcp_fin_timeout | 60 | 15 | Faster teardown of orphaned FIN-WAIT-2 sockets |
| net.ipv4.tcp_keepalive_time | 7 200 | 60 | Detect dead peers in 60 s, not 2 h |
| net.ipv4.tcp_keepalive_probes | 9 | 3 | Three probes is enough |
| net.ipv4.tcp_keepalive_intvl | 75 | 10 | 10 s between probes |
| net.ipv4.tcp_fastopen | 1 | 3 | TFO active for both client and server roles |
| net.core.netdev_max_backlog | 1 000 | 16 384 | Per-CPU NIC packet queue. 1 000 saturates at ~50 k pps on a single core |
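If you suspect the accept queue is still overflowing, the kernel's cumulative counters will say so; a check with standard tooling (counter wording varies slightly by kernel version):
# Listen-queue overflows and dropped SYNs since boot:
netstat -s | grep -i listen
# Per-socket view: on LISTEN sockets, ss shows the current backlog
# (Recv-Q) against the accept-queue limit (Send-Q):
ss -ltn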
Congestion control — BBR¶
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
BBR (Bottleneck Bandwidth and RTT) is dramatically better than the default cubic for high-throughput long-fat-network scenarios — typical for a webserver serving the public internet. Pair it with the fq qdisc for pacing (kernels before 4.13 required it; newer kernels can pace internally, but fq remains the recommended companion). On Ubuntu 22.04+ both are available out of the box; on older releases run:
sudo modprobe tcp_bbr
echo tcp_bbr | sudo tee -a /etc/modules-load.d/bbr.conf
The verification step in tune-ubuntu-host.sh warns if BBR didn't activate.
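A manual check of what the running kernel offers (bbr only shows up once the module is loaded):
sysctl net.ipv4.tcp_available_congestion_control
# net.ipv4.tcp_available_congestion_control = reno cubic bbr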
File descriptors — system-wide ceiling¶
| Knob | Default Ubuntu 24.04 | We set | Why |
|---|---|---|---|
| fs.file-max | ~9.2 M | 26.2 M | At 30 webservers (20 Synthetic + 10 Cloned slots) × 1 048 576 fd/container = ~32 M theoretical. Default is tight |
| fs.nr_open | ~1 M | 26.2 M | Per-process ceiling — without this, the container-level ulimit nofile=1048576 can't actually be granted |
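To watch actual descriptor pressure against that ceiling, the kernel exposes a three-field counter (allocated, unused-but-allocated, max):
cat /proc/sys/fs/file-nr
# 18464   0   26214400    <- example output on a tuned host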
Conntrack — Docker bridge networking¶
Docker bridge networks (used by ai_forse_oobi and ai_forse_prod) route through the host's iptables, which means every active connection sits in the conntrack table.
| Knob | Default | We set | Why |
|---|---|---|---|
| net.netfilter.nf_conntrack_max | 65 536 | 1 048 576 | Default fills under 50 k+ active conns → kernel logs nf_conntrack: table full, dropping packet |
| net.netfilter.nf_conntrack_buckets | 16 384 | 262 144 | Hash-table buckets. Should be ~max/4 for good distribution |
| net.netfilter.nf_conntrack_tcp_timeout_established | 432 000 (5 d) | 1 200 (20 min) | Default keeps dead connections in the table for 5 days |
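Live occupancy versus the ceiling is a one-line check; worth alerting on well before count approaches max:
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max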
NIC ring buffers — tuned per device¶
The script reads each interface's max ring buffer size via ethtool -g and sets it. On a real NIC this is typically 4 096 RX + 4 096 TX (e.g. Intel ixgbe / Mellanox mlx5). On virtual NICs (QEMU virtio-net, vmxnet3, AWS ENA) the max is whatever the hypervisor exposes — the script falls back to a no-op if rejected.
Higher ring buffers = more headroom under packet bursts before the kernel starts dropping at the NIC level.
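The manual equivalent of what the script does, for a single interface (eth0 is a placeholder; substitute your uplink):
# Show the hardware maximums alongside the current settings:
ethtool -g eth0
# Raise to a typical physical-NIC maximum; a virtual NIC may reject
# this, which is harmless (the script treats a rejection as a no-op):
sudo ethtool -G eth0 rx 4096 tx 4096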
What we did NOT tune (yet)¶
These deliver real gains but require choices that depend on your specific hardware, so they're not in the auto-script:
| Topic | Why it's not auto-applied |
|---|---|
| IRQ affinity / RPS / RFS | Optimal CPU-to-IRQ mapping depends on core count, NUMA layout, and NIC queues. A wrong choice can hurt instead of help. See /proc/interrupts + manual /proc/irq/N/smp_affinity |
| NUMA pinning (numactl --cpunodebind) | Only useful on multi-socket servers (rare in cloud VMs) |
| Hugepages | Caddy's working set is small; little benefit |
| Disabling Spectre/Meltdown mitigations | Not recommended — recovering the usual 5-10 % mitigation overhead isn't worth the security trade-off |
| MTU 9000 (jumbo frames) | Requires every hop in the path to support it. Internal networks only |
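If you do decide to explore IRQ affinity by hand, look before touching anything (again, eth0 is a placeholder):
# Which cores currently service the NIC's queues:
grep eth0 /proc/interrupts
# Affinity mask for one IRQ (N comes from the first column above):
cat /proc/irq/N/smp_affinity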
If you need to push past what this doc gives you, the bottleneck almost certainly moves from kernel network stack to CPU — at which point you should profile Caddy with perf top -p $(pgrep caddy) and consider Layer B (custom Caddy build with GOAMD64=v3).
Container-level tuning (Layer A — opt-in)¶
The base docker-compose.webserver.yml ships without the kernel sysctls because Docker on macOS / Windows (LinuxKit VM) rejects writes to several /proc/sys/net/core/* files even from the container's own network namespace, with errors like:
OCI runtime create failed: open sysctl net.core.wmem_max:
reopen fd 8: permission denied
To opt in on a native Linux host (Ubuntu, Debian, RHEL, …), layer the tuning override:
WEBSERVER_LINUX_TUNING=1 scripts/stack-up.sh up
Or by hand:
docker compose -p ai_forse_webserver \
-f docker-compose.webserver.yml \
-f docker-compose.webserver.linux-tuning.yml \
up -d
Or persistently in .env:
COMPOSE_FILE=docker-compose.webserver.yml:docker-compose.webserver.linux-tuning.yml
The docker-compose.webserver.linux-tuning.yml override layers in:
ulimits:
nofile:
soft: 1048576
hard: 1048576
sysctls:
net.core.rmem_max: '7500000'
net.core.wmem_max: '7500000'
net.core.rmem_default: '2500000'
net.core.wmem_default: '2500000'
net.core.somaxconn: '65535'
net.ipv4.tcp_fastopen: '3'
These work without root on the host, and without privileged mode in the container — Docker scopes them to the container's network/process namespace.
The host-side tuning script (scripts/tune-ubuntu-host.sh) and the 99-ai-forse-webserver.conf sysctl drop-in do not depend on Layer A — they tune the kernel of the host itself, which benefits every container regardless of opt-in status. Recommended on every Ubuntu production host.
Verification¶
After applying, the quic-go boot warning below should no longer appear:
{"level":"info","msg":"failed to sufficiently increase receive buffer size
(was: 1024 kiB, wanted: 7168 kiB, got: 2048 kiB)..."}
Inside a running webserver:
docker exec ai_forse_webserver-webserver-1 sh -c '
cat /proc/sys/net/core/rmem_max
cat /proc/sys/net/core/somaxconn
cat /proc/sys/net/ipv4/tcp_fastopen
'
# Expected:
# 7500000
# 65535
# 3
On the host:
sysctl net.ipv4.tcp_congestion_control net.core.default_qdisc
# net.ipv4.tcp_congestion_control = bbr
# net.core.default_qdisc = fq
CPU pinning — cpuManagerPolicy: static (Kubernetes mode only)¶
The persona Caddy pods request integer CPU values with request == limit, which gives them Guaranteed QoS — the prerequisite for kubelet's exclusive-core allocator. Once cpuManagerPolicy: static is enabled, every Caddy pod gets cores fully reserved to it and never shares a core with another container. Eliminates the ~10–20 % p99 latency tail caused by context switches during TLS crypto bursts.
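Before flipping the policy it's worth confirming the pods really land in the Guaranteed class, since any non-integer or mismatched request/limit silently falls back to shared cores. A plain-kubectl check (the label selector is illustrative):
kubectl get pod -n persona-blog -l app=caddy -o \
  jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.qosClass}{"\n"}{end}'
# every line should end in "Guaranteed"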
Apply on every node hosting personas¶
# /var/lib/kubelet/config.yaml on each ngfw-dut node
cat <<'EOF' | sudo tee -a /var/lib/kubelet/config.yaml
cpuManagerPolicy: static
topologyManagerPolicy: single-numa-node
reservedSystemCPUs: "0,1" # cores 0+1 reserved for kubelet + system
EOF
# Static policy state lives at /var/lib/kubelet/cpu_manager_state — must
# be cleared once per node before the new policy takes effect.
sudo rm -f /var/lib/kubelet/cpu_manager_state
sudo systemctl restart kubelet
For k3s the kubelet config flag goes in /etc/rancher/k3s/config.yaml:
kubelet-arg:
- cpu-manager-policy=static
- topology-manager-policy=single-numa-node
- reserved-cpus=0,1
Then sudo rm -f /var/lib/rancher/k3s/agent/kubelet/cpu_manager_state and sudo systemctl restart k3s (or k3s-agent).
Verify¶
# After the kubelet restart and a fresh persona pod schedule:
kubectl exec -n persona-blog deploy/caddy -- sh -c \
'cat /sys/fs/cgroup/cpuset.cpus.effective'
# Should print a small contiguous range (e.g. "4-5"), not "0-63".
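The kubelet's own record of exclusive assignments is a second checkpoint; the state file should name the static policy and carve the pod's cores out of the default set:
sudo cat /var/lib/kubelet/cpu_manager_state
# {"policyName":"static","defaultCpuSet":"0-3,6-63","entries":{...},...}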
NUMA topology hint¶
If your UCS hosts have two NUMA nodes (typical), the topologyManagerPolicy: single-numa-node setting tells the kubelet to keep the pod's CPU + memory + (PCI device) allocations on the same NUMA node. For Caddy this matters because the macvlan NIC subinterface lives on one NUMA node — a pod scheduled across NUMA nodes pays a QPI / UPI hop on every packet.
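To check whether that hop is in play at all, inspect the topology and which node the parent NIC reports (eth0 is a placeholder):
lscpu | grep -i numa
# NUMA node(s):       2
# NUMA node0 CPU(s):  0-31
# NUMA node1 CPU(s):  32-63
cat /sys/class/net/eth0/device/numa_node
# 0    (-1 means the device reports no NUMA locality)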
Persistent sysctls (across reboots)¶
The node-tuning DaemonSet (k8s/dut/85-node-tuning.yaml) applies all sysctls + kernel-module loads at pod start, but the values revert to kernel defaults on host reboot. To survive reboots without waiting for the DaemonSet to come up, drop the same values into a sysctl.d file:
sudo tee /etc/sysctl.d/99-ai-forse-perf.conf > /dev/null <<'EOF'
# UDP / QUIC
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.core.rmem_default = 26214400
net.core.wmem_default = 26214400
# TCP / HTTP/2
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864
net.ipv4.tcp_congestion_control = bbr
net.core.default_qdisc = fq
# Latency / cwnd / PMTU
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_low_latency = 1
net.ipv4.tcp_mtu_probing = 1
net.core.so_max_pacing_rate = 0
# Connection capacity + TIME_WAIT
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.core.netdev_max_backlog = 65535
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_orphan_retries = 2
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_notsent_lowat = 131072
net.ipv4.ip_local_port_range = 10000 65535
net.ipv4.tcp_max_tw_buckets = 2000000
# conntrack
net.netfilter.nf_conntrack_max = 2097152
# Per-socket non-data
net.core.optmem_max = 131072
# PIDs + FDs
kernel.pid_max = 4194304
fs.file-max = 2097152
# Memory pressure
vm.swappiness = 5
vm.dirty_ratio = 15
vm.dirty_background_ratio = 5
# NUMA / locality
kernel.numa_balancing = 0
vm.zone_reclaim_mode = 0
EOF
# Auto-load nf_conntrack + tls modules at boot.
echo 'nf_conntrack' | sudo tee /etc/modules-load.d/ai-forse.conf
echo 'tls' | sudo tee -a /etc/modules-load.d/ai-forse.conf
# CPU governor — performance forever, not just until next reboot.
sudo tee /etc/systemd/system/cpu-perf.service > /dev/null <<'EOF'
[Unit]
Description=Pin all CPUs to the performance governor
After=multi-user.target
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'for f in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > "$f" 2>/dev/null || true; done'
RemainAfterExit=true
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now cpu-perf.service
sudo sysctl --system
After this, the DaemonSet node-tuning becomes a belt-and-braces idempotent re-applier rather than the only line of defence.
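One extra line confirms the governor unit did its job, now and after the next reboot:
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort -u
# performance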
Rollback¶
sudo rm /etc/sysctl.d/99-ai-forse-webserver.conf
sudo rm /etc/sysctl.d/99-ai-forse-perf.conf
sudo rm /etc/modules-load.d/ai-forse.conf
sudo systemctl disable --now cpu-perf.service
sudo rm /etc/systemd/system/cpu-perf.service
sudo sysctl --system
# NIC ring buffers reset on next reboot, or:
sudo ethtool -G <nic> rx 256 tx 256
# Revert kubelet cpuManagerPolicy: edit /var/lib/kubelet/config.yaml,
# remove cpu_manager_state, restart kubelet.
The container-level sysctls live in the opt-in docker-compose.webserver.linux-tuning.yml override — to revert those, drop the override from your compose invocation (or from COMPOSE_FILE in .env) and restart the stack.
Benchmarking the impact¶
A future PR ships benchmarks/ with h2load recipes for HTTP/2 and HTTP/3, plus a results/ directory with before/after baselines per host class. Run the recipes once on a representative box, archive the JSON, and use it to detect performance regressions.
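Until that PR lands, a minimal h2load invocation sketches the shape those recipes will take (the URL and counts are placeholders; the flags are standard h2load options):
# 100 k requests, 100 concurrent clients, 10 streams per connection:
h2load -n 100000 -c 100 -m 10 https://persona.example.internal/
# HTTP/1.1 baseline on the same box for comparison:
h2load -n 100000 -c 100 --h1 https://persona.example.internal/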
See also¶
- k8s/dut/85-node-tuning.yaml — DaemonSet that runs every sysctl on pod-start (matches this drop-in)
- scripts/host-tuning/99-ai-forse-webserver.conf — the legacy sysctl drop-in (Docker-mode only)
- scripts/tune-ubuntu-host.sh — apply + verify wrapper
- docker-compose.webserver.yml — container-level sysctls + ulimits
- webserver/Caddyfile — application-level performance posture (TLS posture, timeouts, encode, curves)
- quic-go UDP buffer wiki — upstream guidance for the receive buffer fix
- Kubernetes CPU Manager docs — how cpuManagerPolicy: static interacts with QoS classes