
Linux host tuning — REQUIRED for the persona stacks and the agent fleets

Scope status (post-Scope-Freeze 2026-05-10) — See ARCHITECTURE.md for the canonical 37 MÓDULOs + 7 Test Kinds + DOM/CPOS/PIE-PA safety architecture. ADRs 0014, 0019-0025 cover post-Freeze additions.

Target hardware: Ubuntu 22.04 LTS or newer, x86_64 or arm64. Goal: extract maximum HTTP/2 + HTTP/3 throughput from the persona Caddy pool and keep the browser-engine + synthetic-load agent fleets stable at hundreds of replicas. Three layers — container-level (docker-compose.webserver.yml), in-cluster runtime (k8s/dut/85-node-tuning.yaml DaemonSet), and host-level (this doc + the automation script).

Why this matters — for both webservers AND agents

| Layer | Symptoms without tuning |
| --- | --- |
| Synthetic Personas + Cloned Personas (Caddy) | QUIC capped at ~30 Mbps per replica (kernel UDP buffer overflow); TLS handshake p99 jitter (~1 ms extra after every keepalive idle); CPU contention with kubelet/system processes during crypto bursts |
| browser-engine fleet (300 replicas) | DNS resolution stalls (default ephemeral port range exhaustion); TIME_WAIT zombies hold sockets for 16 minutes, blocking new connections |
| synthetic-load fleet (1000 replicas) | Random "TCP connect timeout" errors as the 262 K-default conntrack table overflows under burst-close load |
| Cloner + NFS server | NFS over OOBI starves at default UDP buffers when the cloner writes a large mirror; periodic stalls visible to slot pods |

The same sysctls help all four workload classes. Only cpuManagerPolicy: static is workload-specific — apply on persona-hosting nodes (Guaranteed QoS), skip on agent-hosting nodes (Burstable QoS, HPA-scaled).

Quick start — scripts/host-tuning.sh

The one-shot easy button. Idempotent (re-run safe), supports apply / status / remove.

# Persona-hosting nodes (role=ngfw-dut for synthetic + slots,
# role=infra for Cloner + NFS):
sudo scripts/host-tuning.sh apply --enable-cpu-pinning

# Agent-hosting nodes (role=playwright, role=k6):
sudo scripts/host-tuning.sh apply           # NO --enable-cpu-pinning

# Single-node — every workload on the only host:
sudo scripts/host-tuning.sh apply --enable-cpu-pinning

# Verify on every node (coloured ✓ / △ / ✗ report)
sudo scripts/host-tuning.sh status

# Roll back
sudo scripts/host-tuning.sh remove --enable-cpu-pinning

What it does:

  1. Writes /etc/sysctl.d/99-ai-forse-perf.conf with every value the in-cluster node-tuning DaemonSet writes — making the values survive host reboots before the DaemonSet pod even starts.
  2. Writes /etc/modules-load.d/ai-forse.conf so nf_conntrack + tls load at boot.
  3. Installs cpu-perf.service — a systemd unit that pins every CPU to the performance governor at boot, before any workload schedules.
  4. Sets transparent huge pages to enabled=always with defrag=defer+madvise.
  5. (With --enable-cpu-pinning) patches the kubelet config to enable cpuManagerPolicy: static + topologyManagerPolicy: single-numa-node, clears cpu_manager_state, and restarts the kubelet. Vanilla kubeadm and k3s are auto-detected.

Legacy mode (Docker Compose only — no longer the canonical path):

sudo scripts/tune-ubuntu-host.sh
scripts/stack-up.sh restart-webserver

The legacy script (scripts/tune-ubuntu-host.sh) covers only the webserver Docker stack — host-tuning.sh supersedes it for Kubernetes deploys.

Why each knob matters

UDP buffers — critical for HTTP/3

| Knob | Default Ubuntu | We set | Why |
| --- | --- | --- | --- |
| net.core.rmem_max | 208 KiB | 7.5 MB | Caddy's quic-go asks for a 7 MiB receive buffer. Default = packet drops under load. Boot log without this fix: "failed to sufficiently increase receive buffer size (was: 1024 kiB, wanted: 7168 kiB, got: 2048 kiB)." |
| net.core.wmem_max | 208 KiB | 7.5 MB | Same on the send side |
| net.core.rmem_default / wmem_default | 208 KiB | 2.5 MB | Per-socket starting size; raise it so even short-lived connections benefit |
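The ceilings can be checked and raised by hand before running the full script. A minimal sketch, assuming only that sysctl is on PATH (7 500 000 matches the table above):

```shell
# Current ceilings in bytes (Ubuntu default: 212992 = 208 KiB).
sysctl net.core.rmem_max net.core.wmem_max

# Raise for this boot only; the sysctl.d drop-in later in this doc persists it.
sudo sysctl -w net.core.rmem_max=7500000 net.core.wmem_max=7500000

# quic-go wants a 7 MiB receive buffer; confirm the new ceiling clears it.
want=$((7 * 1024 * 1024))                 # 7340032 bytes
have=$(sysctl -n net.core.rmem_max)
[ "$have" -ge "$want" ] && echo "OK: rmem_max=$have covers the 7 MiB request"
```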

TCP backlog & keep-alive — HTTP/2 and HTTP/1.1

| Knob | Default | We set | Why |
| --- | --- | --- | --- |
| net.core.somaxconn | 4 096 | 65 535 | Listener accept queue. Bursts of new HTTP/2 connections are silently dropped when this fills |
| net.ipv4.tcp_max_syn_backlog | 1 024 | 65 535 | Half-open connection backlog (SYN-RECEIVED state) |
| net.ipv4.tcp_synack_retries | 5 | 3 | How many times to retry the SYN-ACK. 5 retries = a 60 s wait per dead client |
| net.ipv4.tcp_tw_reuse | 0 | 1 | Allow recycling TIME_WAIT sockets for new outbound connections — useful when agents hammer the same webserver |
| net.ipv4.tcp_fin_timeout | 60 | 15 | Time an orphaned connection lingers in FIN-WAIT-2 before teardown |
| net.ipv4.tcp_keepalive_time | 7 200 | 60 | Detect dead peers in 60 s, not 2 h |
| net.ipv4.tcp_keepalive_probes | 9 | 3 | Three probes are enough |
| net.ipv4.tcp_keepalive_intvl | 75 | 10 | 10 s between probes |
| net.ipv4.tcp_fastopen | 1 | 3 | TFO active in both client and server roles |
| net.core.netdev_max_backlog | 1 000 | 16 384 | Per-CPU NIC packet queue. 1 000 saturates at ~50 k pps on a single core |
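The TIME_WAIT pressure these knobs address is easy to observe before and after tuning. A read-only sketch, assuming ss from iproute2 is installed:

```shell
# Count sockets currently parked in TIME_WAIT (header line stripped).
ss -tan state time-wait | tail -n +2 | wc -l

# The kernel cap; once exceeded, the oldest TIME_WAIT entries are destroyed
# and the kernel logs "TCP: time wait bucket table overflow".
sysctl net.ipv4.tcp_max_tw_buckets
```

Run the count under load before applying the sysctls to get a baseline worth comparing against.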

Congestion control — BBR

net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr

BBR (Bottleneck Bandwidth and RTT) is dramatically better than the default cubic for high-throughput long-fat-network scenarios — typical for a webserver serving the public internet. Requires fq qdisc for correct pacing. On Ubuntu 22.04+ both are available out of the box; on older releases run:

sudo modprobe tcp_bbr
echo tcp_bbr | sudo tee /etc/modules-load.d/bbr.conf

The verification step in tune-ubuntu-host.sh warns if BBR didn't activate.
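The same check the wrapper performs can be done by hand, asking the kernel which algorithms it has loaded; a sketch:

```shell
# bbr must appear in this list, otherwise the tcp_bbr module isn't loaded.
sysctl -n net.ipv4.tcp_available_congestion_control

# The pair that has to hold after tuning: bbr + fq.
sysctl -n net.ipv4.tcp_congestion_control
sysctl -n net.core.default_qdisc
```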

File descriptors — system-wide ceiling

| Knob | Default Ubuntu 24.04 | We set | Why |
| --- | --- | --- | --- |
| fs.file-max | ~9.2 M | 26.2 M | At 30 webservers (20 Synthetic + 10 Cloned slots) × 1 048 576 fd/container = ~32 M theoretical. Default is tight |
| fs.nr_open | ~1 M | 26.2 M | Per-process ceiling — without this, the container-level ulimit nofile=1048576 can't actually be granted |
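The arithmetic behind the fs.file-max row can be checked directly on a host; a sketch:

```shell
# allocated, free, and system ceiling (third field mirrors fs.file-max).
cat /proc/sys/fs/file-nr

# Per-process ceiling; the container ulimit of 1048576 must fit under this.
sysctl -n fs.nr_open

# 30 containers at the full ulimit each: ~31.5 M, hence "~32 M theoretical".
echo $((30 * 1048576))    # 31457280
```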

Conntrack — Docker bridge networking

Docker bridge networks (used by ai_forse_oobi and ai_forse_prod) route through the host's iptables, which means every active connection sits in the conntrack table.

| Knob | Default | We set | Why |
| --- | --- | --- | --- |
| net.netfilter.nf_conntrack_max | 65 536 | 1 048 576 | Default fills under 50 k+ active conns → kernel logs nf_conntrack: table full, dropping packet |
| net.netfilter.nf_conntrack_buckets | 16 384 | 262 144 | Hash table buckets. Should be ~max/4 for good distribution |
| net.netfilter.nf_conntrack_tcp_timeout_established | 432 000 (5 d) | 1 200 (20 min) | Default keeps dead connections in the table for 5 days |
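Occupancy can be watched live to see how close a host is to the "table full" condition. A sketch that degrades to zeros when the conntrack module isn't loaded:

```shell
# Live entry count vs ceiling — the drop log fires when count reaches max.
count=$(cat /proc/sys/net/netfilter/nf_conntrack_count 2>/dev/null || echo 0)
max=$(cat /proc/sys/net/netfilter/nf_conntrack_max 2>/dev/null || echo 0)
echo "conntrack entries: $count / $max"

# Buckets at ~max/4, per the table: 1048576 / 4 = 262144.
echo $((1048576 / 4))
```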

NIC ring buffers — tuned per device

The script reads each interface's max ring buffer size via ethtool -g and sets it. On a real NIC this is typically 4 096 RX + 4 096 TX (e.g. Intel ixgbe / Mellanox mlx5). On virtual NICs (QEMU virtio-net, vmxnet3, AWS ENA) the max is whatever the hypervisor exposes — the script falls back to a no-op if rejected.

Higher ring buffers = more headroom under packet bursts before the kernel starts dropping at the NIC level.
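What the script automates can be reproduced manually; a sketch in which eth0 is a placeholder interface name:

```shell
nic=eth0   # placeholder: substitute your real interface

# "Pre-set maximums" (first RX:/TX: block) vs "Current hardware settings".
ethtool -g "$nic"

# Parse the first RX: line (the maximum) and pin the current size to it;
# virtual NICs that reject the write fall through to the no-op.
max_rx=$(ethtool -g "$nic" | awk '/^RX:/{print $2; exit}')
sudo ethtool -G "$nic" rx "$max_rx" tx "$max_rx" 2>/dev/null || true
```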

What we did NOT tune (yet)

These deliver real gains but require choices that depend on your specific hardware, so they're not in the auto-script:

| Topic | Why it's not auto-applied |
| --- | --- |
| IRQ affinity / RPS / RFS | Optimal CPU-to-IRQ mapping depends on core count, NUMA layout, and NIC queues. The wrong choice can hurt instead of help. See /proc/interrupts + manual /proc/irq/N/smp_affinity |
| NUMA pinning (numactl --cpunodebind) | Only useful on multi-socket servers (rare in cloud VMs) |
| Hugepages | Caddy's working set is small; little benefit |
| Disabling Spectre/Meltdown mitigations | Not recommended — the ~5-10 % throughput the mitigations cost isn't worth the security trade-off |
| MTU 9000 (jumbo frames) | Requires every hop in the path to support it. Internal-network only |
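For the IRQ row, a read-only look at the current distribution costs nothing and changes nothing. A sketch in which eth0 and IRQ 42 are placeholders:

```shell
# Per-CPU interrupt counts per IRQ line for the NIC; a heavily skewed row
# means one core is absorbing all of that queue's interrupts.
grep eth0 /proc/interrupts

# Current affinity bitmask for one IRQ number.
cat /proc/irq/42/smp_affinity 2>/dev/null
```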

If you need to push past what this doc gives you, the bottleneck almost certainly moves from kernel network stack to CPU — at which point you should profile Caddy with perf top -p $(pgrep caddy) and consider Layer B (custom Caddy build with GOAMD64=v3).

Container-level tuning (Layer A — opt-in)

The base docker-compose.webserver.yml ships without the kernel sysctls because Docker on macOS / Windows (LinuxKit VM) rejects writes to several /proc/sys/net/core/* files even from the container's own network namespace, with errors like:

OCI runtime create failed: open sysctl net.core.wmem_max:
reopen fd 8: permission denied

To opt in on a native Linux host (Ubuntu, Debian, RHEL, …), layer the tuning override:

WEBSERVER_LINUX_TUNING=1 scripts/stack-up.sh up

Or by hand:

docker compose -p ai_forse_webserver \
  -f docker-compose.webserver.yml \
  -f docker-compose.webserver.linux-tuning.yml \
  up -d

Or persistently in .env:

COMPOSE_FILE=docker-compose.webserver.yml:docker-compose.webserver.linux-tuning.yml

The docker-compose.webserver.linux-tuning.yml override layers in:

ulimits:
  nofile:
    soft: 1048576
    hard: 1048576
sysctls:
  net.core.rmem_max: '7500000'
  net.core.wmem_max: '7500000'
  net.core.rmem_default: '2500000'
  net.core.wmem_default: '2500000'
  net.core.somaxconn: '65535'
  net.ipv4.tcp_fastopen: '3'

These work without root on the host, and without privileged mode in the container — Docker scopes them to the container's network/process namespace.

The host-side tuning script (scripts/tune-ubuntu-host.sh) and the 99-ai-forse-webserver.conf sysctl drop-in do not depend on Layer A — they tune the kernel of the host itself, which benefits every container regardless of opt-in status. Recommended on every Ubuntu production host.

Verification

After applying, the boot warning should no longer appear:

{"level":"info","msg":"failed to sufficiently increase receive buffer size
(was: 1024 kiB, wanted: 7168 kiB, got: 2048 kiB)..."}

Inside a running webserver:

docker exec ai_forse_webserver-webserver-1 sh -c '
  cat /proc/sys/net/core/rmem_max
  cat /proc/sys/net/core/somaxconn
  cat /proc/sys/net/ipv4/tcp_fastopen
'
# Expected:
# 7500000
# 65535
# 3

On the host:

sysctl net.ipv4.tcp_congestion_control net.core.default_qdisc
# net.ipv4.tcp_congestion_control = bbr
# net.core.default_qdisc = fq

CPU pinning — cpuManagerPolicy: static (Kubernetes mode only)

The persona Caddy pods request integer CPU values with request == limit, which gives them Guaranteed QoS — the prerequisite for kubelet's exclusive-core allocator. Once cpuManagerPolicy: static is enabled, every Caddy pod gets cores fully reserved to it and never shares a core with another container. Eliminates the ~10–20 % p99 latency tail caused by context switches during TLS crypto bursts.

Apply on every node hosting personas

# /var/lib/kubelet/config.yaml on each ngfw-dut node
cat <<'EOF' | sudo tee -a /var/lib/kubelet/config.yaml
cpuManagerPolicy: static
topologyManagerPolicy: single-numa-node
reservedSystemCPUs: "0,1"   # cores 0+1 reserved for kubelet + system
EOF

# Static policy state lives at /var/lib/kubelet/cpu_manager_state — must
# be cleared once per node before the new policy takes effect.
sudo rm -f /var/lib/kubelet/cpu_manager_state
sudo systemctl restart kubelet

For k3s the kubelet config flag goes in /etc/rancher/k3s/config.yaml:

kubelet-arg:
  - cpu-manager-policy=static
  - topology-manager-policy=single-numa-node
  - reserved-cpus=0,1

Then sudo rm -f /var/lib/rancher/k3s/agent/kubelet/cpu_manager_state and sudo systemctl restart k3s (or k3s-agent).

Verify

# After the kubelet restart and a fresh persona pod schedule:
kubectl exec -n persona-blog deploy/caddy -- sh -c \
  'cat /sys/fs/cgroup/cpuset.cpus.effective'
# Should print a small contiguous range (e.g. "4-5"), not "0-63".

NUMA topology hint

If your UCS hosts have two NUMA nodes (typical), the topologyManagerPolicy: single-numa-node flag tells the scheduler to keep the pod's CPU + memory + (PCI device) on the same NUMA node. For Caddy this matters because the macvlan NIC subinterface lives on one NUMA node — a pod scheduled across nodes pays a QPI / UPI hop on every packet.
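Whether the hint matters on a given host is quick to establish; a sketch where ens1f0 stands in for the macvlan parent NIC:

```shell
# NUMA node count; "NUMA node(s): 1" makes single-numa-node a no-op.
lscpu | grep -i 'NUMA'

# Which node owns the NIC (-1 = no locality information exposed).
cat /sys/class/net/ens1f0/device/numa_node 2>/dev/null || echo "-1"
```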

Persistent sysctls (across reboots)

The DaemonSet node-tuning (k8s/dut/85-node-tuning.yaml) applies all sysctls + kernel module loads at pod start, but the values revert to kernel defaults on host reboot. To survive reboots without waiting for the DaemonSet to come up, drop the same values into a sysctl.d file:

sudo tee /etc/sysctl.d/99-ai-forse-perf.conf > /dev/null <<'EOF'
# UDP / QUIC
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.core.rmem_default = 26214400
net.core.wmem_default = 26214400
# TCP / HTTP/2
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864
net.ipv4.tcp_congestion_control = bbr
net.core.default_qdisc = fq
# Latency / cwnd / PMTU
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_low_latency = 1
net.ipv4.tcp_mtu_probing = 1
net.core.so_max_pacing_rate = 0
# Connection capacity + TIME_WAIT
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.core.netdev_max_backlog = 65535
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_orphan_retries = 2
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_notsent_lowat = 131072
net.ipv4.ip_local_port_range = 10000 65535
net.ipv4.tcp_max_tw_buckets = 2000000
# conntrack
net.netfilter.nf_conntrack_max = 2097152
# Per-socket non-data
net.core.optmem_max = 131072
# PIDs + FDs
kernel.pid_max = 4194304
fs.file-max = 2097152
# Memory pressure
vm.swappiness = 5
vm.dirty_ratio = 15
vm.dirty_background_ratio = 5
# NUMA / locality
kernel.numa_balancing = 0
vm.zone_reclaim_mode = 0
EOF

# Auto-load nf_conntrack + tls modules at boot.
echo 'nf_conntrack' | sudo tee /etc/modules-load.d/ai-forse.conf
echo 'tls'          | sudo tee -a /etc/modules-load.d/ai-forse.conf

# CPU governor — performance forever, not just until next reboot.
sudo tee /etc/systemd/system/cpu-perf.service > /dev/null <<'EOF'
[Unit]
Description=Pin all CPUs to the performance governor
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'for f in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > "$f" 2>/dev/null || true; done'
RemainAfterExit=true

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now cpu-perf.service

sudo sysctl --system

After this, the DaemonSet node-tuning becomes a belt-and-braces idempotent re-applier rather than the only line of defence.

Rollback

sudo rm /etc/sysctl.d/99-ai-forse-webserver.conf
sudo rm /etc/sysctl.d/99-ai-forse-perf.conf
sudo rm /etc/modules-load.d/ai-forse.conf
sudo systemctl disable --now cpu-perf.service
sudo rm /etc/systemd/system/cpu-perf.service
sudo sysctl --system
# NIC ring buffers reset on next reboot, or:
sudo ethtool -G <nic> rx 256 tx 256
# Revert kubelet cpuManagerPolicy: edit /var/lib/kubelet/config.yaml,
# remove cpu_manager_state, restart kubelet.

The container-level sysctls come in via the linux-tuning override — to revert those, stop layering docker-compose.webserver.linux-tuning.yml (or remove it from COMPOSE_FILE) and restart the stack.

Benchmarking the impact

A future PR ships benchmarks/ with h2load recipes for HTTP/2 and HTTP/3, plus a results/ directory with before/after baselines per host class. Run them once on a representative box, archive the JSON, and use it to detect performance regressions.
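Until that PR lands, a hand-rolled before/after comparison can use h2load directly. A sketch with a placeholder URL and request counts; the HTTP/3 variant needs an h2load built with HTTP/3 support, so only the HTTP/2 run is shown:

```shell
# 100 clients, 10 concurrent streams each, 100k requests over HTTP/2.
h2load -n 100000 -c 100 -m 10 https://persona.example.internal/ | tee before.txt

# The summary lines worth diffing between before/after runs.
grep -E '^(finished|time for request)' before.txt
```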

See also

  • k8s/dut/85-node-tuning.yaml — DaemonSet that runs every sysctl on pod-start (matches this drop-in)
  • scripts/host-tuning/99-ai-forse-webserver.conf — the legacy sysctl drop-in (Docker-mode only)
  • scripts/tune-ubuntu-host.sh — apply + verify wrapper
  • docker-compose.webserver.yml — container-level sysctls + ulimits
  • webserver/Caddyfile — application-level performance posture (TLS posture, timeouts, encode, curves)
  • quic-go UDP buffer wiki — upstream guidance for the receive buffer fix
  • Kubernetes CPU Manager docs — how cpuManagerPolicy: static interacts with QoS classes