Split-stack topology (6-stack model)¶
TL;DR — Six independent Compose projects on one host (or spread across hosts), with traffic-segmented networks: the control plane talks over an --internal Docker bridge that has no internet access at all, while the agent fleets have a second NIC dedicated to outbound HTTPS against the target sites under test. Any stack restarts independently of the others. Same UX as the all-in-one, with the operational + security properties of the Kubernetes layout.
Scope status (post-Scope-Freeze 2026-05-10) — See ARCHITECTURE.md for the canonical 37 MÓDULOs + 7 Test Kinds + DOM/CPOS/PIE-PA safety architecture. ADRs 0014, 0019-0025 cover post-Freeze additions.
Why split?¶
| | All-in-one (docker-compose.yml) | Split (docker-compose.{control,fleet}.yml) |
|---|---|---|
| docker compose down | tears down DB, dashboard AND every agent | tears down only the half you target |
| Update dashboard | recreates DB + agents (cascade) | recreates only dashboard; agents keep running |
| Postgres OOM | drags down dashboard + agents | isolated to the control host |
| Per-host RAM allocation | one box for everything | dedicate one box to control, scale agents on others |
| Docker socket exposure | dashboard mounts host socket → 1 surface | only the control host needs the socket |
| Agent restart loop | risk of cascading via depends_on | fleet is a leaf; restarts don't bubble up |
| Network blast radius | one shared bridge → DB sees the same NAT/internet path as agents | OOBI is --internal; postgres + dashboard literally cannot reach the public internet |
| Migration to Kubernetes | conceptual gap | drop-in mapping (control = control plane Deployment + NetworkPolicy, fleet = agent Deployment + NetworkPolicy) |
Network segmentation (since v1.6.2 + v3.0.0 synthetic-load engine)¶
ai_forse_oobi (--internal — no NAT, no internet)
┌──────────────────────────────────────────────────────────────────────────┐
│ postgres ←── dashboard ──→ agent-1..N ──→ webserver-1..M k6agent-1..K │
│ (no internet) (no internet, +Docker.sock) (ctrl+scrape) (ctrl only) │
└──────────────────────────────────────────────────────────────────────────┘
│
│ outbound / inbound (NAT)
▼
ai_forse_prod (regular bridge, NAT)
┌─────────────────────────────────────────────┐
│ agent-1..N → internet (target sites) │
│ agent-1..N → webserver-1..M │
│ k6agent-1..K → internet (target sites) │
│ external client → webserver-1..M │
│ (HTTP/2, HTTP/3 / QUIC) │
└─────────────────────────────────────────────┘
postgres-1 ── attached to: [ai_forse_oobi] (NEVER ai_forse_prod)
dashboard-1 ── attached to: [ai_forse_oobi] (NEVER ai_forse_prod)
agent-N ── attached to: [ai_forse_oobi (priority 100),
ai_forse_prod (priority 1000) — default GW]
webserver-N ── attached to: [ai_forse_oobi (priority 100, /metrics, /healthz),
ai_forse_prod (priority 1000, :443 TCP + UDP)]
k6agent-K ── attached to: [ai_forse_oobi (priority 100),
ai_forse_prod (priority 1000) — default GW]
What this buys you:
- Defense in depth — a compromised agent (RCE via crafted page) cannot pivot from ai_forse_prod into ai_forse_oobi without the auth token.
- No exfiltration from the control plane — postgres + dashboard have no default gateway. Even if compromised, they cannot ship data to an attacker-controlled host. They can only see (a) other containers on ai_forse_oobi and (b) the host's Docker socket (dashboard only, for the autoscaler).
- Clean separation of test traffic from management traffic — easier WAF / firewall / QoS rules. Carrier-grade OOBI-vs-production posture.
- Predictable bandwidth accounting — sum the bytes on ai_forse_prod to know how much real-world test traffic the fleet generated; sum the bytes on ai_forse_oobi to know your own management overhead.
The Compose priority attribute on the agent's two-network attachment (priority 100 on OOBI, 1000 on PROD) makes the prod NIC the default route. Belt + suspenders: even if priorities were equal, OOBI is --internal so Docker would refuse to make it the default gateway anyway.
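As a reference, the two-network attachment can be sketched in Compose YAML like this (service and network names follow this guide; the repo's actual compose files may differ):

```yaml
# Sketch: agent service attached to both segments; the higher-priority
# network (ai_forse_prod) supplies the default gateway.
services:
  agent:
    networks:
      ai_forse_oobi:
        priority: 100     # control plane: register, heartbeat, scrape
      ai_forse_prod:
        priority: 1000    # data plane: default route to target sites
networks:
  ai_forse_oobi:
    external: true        # created once with `docker network create --internal`
  ai_forse_prod:
    external: true
```

Declaring both networks `external: true` matches the one-time `docker network create` setup below, so Compose attaches to the pre-segmented bridges instead of creating project-scoped ones.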
What stays the same¶
- Same images (ghcr.io/nollagluiz/web-agent-{agent,dashboard})
- Same config, same secrets, same .env
- Same database schema and migrations (auto-run by the dashboard at boot)
- Same scaler logic — the dashboard mounts the host's Docker socket and scales agents in the other Compose project by label match
- Same /var/run/docker.sock permission model — only the control host needs DOCKER_GID set; the fleet host(s) don't see the socket at all
- All-in-one (docker-compose.yml) keeps working unchanged for dev / laptop / smoke tests
Files¶
# ── Standard stacks (Docker bridge / NAT) ──────────────────────────────────
docker-compose.yml # all-in-one (dev, laptop, smoke tests)
docker-compose.control.yml # split stack 1 — postgres + dashboard
docker-compose.fleet.yml # split stack 2 — browser-engine agents (1..300)
docker-compose.webserver.yml # split stack 3 — Caddy webservers (HTTP/2 + HTTP/3)
docker-compose.k6-fleet.yml # split stack 4 — synthetic-load engine load-test agents (1..1,000)
docker-compose.observability.yml # split stack 5 — Prometheus + Grafana + Loki
docker-compose.cloner.yml # split stack 6 — Public Website Cloner (dual NIC: OOBI + ISP)
# ── DUT test-bed stacks (802.1q macvlan — physical NGFW in the path) ───────
docker-compose.dut.playwright.yml # DUT stack A — browser engine on VLAN 20 macvlan
docker-compose.dut.k6.yml # DUT stack B — synthetic-load agents on VLAN 30 macvlan
docker-compose.dut.observability.yml # DUT overlay — adds snmp_exporter, overrides Prometheus
# ── DUT support scripts / config ───────────────────────────────────────────
scripts/netsetup-dut.sh # 802.1q subinterfaces + macvlan networks setup
observability/snmp/snmp.yml # SNMP module defs: 10 vendor modules
observability/prometheus/prometheus.dut.yml # Prometheus config for DUT mode
observability/prometheus/targets/ubuntu-hosts.yml # node_exporter hosts (file SD)
observability/grafana/dashboards/nexus9000.json # Nexus 9000 switch dashboard
observability/grafana/dashboards/ngfw-dut.json # NGFW DUT dashboard
observability/grafana/dashboards/ubuntu-hosts.json # Ubuntu / Cisco UCS host dashboard
# ── Overrides (apply to both standard and DUT stacks) ──────────────────────
docker-compose.ghcr.yml # pins image tags (GHCR)
docker-compose.no-scaler.yml # opts out of the local autoscaler
# ── Convenience wrapper ─────────────────────────────────────────────────────
scripts/stack-up.sh # verbs: up / down / scale / restart-* /
# up-dut / down-dut / destroy-dut / ps-dut
One-time setup¶
# Two segmented networks. The OOBI one is --internal so containers on it
# cannot reach the public internet (this is what postgres + dashboard
# attach to, exclusively).
docker network create --driver bridge --internal ai_forse_oobi
docker network create --driver bridge ai_forse_prod
# Or — easier — let the helper script create both with the right flags
# automatically on first `up`:
scripts/stack-up.sh up
Add to your .env (in addition to the existing variables):
# Tell the dashboard's autoscaler which Compose project owns the agent
# containers. Mismatch produces a clear error in /api/admin/scale/reconcile.
FLEET_PROJECT_NAME=ai_forse_fleet
# OPTIONAL — override the data-plane network's CIDR if Docker's default
# (172.x range) collides with a corporate VPN range. Only consumed by
# `scripts/stack-up.sh` when it creates the ai_forse_prod network for
# the first time. Has no effect once the network already exists.
# PROD_SUBNET=10.250.0.0/16
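The network-creation step can be sketched in plain shell. This is a hedged approximation of what a helper like scripts/stack-up.sh does on first `up`, not its actual source; the ensure_network name is invented here, and `echo` is used so the logic can be inspected without Docker (drop the `echo` for a real run):

```shell
# Create each segment only if it does not exist yet (idempotent, so it is
# safe to run on every `up`). Prints the command instead of executing it.
ensure_network() {
  local name="$1"; shift
  echo docker network create --driver bridge "$@" "$name"
}

ensure_network ai_forse_oobi --internal
# PROD_SUBNET (optional, from .env) only matters at creation time.
ensure_network ai_forse_prod ${PROD_SUBNET:+--subnet "$PROD_SUBNET"}
```

A real helper would first test `docker network inspect "$name"` and skip creation when the network already exists, which is why PROD_SUBNET has no effect after first creation.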
Migration from v1.6.1's single network¶
v1.6.1 used a single ai_forse_bridge network. v1.6.2+ replaces it
with the two segmented networks above. To migrate cleanly:
# 1) Stop the stack (keeps the postgres volume)
scripts/stack-up.sh down
# 2) Drop the legacy network (only safe if nothing else on this host
# is attached to it)
docker network rm ai_forse_bridge
# 3) Bring everything back up — the helper creates the two new networks
scripts/stack-up.sh up
The helper warns you on up if the legacy network is still present.
Bring up the stack¶
# Easiest — uses the helper script, picks up the FLEET_PROJECT_NAME
# automatically and creates the network if missing
scripts/stack-up.sh up
# Or by hand
docker compose -p ai_forse_control -f docker-compose.control.yml up -d
docker compose -p ai_forse_fleet -f docker-compose.fleet.yml up -d
After ~30 s:
scripts/stack-up.sh ps
# === control ===
# NAME STATUS PORTS
# ai_forse_control-postgres-1 Up 30s (healthy) 127.0.0.1:5432->5432/tcp
# ai_forse_control-dashboard-1 Up 25s (healthy) 127.0.0.1:3000->3000/tcp
#
# === fleet ===
# NAME STATUS
# ai_forse_fleet-agent-1 Up 20s (healthy)
Day-to-day operations¶
# Bounce the dashboard — agents stay up. PR #56 retry-on-register
# absorbs the brief 503 window cleanly; you'll see a few "register
# attempt failed" log lines, then "registered with controller (recover)"
# inside ~10s.
scripts/stack-up.sh restart-control
# Bounce the agents — dashboard + DB stay up.
scripts/stack-up.sh restart-fleet
# Scale the fleet manually
scripts/stack-up.sh scale 25
# Or use the slider in /agents — works identically (autoscaler reads
# the Docker socket and finds containers by their compose label).
# Tail logs from both stacks at once
scripts/stack-up.sh logs
# Stop everything (preserves volumes)
scripts/stack-up.sh down
# Wipe everything — including the Postgres volume!
scripts/stack-up.sh destroy
Migrating from the all-in-one¶
You can switch without losing data. The Postgres volume is named
ai_forse_postgres_data in the all-in-one and
ai_forse_control_postgres_data in the split. Either:
Option A — fresh start (recommended for non-prod)¶
docker compose down # stop the all-in-one
scripts/stack-up.sh up # bring up the split (new empty DB)
# Open /setup, recreate admin user + targets, move slider.
Option B — preserve the database¶
# 1) Dump the all-in-one DB
docker compose exec -T postgres pg_dump -U agent_dashboard agent_dashboard \
> /tmp/agent_dashboard.sql
# 2) Stop the all-in-one
docker compose down
# 3) Bring up the split (control plane only, with empty DB)
docker network create --driver bridge --internal ai_forse_oobi
docker network create --driver bridge ai_forse_prod
docker compose -p ai_forse_control -f docker-compose.control.yml up -d postgres
# 4) Wait for Postgres to be healthy
docker compose -p ai_forse_control exec -T postgres pg_isready
# 5) Restore
docker compose -p ai_forse_control exec -T postgres psql -U agent_dashboard \
agent_dashboard < /tmp/agent_dashboard.sql
# 6) Bring up dashboard + fleet
scripts/stack-up.sh up
The dashboard's auto-migration runner (PR #33) is idempotent, so even if the dump came from an older minor it'll catch the schema up.
Multi-host deployment¶
The split topology runs naturally across two physical machines:
┌─ Host A (control) ────────────┐ ┌─ Host B / C / D (fleet) ──────────┐
│ │ │ │
│ docker-compose.control.yml │◄──►│ docker-compose.fleet.yml │
│ ai_forse_oobi (local) │ │ ai_forse_oobi (local, --internal)│
│ │ │ ai_forse_prod (local, NAT) │
│ │ │ CONTROLLER_URL= │
│ │ │ http://host-a.internal:3000 │
└───────────────────────────────┘ └────────────────────────────────────┘
Note on multi-host networks: Docker bridge networks are host-local. In multi-host deployments each host creates its own ai_forse_oobi and ai_forse_prod, and inter-host traffic flows over the host's physical NIC (the agent's CONTROLLER_URL points at a real DNS name / IP). For seamless cross-host service discovery use Docker Swarm overlay networks or, better, Kubernetes — see k8s/.
On host A (control):
docker network create --driver bridge --internal ai_forse_oobi
docker network create --driver bridge ai_forse_prod
docker compose -p ai_forse_control -f docker-compose.control.yml up -d
On hosts B/C/D (fleet), with .env containing
CONTROLLER_URL=http://host-a.internal:3000:
docker network create --driver bridge --internal ai_forse_oobi
docker network create --driver bridge ai_forse_prod
docker compose -p ai_forse_fleet -f docker-compose.fleet.yml up -d \
--scale agent=20
Multi-host caveat: the dashboard's local autoscaler talks to the Docker socket of host A, so it can only scale containers running on host A. For multi-host fleets you have three options:
- Disable the autoscaler (use docker-compose.no-scaler.yml on the control side) and scale each host manually with --scale agent=N per host.
- Use Kubernetes — it's literally the same topology, and the K8s HPA scales across nodes for free. See k8s/.
- Run a Docker Swarm — same Compose files, just docker stack deploy. The dashboard's scaler doesn't know about Swarm, so item 1 still applies.
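The manual per-host option is easy to script. A hedged sketch (host names, scale count, and the RUN dry-run switch are all illustrative, not part of the repo):

```shell
# Fan a manual --scale out to each fleet host over SSH.
# RUN=echo (the default here) prints the commands instead of running them;
# set RUN="" to execute for real.
RUN="${RUN-echo}"
scale_fleet_host() {
  local host="$1" n="$2"
  $RUN ssh "$host" docker compose -p ai_forse_fleet \
    -f docker-compose.fleet.yml up -d --scale "agent=$n"
}

for h in host-b.internal host-c.internal host-d.internal; do
  scale_fleet_host "$h" 20
done
```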
Resource budgeting¶
Suggested baselines per topology size:
| Fleet size | Control host | Per-fleet host |
|---|---|---|
| 1–10 agents | 1 vCPU, 2 GiB | n/a (same host as control) |
| 10–30 agents | 1 vCPU, 2 GiB | 4 vCPU, 8 GiB |
| 30–100 agents | 2 vCPU, 4 GiB (shared_buffers=1GB) | 16 vCPU, 32 GiB |
| 100+ agents | 4 vCPU, 8 GiB (shared_buffers=2GB) + dedicated agent hosts | 32 vCPU, 64 GiB each, 1 agent ≈ 250 MiB RSS |
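A quick sanity check of the last row's arithmetic (numbers from the table; the 8 GiB OS/Docker reserve is an assumption, not a repo setting):

```shell
# How many ~250 MiB-RSS agents fit on a 64 GiB fleet host, leaving
# headroom for the OS and Docker itself.
host_gib=64
reserve_gib=8          # assumed headroom for OS + Docker
per_agent_mib=250      # typical agent RSS from the table above
agents=$(( (host_gib - reserve_gib) * 1024 / per_agent_mib ))
echo "$agents"         # prints 229
```

That 229 is a memory-only ceiling; the 32 vCPU guidance and the 1 GiB hard cap below are what keep worst-case behavior bounded.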
Per-agent memory ceiling is enforced by mem_limit: 1g (1 GiB cgroup
hard-cap) and pids_limit: 500 (Chromium runaway prevention).
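In Compose terms those caps are a two-liner on the agent service; a sketch consistent with the limits above, not the verbatim file:

```yaml
services:
  agent:
    mem_limit: 1g     # hard cgroup memory cap — OOM-kills a runaway agent
    pids_limit: 500   # bounds Chromium's process/thread fan-out
```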
The webserver stack (since v1.7.0)¶
The ai_forse_webserver Compose project deploys a pool of up to
20 Caddy 2.8 containers serving HTTP/2 (TCP) and HTTP/3 (QUIC over
UDP) at maximum throughput. Same network model as the agent fleet:
two NICs per container, with the OOBI side carrying the dashboard's
/metrics scrape and the PROD side carrying real client traffic.
What each webserver serves:
| Endpoint | What you get | Why |
|---|---|---|
| / | Static landing page (4 KB HTML + 1 CSS + 1 JS) | Smoke test |
| /static/small | 4 KB binary | Tiny payload baseline |
| /static/medium | 100 KB binary | Mid payload — header-compression test |
| /static/large | 1 MB binary | Large payload — multiplexing test |
| /api/echo | JSON with protocol + request introspection | "What HTTP version did I just talk?" |
| /healthz | 200 OK | Docker HEALTHCHECK |
| :9091/metrics | Prometheus format | Dashboard scrape (OOBI only) |
Performance posture of the Caddyfile:
- HTTP/3 enabled by default via protocols h1 h2 h3
- Alt-Svc header advertises h3=":443" so capable clients upgrade
- Brotli + zstd + gzip compression with a 1 KB minimum
- Caddy serves static files via sendfile/splice zero-copy
- Idle timeout 2 min, read-header timeout 10 s
- TLS via Caddy's internal CA (self-signed). Switch to ACME by removing local_certs and adding email … in the global block.
- Metrics endpoint isolated on :9091 so the high-cardinality scrape doesn't mix with client traffic on :443
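Collected into one place, the global options implied by that list would look roughly like this Caddyfile fragment (a hedged sketch; the repo's Caddyfile may differ in layout and extras):

```caddyfile
{
	local_certs                  # internal CA; replace with `email you@example.com` for ACME
	servers {
		protocols h1 h2 h3   # h3 enables QUIC; Caddy advertises Alt-Svc automatically
		timeouts {
			read_header 10s
			idle 2m
		}
	}
}
```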
Scaling:
# Bring up at default scale (1)
scripts/stack-up.sh up
# Scale to 12 webservers
scripts/stack-up.sh scale-webserver 12
# Or via the dashboard slider in /webservers (added in PR #C)
Why the data plane port isn't published to the host¶
At scale (up to 20 replicas) Compose can't bind 20 containers to
host port 443. Other containers on ai_forse_prod (the agents)
reach each webserver by its container DNS name, e.g.
https://ai_forse_webserver-webserver-3:443/. For browser smoke
testing from the host, override one replica to publish a port:
docker compose -p ai_forse_webserver -f docker-compose.webserver.yml \
  run --rm -p 8443:443 webserver
The synthetic-load fleet stack (since v3.0.0)¶
The ai_forse_k6 Compose project deploys a pool of up to 1,000 synthetic-load engine
load-test agents. Each agent is a lightweight TypeScript + Node.js process
(~128 MB RAM) that wraps the official grafana/k6 binary. No browser, no
Chromium — just the k6 Go HTTP stack hitting target URLs with configurable
VUs and duration.
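The wrapper's core job is composing a k6 command line from dashboard-issued parameters. A hedged sketch of that composition (the defaults and the script path are illustrative; the real agent builds these from the job payload):

```shell
# Build the argument string the agent would hand to the stock k6 binary.
VUS="${VUS:-50}"
DURATION="${DURATION:-5m}"
K6_CMD="k6 run --vus $VUS --duration $DURATION /scripts/job.js"
echo "$K6_CMD"
```

`--vus` and `--duration` are standard grafana/k6 flags; everything else here is wrapper plumbing.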
Network model: identical to the browser-engine fleet — two NICs per container, OOBI side for dashboard API calls, PROD side for outbound test traffic.
| Property | Value |
|---|---|
| Docker project | ai_forse_k6 |
| Compose file | docker-compose.k6-fleet.yml |
| Service name | k6agent |
| Max agents (Compose) | 1,000 |
| Max agents (K8s HPA) | 1,000 |
| RAM per agent | ~128 MiB |
| K8s scale-to-zero | yes (KEDA or Kubernetes ≥ 1.32) |
Scaling:
# Bring up synthetic-load fleet with 5 agents
scripts/stack-up.sh up
scripts/stack-up.sh scale-k6 5
# Or via the dashboard slider at /agents/k6
Why the synthetic-load fleet is a separate Compose project (not a second service in docker-compose.fleet.yml):
- Independent scaling — the browser-engine fleet and the synthetic-load fleet have very different capacity characteristics (browser memory vs. CPU threads). Keeping them in separate projects allows scaling each without touching the other.
- Independent restarts — a synthetic-load agent crash loop doesn't bounce browser-engine agents. Operators can scripts/stack-up.sh restart-k6 without interrupting active browser cycles.
- Independent image lifecycle — synthetic-load engine image updates don't require rebuilding the browser-engine image, and vice versa.
For the full synthetic-load engine operations guide, see docs/K6_FLEET.md.
DUT test-bed mode (physical NGFW)¶
Since the feat/k6-executor-scripts work stream (2026-05), the stack supports an optional physical-DUT extension that places the NGFW under test in the data path of every agent-to-webserver connection. All three agent fleets (browser engine, synthetic-load engine, webservers) are moved from the ai_forse_prod bridge onto per-VLAN macvlan networks on an 802.1q trunk; the NGFW is the L3 gateway for all of them.
DUT topology — additional compose files¶
docker-compose.dut.playwright.yml # browser engine on VLAN 20 macvlan (ai_forse_dut_pw)
docker-compose.dut.k6.yml # synthetic-load agents on VLAN 30 macvlan (ai_forse_dut_k6)
docker-compose.dut.observability.yml # Observability overlay: adds snmp_exporter +
# overrides prometheus to prometheus.dut.yml
scripts/netsetup-dut.sh # Creates 802.1q subinterfaces + macvlan networks
scripts/stack-up.sh # Extended: up-dut / down-dut / destroy-dut / ps-dut
observability/snmp/snmp.yml # SNMP module definitions (10 vendor modules)
observability/prometheus/prometheus.dut.yml # Prometheus config for DUT mode
observability/prometheus/targets/ubuntu-hosts.yml # node_exporter host list (file SD)
observability/grafana/dashboards/nexus9000.json # Nexus 9000 switch dashboard
observability/grafana/dashboards/ngfw-dut.json # NGFW DUT dashboard
observability/grafana/dashboards/ubuntu-hosts.json # Ubuntu/UCS host dashboard
DUT network segments¶
ai_forse_oobi (--internal bridge)
┌──────────────────────────────────────────────────────────────────┐
│ postgres dashboard Prometheus Grafana snmp_exporter │
└──────────────────────────────────────────────────────────────────┘
│ control-plane API calls (priority 100)
Webservers: K8s persona namespaces only.
┌──────────────────────────────────────────────────────────────┐
│ 20 Synthetic Persona namespaces (VLANs 101-120, 10.1.x.0/27) │
│ 10 Cloned Persona slots (VLANs 200-209, 10.2.x.0/27)         │
│ NGFW .1 is default GW on every persona /27.                  │
│ (Legacy Docker-mode VLAN 10 webserver layer removed v3.7.0.) │
└──────────────────────────────────────────────────────────────┘
VLAN 20 ai_forse_dut_pw 172.16.0.0/16 macvlan on eth1.20
┌────────────────────────────────────────────────┐
│ browser-engine agents (NGFW .1 is default GW) │
└────────────────────────────────────────────────┘
VLAN 30 ai_forse_dut_k6 172.17.0.0/16 macvlan on eth1.30
┌────────────────────────────────────────────────┐
│ synthetic-load agents (NGFW .1 is default GW) │
└────────────────────────────────────────────────┘
VLAN 99 ai_forse_mgmt 192.168.90.0/24 macvlan on eth1.99
┌────────────────────────────────────────────────┐
│ snmp_exporter only — reaches MGMT0 (Nexus) │
│ and MGMT interface (NGFW) via SNMP UDP/161 │
└────────────────────────────────────────────────┘
│
Cisco Nexus 9000
(L2 trunk: VLANs 10/20/30/99)
│
NGFW DUT (L3 gateway .1 per VLAN)
Each container in the DUT compose files has two NICs:
- ai_forse_oobi (priority 100) — control plane: dashboard API, Prometheus scrape.
- ai_forse_dut_{web,pw,k6} (priority 1000) — data plane, default route → NGFW.
One-time host setup (Linux only)¶
# Create 802.1q subinterfaces and Docker macvlan networks.
# DUT_DATA_IFACE = the trunk NIC connected to the Nexus 9000.
sudo DUT_DATA_IFACE=eth1 scripts/netsetup-dut.sh setup
# Verify
scripts/netsetup-dut.sh status
Set DUT variables in .env:
SNMP_NEXUS_HOST=192.168.90.2 # Nexus 9000 MGMT0 IP on VLAN 99
SNMP_NGFW_HOST=192.168.90.3 # NGFW management interface IP
SNMP_COMMUNITY=public # SNMPv2c community string
SNMP_DUT_MODULE=cisco_ftd # Choose the right SNMP module for your NGFW:
# cisco_ftd / cisco_iosxe / cisco_meraki /
# fortinet_fortigate / palo_alto /
# checkpoint / huawei_ngfw / generic_ngfw
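For orientation, the scrape job that prometheus.dut.yml plausibly builds from these variables follows the canonical snmp_exporter relabeling pattern (a sketch, not the repo file; the snmp_exporter:9116 address and job name are assumptions):

```yaml
scrape_configs:
  - job_name: snmp_ngfw
    metrics_path: /snmp
    params:
      module: [cisco_ftd]            # <- SNMP_DUT_MODULE
    static_configs:
      - targets: ['192.168.90.3']    # <- SNMP_NGFW_HOST
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target # device the exporter should poll
      - source_labels: [__address__]
        target_label: instance
      - target_label: __address__
        replacement: snmp_exporter:9116  # scrape the exporter, not the device
```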
Bring up / down the DUT stack¶
# Bring up all DUT stacks (webservers + agents + observability overlay)
scripts/stack-up.sh up-dut
# Scale synthetic-load engine to 10 agents for load testing
scripts/stack-up.sh scale-k6 10
# Status of all DUT services
scripts/stack-up.sh ps-dut
# Tear down (stops containers; macvlan networks + volumes remain)
scripts/stack-up.sh down-dut
# Full teardown including macvlan networks
scripts/stack-up.sh destroy-dut
sudo DUT_DATA_IFACE=eth1 scripts/netsetup-dut.sh teardown
SNMP monitoring — supported NGFW vendors¶
| SNMP_DUT_MODULE | Device | Key vendor-specific metrics |
|---|---|---|
| cisco_ftd | Cisco Firepower FTD 7.x | CISCO-FIREWALL-MIB sessions/CPS, CRYPTO-ACCEL HW engine |
| cisco_iosxe | Cisco C8475-G2 IOS-XE | PROCESS-MIB CPU, MEMORY-POOL-MIB, queue drops |
| cisco_meraki | Meraki MX450 | IF-MIB only (Dashboard API needed for CPU/sessions) |
| fortinet_fortigate | FortiGate FG200F/FG600F | fgSysCpuUsage, fgSysSesCount, fgSysSesRate1, VDOMs |
| palo_alto | PA-series / VM-series | panSessionActive, panSysCpuUtilization, panSessionThroughput |
| checkpoint | Quantum / Gaia R81.x | fwNumConn, svnPerfCPU, svnPerfMem |
| huawei_ngfw | Huawei USG / NGFW | hwEntityCPUUsage, hwEntityMemUsage, temperature |
| generic_ngfw | Any vendor | Standard MIBs: IF-MIB, HOST-RESOURCES-MIB |
All modules share a common set of IF-MIB metrics (ifHCInOctets,
ifHCOutOctets, ifOperStatus, dot3StatsFCSErrors) that work with the
NGFW DUT Grafana dashboard regardless of vendor.
Ubuntu host monitoring (Cisco UCS)¶
Physical Ubuntu hosts (Cisco UCS servers) running on the MGMT network are
monitored with node_exporter:
# Deploy node_exporter on each Ubuntu/UCS host (run as root on the host)
docker run -d --name node_exporter --net host --pid host \
--restart unless-stopped \
-v /:/host:ro,rslave \
prom/node-exporter:v1.8.2 \
--path.rootfs=/host \
--collector.systemd \
--collector.processes
Then add the host IP to observability/prometheus/targets/ubuntu-hosts.yml:
- targets:
- 192.168.90.10:9100 # ucs-host-01
- 192.168.90.11:9100 # ucs-host-02
labels:
role: ubuntu_host
rack: ucs
Prometheus picks up the file every 30 s — no restart needed.
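The 30 s pickup works because Prometheus file-based service discovery re-reads target files on an interval; the matching job in prometheus.dut.yml plausibly looks like this (a sketch, job name assumed):

```yaml
scrape_configs:
  - job_name: ubuntu_hosts
    file_sd_configs:
      - files:
          - targets/ubuntu-hosts.yml
        refresh_interval: 30s   # re-read the target file; no restart needed
```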
The Ubuntu Hosts (UCS) Grafana dashboard (dut-ubuntu-hosts) shows:
CPU modes, load averages, memory breakdown, swap, disk I/O, IOPS,
filesystem usage, network bps/pps/errors, open FDs, processes, and
Docker container count.
For UCS chassis-level hardware health (CIMC temperature, fans, PSU),
enable SNMP on each server's CIMC and add a separate scrape job using
the generic_ngfw module pointing at the CIMC management IP.
When NOT to split¶
- Laptop dev with docker-compose up -d, two browser tabs, and a half-baked target list — the all-in-one is faster and there's no real isolation benefit.
- CI smoke tests that only need a 60-second sanity check.
- Demos where you need a single-command up.
For everything else (production, staging, or "I want to scale past 20 agents"), prefer the split.
Production Ubuntu host tuning¶
Once per host, run the Ubuntu performance tuning script — it applies the UDP/TCP buffer sizes the Caddy webserver fleet needs for HTTP/2 + HTTP/3 throughput, switches the congestion control to BBR with fq qdisc, and bumps the NIC ring buffers:
sudo scripts/tune-ubuntu-host.sh
scripts/stack-up.sh restart-webserver
Details and rationale per knob are in docs/PERFORMANCE_TUNING_HOST.md. Without this, the Caddy boot log shows "failed to sufficiently increase receive buffer size" and HTTP/3 drops packets under load.
See also¶
- docker-compose.control.yml — control plane Compose file
- docker-compose.fleet.yml — browser-engine fleet Compose file
- docker-compose.webserver.yml — webserver Compose file (container-level kernel tuning)
- docker-compose.k6-fleet.yml — synthetic-load fleet Compose file
- docker-compose.observability.yml — Prometheus + Grafana + Loki
- docker-compose.cloner.yml — Public Website Cloner (dual NIC: ai_forse_oobi + ai_forse_isp)
- docker-compose.dut.*.yml — DUT test-bed Compose files (macvlan, physical NGFW)
- docker-compose.dut.observability.yml — DUT observability overlay (snmp_exporter)
- scripts/stack-up.sh — convenience wrapper (up/down/scale/restart-*/up-dut/down-dut)
- scripts/netsetup-dut.sh — 802.1q VLAN subinterfaces + macvlan network setup
- scripts/tune-ubuntu-host.sh — Ubuntu host tuning + NIC tuning + verification report
- docs/PERFORMANCE_TUNING_HOST.md — full tuning guide
- docs/K6_FLEET.md — synthetic-load fleet operations guide
- docs/DUT_TESTBED.md — physical NGFW test-bed setup guide (see below)
- k8s/ — production Kubernetes manifests (same topology)
- docs/ARCHITECTURE.md — overall system architecture (includes DUT topology diagram)
Stack 6 — Public Website Cloner (docker-compose.cloner.yml)¶
The cloner is an optional, isolated stack that downloads real public websites using a headful stealth browser engine and serves the static mirror to the persona fleet.
What it does¶
- Admin creates a clone job via the Dashboard API (POST /api/clone/jobs {url, personaName})
- The cloner pod claims the job atomically (PostgreSQL SKIP LOCKED)
- The browser engine navigates the target URL with anti-bot stealth and captures all assets
- Assets are written to /mnt/cloned/{personaName}/ (PVC in K8s, host volume in Compose)
- The clone-serve HTTP server (:8081) serves the mirror: /{personaName}/index.html
- Browser-engine fleet tests can point at http://clone-serve:8081/{persona}/ through the NGFW
Dual-network isolation¶
ai_forse_oobi ──► Dashboard API (management traffic, no internet)
ai_forse_isp ──► Public internet (site downloads, ICMP health pings)
The ai_forse_isp network is a standard Docker bridge without internal: true, so it has NAT internet access. On Linux the host must have IP forwarding enabled (net.ipv4.ip_forward=1).
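A quick pre-flight check for that requirement (read-only, safe to run anywhere; prints a hint if forwarding is off):

```shell
# Returns success when the kernel will route the cloner's NAT traffic.
ip_forward_enabled() {
  [ "$(cat /proc/sys/net/ipv4/ip_forward 2>/dev/null)" = "1" ]
}

if ip_forward_enabled; then
  echo "net.ipv4.ip_forward=1 (OK)"
else
  echo "enable with: sudo sysctl -w net.ipv4.ip_forward=1"
fi
```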
Internet health monitor¶
The cloner pings 8.8.8.8 and 1.1.1.1 every 10 seconds and exposes the results as Prometheus metrics at :8081/metrics. The TLSStress.Art Grafana dashboard shows a live green/red "Acesso Internet (ISP)" (ISP internet access) indicator.
Bring up / down the cloner¶
# Prerequisites: control stack must be running (dashboard + postgres)
CONTROLLER_TOKEN=<same-as-AGENT_API_TOKEN> \
docker compose -f docker-compose.cloner.yml up -d
# Logs
docker compose -f docker-compose.cloner.yml logs -f cloner
# Check internet health metrics
curl http://localhost:8081/metrics | grep cloner_
# Stop
docker compose -f docker-compose.cloner.yml down
See also¶
- docs/CLONER_OPERATIONS.md — full ops guide, API reference, troubleshooting
- k8s/80-cloner-nad.yaml — ISP NetworkAttachmentDefinition
- k8s/81-cloner-deployment.yaml — Kubernetes Deployment