Split-stack topology (6-stack model)¶
TL;DR — Six independent Compose projects on one host (or spread across hosts), with traffic-segmented networks: the control plane talks over an --internal Docker bridge that has no internet access at all, while the agent fleets have a second NIC dedicated to outbound HTTPS against the target sites under test. Any stack restarts independently of the others. Same UX as the all-in-one, with the operational + security properties of the Kubernetes layout.
Scope status (post-Scope-Freeze 2026-05-10) — See ARCHITECTURE.md for the canonical 37 MÓDULOs + 7 Test Kinds + DOM/CPOS/PIE-PA safety architecture. ADRs 0014, 0019-0025 cover post-Freeze additions.
Why split?¶
| | All-in-one (docker-compose.yml) | Split (docker-compose.{control,fleet}.yml) |
|---|---|---|
| docker compose down | tears down DB, dashboard AND every agent | tears down only the half you target |
| Update dashboard | recreates DB + agents (cascade) | recreates only dashboard; agents keep running |
| Postgres OOM | drags down dashboard + agents | isolated to the control host |
| Per-host RAM allocation | one box for everything | dedicate one box to control, scale agents on others |
| Docker socket exposure | dashboard mounts host socket → 1 surface | only the control host needs the socket |
| Agent restart loop | risk of cascading via depends_on | fleet is a leaf; restarts don't bubble up |
| Network blast radius | one shared bridge → DB sees the same NAT/internet path as agents | OOBI is --internal; postgres + dashboard literally cannot reach the public internet |
| Migration to Kubernetes | conceptual gap | drop-in mapping (control = control plane Deployment + NetworkPolicy, fleet = agent Deployment + NetworkPolicy) |
Network segmentation (since v1.6.2 + v3.0.0 synthetic-load engine)¶
ai_forse_oobi (--internal — no NAT, no internet)
┌──────────────────────────────────────────────────────────────────────────┐
│ postgres ←── dashboard ──→ agent-1..N ──→ webserver-1..M k6agent-1..K │
│ (no internet) (no internet, +Docker.sock) (ctrl+scrape) (ctrl only) │
└──────────────────────────────────────────────────────────────────────────┘
│
│ outbound / inbound (NAT)
▼
ai_forse_prod (regular bridge, NAT)
┌─────────────────────────────────────────────┐
│ agent-1..N → internet (target sites) │
│ agent-1..N → webserver-1..M │
│ k6agent-1..K → internet (target sites) │
│ external client → webserver-1..M │
│ (HTTP/2, HTTP/3 / QUIC) │
└─────────────────────────────────────────────┘
postgres-1 ── attached to: [ai_forse_oobi] (NEVER ai_forse_prod)
dashboard-1 ── attached to: [ai_forse_oobi] (NEVER ai_forse_prod)
agent-N ── attached to: [ai_forse_oobi (priority 100),
ai_forse_prod (priority 1000) — default GW]
webserver-N ── attached to: [ai_forse_oobi (priority 100, /metrics, /healthz),
ai_forse_prod (priority 1000, :443 TCP + UDP)]
k6agent-K ── attached to: [ai_forse_oobi (priority 100),
ai_forse_prod (priority 1000) — default GW]
What this buys you:
- Defense in depth — a compromised agent (RCE via crafted page) cannot pivot from ai_forse_prod into ai_forse_oobi without the auth token.
- No exfiltration from the control plane — postgres + dashboard have no default gateway. Even if compromised, they cannot ship data to an attacker-controlled host. They can only see (a) other containers on ai_forse_oobi and (b) the host's Docker socket (dashboard only, for the autoscaler).
- Clean separation of test traffic from management traffic — easier WAF / firewall / QoS rules. Carrier-grade OOBI-vs-production posture.
- Predictable bandwidth accounting — sum the bytes on ai_forse_prod to know how much real-world test traffic the fleet generated; sum the bytes on ai_forse_oobi to know your own management overhead.
The Compose priority attribute on the agent's two-network attachment (priority 100 on OOBI, 1000 on PROD) makes the prod NIC the default route. Belt + suspenders: even if priorities were equal, OOBI is --internal so Docker would refuse to make it the default gateway anyway.
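As a reference, the two-network attachment can be sketched in Compose YAML like this (service and network names follow this guide; the repo's actual compose files may differ):

```yaml
# Sketch: agent service attached to both segments; the higher-priority
# network (ai_forse_prod) supplies the default gateway.
services:
  agent:
    networks:
      ai_forse_oobi:
        priority: 100     # control plane: register, heartbeat, scrape
      ai_forse_prod:
        priority: 1000    # data plane: default route to target sites
networks:
  ai_forse_oobi:
    external: true        # created once with `docker network create --internal`
  ai_forse_prod:
    external: true
```

Declaring both networks `external: true` matches the one-time `docker network create` setup below, so Compose attaches to the pre-segmented bridges instead of creating project-scoped ones.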
What stays the same¶
- Same images (ghcr.io/nollagluiz/web-agent-{agent,dashboard})
- Same config, same secrets, same .env
- Same database schema and migrations (auto-run by the dashboard at boot)
- Same scaler logic — the dashboard mounts the host's Docker socket and scales agents in the other Compose project by label match
- Same /var/run/docker.sock permission model — only the control host needs DOCKER_GID set; the fleet host(s) don't see the socket at all
- All-in-one (docker-compose.yml) keeps working unchanged for dev / laptop / smoke tests
Files¶
# ── Standard stacks (Docker bridge / NAT) ──────────────────────────────────
docker-compose.yml # all-in-one (dev, laptop, smoke tests)
docker-compose.control.yml # split stack 1 — postgres + dashboard
docker-compose.fleet.yml # split stack 2 — browser-engine agents (1..300)
docker-compose.webserver.yml # split stack 3 — Caddy webservers (HTTP/2 + HTTP/3)
docker-compose.k6-fleet.yml # split stack 4 — synthetic-load engine load-test agents (1..1,000)
docker-compose.observability.yml # split stack 5 — Prometheus + Grafana + Loki
docker-compose.cloner.yml # split stack 6 — Public Website Cloner (dual NIC: OOBI + ISP)
# ── DUT test-bed stacks (802.1q macvlan — physical NGFW in the path) ───────
docker-compose.dut.playwright.yml # DUT stack A — browser engine on VLAN 20 macvlan
docker-compose.dut.k6.yml # DUT stack B — synthetic-load agents on VLAN 30 macvlan
docker-compose.dut.observability.yml # DUT overlay — adds snmp_exporter, overrides Prometheus
# ── DUT support scripts / config ───────────────────────────────────────────
scripts/netsetup-dut.sh # 802.1q subinterfaces + macvlan networks setup
observability/snmp/snmp.yml # SNMP module defs: 10 vendor modules
observability/prometheus/prometheus.dut.yml # Prometheus config for DUT mode
observability/prometheus/targets/ubuntu-hosts.yml # node_exporter hosts (file SD)
observability/grafana/dashboards/nexus9000.json # Nexus 9000 switch dashboard
observability/grafana/dashboards/ngfw-dut.json # NGFW DUT dashboard
observability/grafana/dashboards/ubuntu-hosts.json # Ubuntu / Cisco UCS host dashboard
# ── Overrides (apply to both standard and DUT stacks) ──────────────────────
docker-compose.ghcr.yml # pins image tags (GHCR)
docker-compose.no-scaler.yml # opts out of the local autoscaler
# ── Convenience wrapper ─────────────────────────────────────────────────────
scripts/stack-up.sh # verbs: up / down / scale / restart-* /
# up-dut / down-dut / destroy-dut / ps-dut
One-time setup¶
# Two segmented networks. The OOBI one is --internal so containers on it
# cannot reach the public internet (this is what postgres + dashboard
# attach to, exclusively).
docker network create --driver bridge --internal ai_forse_oobi
docker network create --driver bridge ai_forse_prod
# Or — easier — let the helper script create both with the right flags
# automatically on first `up`:
scripts/stack-up.sh up
Add to your .env (in addition to the existing variables):
# Tell the dashboard's autoscaler which Compose project owns the agent
# containers. Mismatch produces a clear error in /api/admin/scale/reconcile.
FLEET_PROJECT_NAME=ai_forse_fleet
# OPTIONAL — override the data-plane network's CIDR if Docker's default
# (172.x range) collides with a corporate VPN range. Only consumed by
# `scripts/stack-up.sh` when it creates the ai_forse_prod network for
# the first time. Has no effect once the network already exists.
# PROD_SUBNET=10.250.0.0/16
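The network-creation step can be sketched in plain shell. This is a hedged approximation of what a helper like scripts/stack-up.sh does on first `up`, not its actual source; the ensure_network name is invented here, and `echo` is used so the logic can be inspected without Docker (drop the `echo` for a real run):

```shell
# Create each segment only if it does not exist yet (idempotent, so it is
# safe to run on every `up`). Prints the command instead of executing it.
ensure_network() {
  local name="$1"; shift
  echo docker network create --driver bridge "$@" "$name"
}

ensure_network ai_forse_oobi --internal
# PROD_SUBNET (optional, from .env) only matters at creation time.
ensure_network ai_forse_prod ${PROD_SUBNET:+--subnet "$PROD_SUBNET"}
```

A real helper would first test `docker network inspect "$name"` and skip creation when the network already exists, which is why PROD_SUBNET has no effect after first creation.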
Migration from v1.6.1's single network¶
v1.6.1 used a single ai_forse_bridge network. v1.6.2+ replaces it
with the two segmented networks above. To migrate cleanly:
# 1) Stop the stack (keeps the postgres volume)
scripts/stack-up.sh down
# 2) Drop the legacy network (only safe if nothing else on this host
# is attached to it)
docker network rm ai_forse_bridge
# 3) Bring everything back up — the helper creates the two new networks
scripts/stack-up.sh up
The helper warns you on up if the legacy network is still present.
Bring up the stack¶
# Easiest — uses the helper script, picks up the FLEET_PROJECT_NAME
# automatically and creates the network if missing
scripts/stack-up.sh up
# Or by hand
docker compose -p ai_forse_control -f docker-compose.control.yml up -d
docker compose -p ai_forse_fleet -f docker-compose.fleet.yml up -d
After ~30 s:
scripts/stack-up.sh ps
# === control ===
# NAME STATUS PORTS
# ai_forse_control-postgres-1 Up 30s (healthy) 127.0.0.1:5432->5432/tcp
# ai_forse_control-dashboard-1 Up 25s (healthy) 127.0.0.1:3000->3000/tcp
#
# === fleet ===
# NAME STATUS
# ai_forse_fleet-agent-1 Up 20s (healthy)
Day-to-day operations¶
# Bounce the dashboard — agents stay up. PR #56 retry-on-register
# absorbs the brief 503 window cleanly; you'll see a few "register
# attempt failed" log lines, then "registered with controller (recover)"
# inside ~10s.
scripts/stack-up.sh restart-control
# Bounce the agents — dashboard + DB stay up.
scripts/stack-up.sh restart-fleet
# Scale the fleet manually
scripts/stack-up.sh scale 25
# Or use the slider in /agents — works identically (autoscaler reads
# the Docker socket and finds containers by their compose label).
# Tail logs from both stacks at once
scripts/stack-up.sh logs
# Stop everything (preserves volumes)
scripts/stack-up.sh down
# Wipe everything — including the Postgres volume!
scripts/stack-up.sh destroy
Migrating from the all-in-one¶
You can switch without losing data. The Postgres volume is named
ai_forse_postgres_data in the all-in-one and
ai_forse_control_postgres_data in the split. Either:
Option A — fresh start (recommended for non-prod)¶
docker compose down # stop the all-in-one
scripts/stack-up.sh up # bring up the split (new empty DB)
# Open /setup, recreate admin user + targets, move slider.
Option B — preserve the database¶
# 1) Dump the all-in-one DB
docker compose exec -T postgres pg_dump -U agent_dashboard agent_dashboard \
> /tmp/agent_dashboard.sql
# 2) Stop the all-in-one
docker compose down
# 3) Bring up the split (control plane only, with empty DB)
docker network create --driver bridge --internal ai_forse_oobi
docker network create --driver bridge ai_forse_prod
docker compose -p ai_forse_control -f docker-compose.control.yml up -d postgres
# 4) Wait for Postgres to be healthy
docker compose -p ai_forse_control exec -T postgres pg_isready
# 5) Restore
docker compose -p ai_forse_control exec -T postgres psql -U agent_dashboard \
agent_dashboard < /tmp/agent_dashboard.sql
# 6) Bring up dashboard + fleet
scripts/stack-up.sh up
The dashboard's auto-migration runner (PR #33) is idempotent, so even if the dump came from an older minor it'll catch the schema up.
Multi-host deployment¶
The split topology runs naturally across two physical machines:
┌─ Host A (control) ────────────┐ ┌─ Host B / C / D (fleet) ──────────┐
│ │ │ │
│ docker-compose.control.yml │◄──►│ docker-compose.fleet.yml │
│ ai_forse_oobi (local) │ │ ai_forse_oobi (local, --internal)│
│ │ │ ai_forse_prod (local, NAT) │
│ │ │ CONTROLLER_URL= │
│ │ │ http://host-a.internal:3000 │
└───────────────────────────────┘ └────────────────────────────────────┘
Note on multi-host networks: Docker bridge networks are host-local. In multi-host deployments each host creates its own ai_forse_oobi and ai_forse_prod, and inter-host traffic flows over the host's physical NIC (the agent's CONTROLLER_URL points at a real DNS name / IP). For seamless cross-host service discovery use Docker Swarm overlay networks or, better, Kubernetes — see k8s/.
On host A (control):
docker network create --driver bridge --internal ai_forse_oobi
docker network create --driver bridge ai_forse_prod
docker compose -p ai_forse_control -f docker-compose.control.yml up -d
On hosts B/C/D (fleet), with .env containing
CONTROLLER_URL=http://host-a.internal:3000:
docker network create --driver bridge --internal ai_forse_oobi
docker network create --driver bridge ai_forse_prod
docker compose -p ai_forse_fleet -f docker-compose.fleet.yml up -d \
--scale agent=20
Multi-host caveat: the dashboard's local autoscaler talks to the Docker socket of host A, so it can only scale containers running on host A. For multi-host fleets you have three options:
- Disable the autoscaler (use docker-compose.no-scaler.yml on the control side) and scale each host manually with --scale agent=N per host.
- Use Kubernetes — it's literally the same topology, and the K8s HPA scales across nodes for free. See k8s/.
- Run a Docker Swarm — same Compose files, just docker stack deploy. The dashboard's scaler doesn't know about Swarm, so item 1 still applies.
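The manual per-host option is easy to script. A hedged sketch (host names, scale count, and the RUN dry-run switch are all illustrative, not part of the repo):

```shell
# Fan a manual --scale out to each fleet host over SSH.
# RUN=echo (the default here) prints the commands instead of running them;
# set RUN="" to execute for real.
RUN="${RUN-echo}"
scale_fleet_host() {
  local host="$1" n="$2"
  $RUN ssh "$host" docker compose -p ai_forse_fleet \
    -f docker-compose.fleet.yml up -d --scale "agent=$n"
}

for h in host-b.internal host-c.internal host-d.internal; do
  scale_fleet_host "$h" 20
done
```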
Resource budgeting¶
Suggested baselines per topology size:
| Fleet size | Control host | Per-fleet host |
|---|---|---|
| 1–10 agents | 1 vCPU, 2 GiB | n/a (same host as control) |
| 10–30 agents | 1 vCPU, 2 GiB | 4 vCPU, 8 GiB |
| 30–100 agents | 2 vCPU, 4 GiB (shared_buffers=1GB) | 16 vCPU, 32 GiB |
| 100+ agents | 4 vCPU, 8 GiB (shared_buffers=2GB) + dedicated agent hosts | 32 vCPU, 64 GiB each, 1 agent ≈ 250 MiB RSS |
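A quick sanity check of the last row's arithmetic (numbers from the table; the 8 GiB OS/Docker reserve is an assumption, not a repo setting):

```shell
# How many ~250 MiB-RSS agents fit on a 64 GiB fleet host, leaving
# headroom for the OS and Docker itself.
host_gib=64
reserve_gib=8          # assumed headroom for OS + Docker
per_agent_mib=250      # typical agent RSS from the table above
agents=$(( (host_gib - reserve_gib) * 1024 / per_agent_mib ))
echo "$agents"         # prints 229
```

That 229 is a memory-only ceiling; the 32 vCPU guidance and the 1 GiB hard cap below are what keep worst-case behavior bounded.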
Per-agent memory ceiling is enforced by mem_limit: 1g (1 GiB cgroup
hard-cap) and pids_limit: 500 (Chromium runaway prevention).
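In Compose terms those caps are a two-liner on the agent service; a sketch consistent with the limits above, not the verbatim file:

```yaml
services:
  agent:
    mem_limit: 1g     # hard cgroup memory cap — OOM-kills a runaway agent
    pids_limit: 500   # bounds Chromium's process/thread fan-out
```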
The webserver stack (since v1.7.0)¶
The ai_forse_webserver Compose project deploys a pool of up to
20 Caddy 2.8 containers serving HTTP/2 (TCP) and HTTP/3 (QUIC over
UDP) at maximum throughput. Same network model as the agent fleet:
two NICs per container, with the OOBI side carrying the dashboard's
/metrics scrape and the PROD side carrying real client traffic.
What each webserver serves:
| Endpoint | What you get | Why |
|---|---|---|
| / | Static landing page (4 KB HTML + 1 CSS + 1 JS) | Smoke test |
| /static/small | 4 KB binary | Tiny payload baseline |
| /static/medium | 100 KB binary | Mid payload — header-compression test |
| /static/large | 1 MB binary | Large payload — multiplexing test |
| /api/echo | JSON with protocol + request introspection | "What HTTP version did I just talk?" |
| /healthz | 200 OK | Docker HEALTHCHECK |
| :9091/metrics | Prometheus format | Dashboard scrape (OOBI only) |
Performance posture of the Caddyfile:
- HTTP/3 enabled by default via protocols h1 h2 h3
- Alt-Svc header advertises h3=":443" so capable clients upgrade
- Brotli + zstd + gzip compression with a 1 KB minimum
- Caddy serves static files via sendfile/splice zero-copy
- Idle timeout 2 min, read-header timeout 10 s
- TLS via Caddy's internal CA (self-signed). Switch to ACME by removing local_certs and adding email … in the global block.
- Metrics endpoint isolated on :9091 so the high-cardinality scrape doesn't mix with client traffic on :443
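Collected into one place, the global options implied by that list would look roughly like this Caddyfile fragment (a hedged sketch; the repo's Caddyfile may differ in layout and extras):

```caddyfile
{
	local_certs                  # internal CA; replace with `email you@example.com` for ACME
	servers {
		protocols h1 h2 h3   # h3 enables QUIC; Caddy advertises Alt-Svc automatically
		timeouts {
			read_header 10s
			idle 2m
		}
	}
}
```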
Scaling:
# Bring up at default scale (1)
scripts/stack-up.sh up
# Scale to 12 webservers
scripts/stack-up.sh scale-webserver 12
# Or via the dashboard slider in /webservers (added in PR #C)
Why the data plane port isn't published to the host¶
At scale (up to 20 replicas) Compose can't bind 20 containers to
host port 443. Other containers on ai_forse_prod (the agents)
reach each webserver by its container DNS name, e.g.
https://ai_forse_webserver-webserver-3:443/. For browser smoke
testing from the host, override one replica to publish a port:
docker compose -p ai_forse_webserver -f docker-compose.webserver.yml \
  run --rm -p 8443:443 webserver
The synthetic-load fleet stack (since v3.0.0)¶
The ai_forse_k6 Compose project deploys a pool of up to 1,000 synthetic-load engine
load-test agents. Each agent is a lightweight TypeScript + Node.js process
(~128 MB RAM) that wraps the official grafana/k6 binary. No browser, no
Chromium — just the k6 Go HTTP stack hitting target URLs with configurable
VUs and duration.
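The wrapper's core job is composing a k6 command line from dashboard-issued parameters. A hedged sketch of that composition (the defaults and the script path are illustrative; the real agent builds these from the job payload):

```shell
# Build the argument string the agent would hand to the stock k6 binary.
VUS="${VUS:-50}"
DURATION="${DURATION:-5m}"
K6_CMD="k6 run --vus $VUS --duration $DURATION /scripts/job.js"
echo "$K6_CMD"
```

`--vus` and `--duration` are standard grafana/k6 flags; everything else here is wrapper plumbing.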
Network model: identical to the browser-engine fleet — two NICs per container, OOBI side for dashboard API calls, PROD side for outbound test traffic.
| Property | Value |
|---|---|
| Docker project | ai_forse_k6 |
| Compose file | docker-compose.k6-fleet.yml |
| Service name | k6agent |
| Max agents (Compose) | 1,000 |
| Max agents (K8s HPA) | 1,000 |
| RAM per agent | ~128 MiB |
| K8s scale-to-zero | yes (KEDA or Kubernetes ≥ 1.32) |
Scaling:
# Bring up synthetic-load fleet with 5 agents
scripts/stack-up.sh up
scripts/stack-up.sh scale-k6 5
# Or via the dashboard slider at /agents/k6
Why the synthetic-load fleet is a separate Compose project (not a second service in docker-compose.fleet.yml):
- Independent scaling — the browser-engine fleet and the synthetic-load fleet have very different capacity characteristics (browser memory vs. CPU threads). Keeping them in separate projects allows scaling each without touching the other.
- Independent restarts — a synthetic-load agent crash loop doesn't bounce browser-engine agents. Operators can scripts/stack-up.sh restart-k6 without interrupting active browser cycles.
- Independent image lifecycle — synthetic-load engine image updates don't require rebuilding the browser-engine image, and vice versa.
For the full synthetic-load engine operations guide, see docs/K6_FLEET.md.
DUT test-bed mode (physical NGFW)¶
Since the feat/k6-executor-scripts work stream (2026-05), the stack supports an optional physical-DUT extension that places the NGFW under test in the data path of every agent-to-webserver connection. All three agent fleets (browser engine, synthetic-load engine, webservers) are moved from the ai_forse_prod bridge onto per-VLAN macvlan networks on an 802.1q trunk; the NGFW is the L3 gateway for all of them.
DUT topology — additional compose files¶
docker-compose.dut.playwright.yml # browser engine on VLAN 20 macvlan (ai_forse_dut_pw)
docker-compose.dut.k6.yml # synthetic-load agents on VLAN 30 macvlan (ai_forse_dut_k6)
docker-compose.dut.observability.yml # Observability overlay: adds snmp_exporter +
# overrides prometheus to prometheus.dut.yml
scripts/netsetup-dut.sh # Creates 802.1q subinterfaces + macvlan networks
scripts/stack-up.sh # Extended: up-dut / down-dut / destroy-dut / ps-dut
observability/snmp/snmp.yml # SNMP module definitions (10 vendor modules)
observability/prometheus/prometheus.dut.yml # Prometheus config for DUT mode
observability/prometheus/targets/ubuntu-hosts.yml # node_exporter host list (file SD)
observability/grafana/dashboards/nexus9000.json # Nexus 9000 switch dashboard
observability/grafana/dashboards/ngfw-dut.json # NGFW DUT dashboard
observability/grafana/dashboards/ubuntu-hosts.json # Ubuntu/UCS host dashboard
DUT network segments¶
ai_forse_oobi (--internal bridge)
┌──────────────────────────────────────────────────────────────────┐
│ postgres dashboard Prometheus Grafana snmp_exporter │
└──────────────────────────────────────────────────────────────────┘
│ control-plane API calls (priority 100)
Webservers: K8s persona namespaces only.
┌──────────────────────────────────────────────────────────────┐
│ 20 Synthetic Persona namespaces (VLANs 101-120, 10.1.x.0/27) │
│ 10 Cloned Persona slots (VLANs 200-209, 10.2.x.0/27)         │
│ NGFW .1 is default GW on every persona /27.                  │
│ (Legacy Docker-mode VLAN 10 webserver layer removed v3.7.0.) │
└──────────────────────────────────────────────────────────────┘
VLAN 20 ai_forse_dut_pw 172.16.0.0/16 macvlan on eth1.20
┌────────────────────────────────────────────────┐
│ browser-engine agents (NGFW .1 is default GW) │
└────────────────────────────────────────────────┘
VLAN 30 ai_forse_dut_k6 172.17.0.0/16 macvlan on eth1.30
┌────────────────────────────────────────────────┐
│ synthetic-load agents (NGFW .1 is default GW) │
└────────────────────────────────────────────────┘
VLAN 99 ai_forse_mgmt 192.168.90.0/24 macvlan on eth1.99
┌────────────────────────────────────────────────┐
│ snmp_exporter only — reaches MGMT0 (Nexus) │
│ and MGMT interface (NGFW) via SNMP UDP/161 │
└────────────────────────────────────────────────┘
│
Cisco Nexus 9000
(L2 trunk: VLANs 10/20/30/99)
│
NGFW DUT (L3 gateway .1 per VLAN)
Each container in the DUT compose files has two NICs:
- ai_forse_oobi (priority 100) — control plane: dashboard API, Prometheus scrape.
- ai_forse_dut_{web,pw,k6} (priority 1000) — data plane, default route → NGFW.
One-time host setup (Linux only)¶
# Create 802.1q subinterfaces and Docker macvlan networks.
# DUT_DATA_IFACE = the trunk NIC connected to the Nexus 9000.
sudo DUT_DATA_IFACE=eth1 scripts/netsetup-dut.sh setup
# Verify
scripts/netsetup-dut.sh status
Set DUT variables in .env:
SNMP_NEXUS_HOST=192.168.90.2 # Nexus 9000 MGMT0 IP on VLAN 99
SNMP_NGFW_HOST=192.168.90.3 # NGFW management interface IP
SNMP_COMMUNITY=public # SNMPv2c community string
SNMP_DUT_MODULE=cisco_ftd # Choose the right SNMP module for your NGFW:
# cisco_ftd / cisco_iosxe / cisco_meraki /
# fortinet_fortigate / palo_alto /
# checkpoint / huawei_ngfw / generic_ngfw
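For orientation, the scrape job that prometheus.dut.yml plausibly builds from these variables follows the canonical snmp_exporter relabeling pattern (a sketch, not the repo file; the snmp_exporter:9116 address and job name are assumptions):

```yaml
scrape_configs:
  - job_name: snmp_ngfw
    metrics_path: /snmp
    params:
      module: [cisco_ftd]            # <- SNMP_DUT_MODULE
    static_configs:
      - targets: ['192.168.90.3']    # <- SNMP_NGFW_HOST
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target # device the exporter should poll
      - source_labels: [__address__]
        target_label: instance
      - target_label: __address__
        replacement: snmp_exporter:9116  # scrape the exporter, not the device
```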
Bring up / down the DUT stack¶
# Bring up all DUT stacks (webservers + agents + observability overlay)
scripts/stack-up.sh up-dut
# Scale synthetic-load engine to 10 agents for load testing
scripts/stack-up.sh scale-k6 10
# Status of all DUT services
scripts/stack-up.sh ps-dut
# Tear down (stops containers; macvlan networks + volumes remain)
scripts/stack-up.sh down-dut
# Full teardown including macvlan networks
scripts/stack-up.sh destroy-dut
sudo DUT_DATA_IFACE=eth1 scripts/netsetup-dut.sh teardown
SNMP monitoring — supported NGFW vendors¶
| SNMP_DUT_MODULE | Device | Key vendor-specific metrics |
|---|---|---|
| cisco_ftd | Cisco Firepower FTD 7.x | CISCO-FIREWALL-MIB sessions/CPS, CRYPTO-ACCEL HW engine |
| cisco_iosxe | Cisco C8475-G2 IOS-XE | PROCESS-MIB CPU, MEMORY-POOL-MIB, queue drops |
| cisco_meraki | Meraki MX450 | IF-MIB only (Dashboard API needed for CPU/sessions) |
| fortinet_fortigate | FortiGate FG200F/FG600F | fgSysCpuUsage, fgSysSesCount, fgSysSesRate1, VDOMs |
| palo_alto | PA-series / VM-series | panSessionActive, panSysCpuUtilization, panSessionThroughput |
| checkpoint | Quantum / Gaia R81.x | fwNumConn, svnPerfCPU, svnPerfMem |
| huawei_ngfw | Huawei USG / NGFW | hwEntityCPUUsage, hwEntityMemUsage, temperature |
| generic_ngfw | Any vendor | Standard MIBs: IF-MIB, HOST-RESOURCES-MIB |
All modules share a common set of IF-MIB metrics (ifHCInOctets,
ifHCOutOctets, ifOperStatus, dot3StatsFCSErrors) that work with the
NGFW DUT Grafana dashboard regardless of vendor.
Ubuntu host monitoring (Cisco UCS)¶
Physical Ubuntu hosts (Cisco UCS servers) running on the MGMT network are
monitored with node_exporter:
# Deploy node_exporter on each Ubuntu/UCS host (run as root on the host)
docker run -d --name node_exporter --net host --pid host \
--restart unless-stopped \
-v /:/host:ro,rslave \
prom/node-exporter:v1.8.2 \
--path.rootfs=/host \
--collector.systemd \
--collector.processes
Then add the host IP to observability/prometheus/targets/ubuntu-hosts.yml:
- targets:
- 192.168.90.10:9100 # ucs-host-01
- 192.168.90.11:9100 # ucs-host-02
labels:
role: ubuntu_host
rack: ucs
Prometheus picks up the file every 30 s — no restart needed.
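The 30 s pickup works because Prometheus file-based service discovery re-reads target files on an interval; the matching job in prometheus.dut.yml plausibly looks like this (a sketch, job name assumed):

```yaml
scrape_configs:
  - job_name: ubuntu_hosts
    file_sd_configs:
      - files:
          - targets/ubuntu-hosts.yml
        refresh_interval: 30s   # re-read the target file; no restart needed
```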
The Ubuntu Hosts (UCS) Grafana dashboard (dut-ubuntu-hosts) shows:
CPU modes, load averages, memory breakdown, swap, disk I/O, IOPS,
filesystem usage, network bps/pps/errors, open FDs, processes, and
Docker container count.
For UCS chassis-level hardware health (CIMC temperature, fans, PSU),
enable SNMP on each server's CIMC and add a separate scrape job using
the generic_ngfw module pointing at the CIMC management IP.
When NOT to split¶
- Laptop dev with docker-compose up -d, two browser tabs, and a half-baked target list — the all-in-one is faster and there's no real isolation benefit.
- CI smoke tests that only need a 60-second sanity check.
- Demos where you need a single-command up.
For everything else (production, staging, or "I want to scale past 20 agents"), prefer the split.
Production Ubuntu host tuning¶
Once per host, run the Ubuntu performance tuning script — it applies the UDP/TCP buffer sizes the Caddy webserver fleet needs for HTTP/2 + HTTP/3 throughput, switches the congestion control to BBR with fq qdisc, and bumps the NIC ring buffers:
sudo scripts/tune-ubuntu-host.sh
scripts/stack-up.sh restart-webserver
Details and rationale per knob are in docs/PERFORMANCE_TUNING_HOST.md. Without this, the Caddy boot log shows "failed to sufficiently increase receive buffer size" and HTTP/3 drops packets under load.
See also¶
- docker-compose.control.yml — control plane Compose file
- docker-compose.fleet.yml — browser-engine fleet Compose file
- docker-compose.webserver.yml — webserver Compose file (container-level kernel tuning)
- docker-compose.k6-fleet.yml — synthetic-load fleet Compose file
- docker-compose.observability.yml — Prometheus + Grafana + Loki
- docker-compose.cloner.yml — Public Website Cloner (dual NIC: ai_forse_oobi + ai_forse_isp)
- docker-compose.dut.*.yml — DUT test-bed Compose files (macvlan, physical NGFW)
- docker-compose.dut.observability.yml — DUT observability overlay (snmp_exporter)
- scripts/stack-up.sh — convenience wrapper (up/down/scale/restart-*/up-dut/down-dut)
- scripts/netsetup-dut.sh — 802.1q VLAN subinterfaces + macvlan network setup
- scripts/tune-ubuntu-host.sh — Ubuntu host tuning + NIC tuning + verification report
- docs/PERFORMANCE_TUNING_HOST.md — full tuning guide
- docs/K6_FLEET.md — synthetic-load fleet operations guide
- docs/DUT_TESTBED.md — physical NGFW test-bed setup guide (see below)
- k8s/ — production Kubernetes manifests (same topology)
- docs/ARCHITECTURE.md — overall system architecture (includes DUT topology diagram)
Stack 6 — Public Website Cloner (docker-compose.cloner.yml)¶
The cloner is an optional, isolated stack that downloads real public websites using a headful stealth browser engine and serves the static mirror to the persona fleet.
What it does¶
- Admin creates a clone job via the Dashboard API (POST /api/clone/jobs {url, personaName})
- The cloner pod claims the job atomically (PostgreSQL SKIP LOCKED)
- The browser engine navigates the target URL with anti-bot stealth and captures all assets
- Assets are written to /mnt/cloned/{personaName}/ (PVC in K8s, host volume in Compose)
- The clone-serve HTTP server (:8081) serves the mirror: /{personaName}/index.html
- Browser-engine fleet tests can point at http://clone-serve:8081/{persona}/ through the NGFW
Dual-network isolation¶
ai_forse_oobi ──► Dashboard API (management traffic, no internet)
ai_forse_isp ──► Public internet (site downloads, ICMP health pings)
The ai_forse_isp network is a standard Docker bridge without internal: true, so it has NAT internet access. On Linux the host must have IP forwarding enabled (net.ipv4.ip_forward=1).
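A quick pre-flight check for that requirement (read-only, safe to run anywhere; prints a hint if forwarding is off):

```shell
# Returns success when the kernel will route the cloner's NAT traffic.
ip_forward_enabled() {
  [ "$(cat /proc/sys/net/ipv4/ip_forward 2>/dev/null)" = "1" ]
}

if ip_forward_enabled; then
  echo "net.ipv4.ip_forward=1 (OK)"
else
  echo "enable with: sudo sysctl -w net.ipv4.ip_forward=1"
fi
```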
Internet health monitor¶
The cloner pings 8.8.8.8 and 1.1.1.1 every 10 seconds and exposes the results as Prometheus metrics at :8081/metrics. The TLSStress.Art Grafana dashboard shows a live green/red "Acesso Internet (ISP)" (ISP internet access) indicator.
Bring up / down the cloner¶
# Prerequisites: control stack must be running (dashboard + postgres)
CONTROLLER_TOKEN=<same-as-AGENT_API_TOKEN> \
docker compose -f docker-compose.cloner.yml up -d
# Logs
docker compose -f docker-compose.cloner.yml logs -f cloner
# Check internet health metrics
curl http://localhost:8081/metrics | grep cloner_
# Stop
docker compose -f docker-compose.cloner.yml down
See also¶
- docs/CLONER_OPERATIONS.md — full ops guide, API reference, troubleshooting
- k8s/80-cloner-nad.yaml — ISP NetworkAttachmentDefinition
- k8s/81-cloner-deployment.yaml — Kubernetes Deployment