
TLSStress.Art — Dual-Node Deployment on 2 UCS Servers

Deployment mode: dual-node — UCS-1 runs the agent fleet (browser-engine + synthetic-load) and UCS-2 runs everything else (personas, services, observability). For a single-server setup see UBUNTU_K3S_SINGLENODE_QUICKSTART_DEPLOY.en.md; for the four-server distributed topology see UBUNTU_K3S_MULTINODE_QUICKSTART_DEPLOY.en.md.

Last verified against shipping code: v3.7.0 (2026-05-12) — see ARCHITECTURE.md for the canonical 37 MÓDULOs + 7 Test Kinds + DOM/CPOS/PIE-PA safety architecture, plus the ZTP-prem 12/12-layer insider-operator posture (25 patent claims, Tier A/B partition, Confidential Computing, sealed audit hash-chain, K8s admission webhook, TPM 2.0 measured boot, DLP egress monitor, behavioural anomaly detector). ADRs 0014 and 0019–0025 cover post-Freeze additions.

Goal: Deploy TLSStress.Art across exactly two Cisco UCS servers — one dedicated to load generation, one dedicated to webservers + services + observability — to provide an intermediate scale point between single-node and four-server multi-node.

Who this is for: Network and systems engineers running an NGFW test-bed who have two UCS chassis available and want hardware isolation between the load generators and the webservers being measured.

Time estimate: 60–90 minutes for initial cluster setup; 15 minutes for teardown + redeploy.


Run the install script once per server in this exact order (UCS-2 first because it hosts the k3s control plane):

# Clone the repository on each server
git clone https://github.com/nollagluiz/AI_forSE.git && cd AI_forSE

# UCS-2 — k3s server + personas + services + observability
sudo ./scripts/k8s-install.sh --mode=dual-server --data-iface=eth1
# ↑ Prints JOIN_TOKEN and SERVER_IP at the end — save them for UCS-1.

# UCS-1 — browser-engine + synthetic-load agents
sudo ./scripts/k8s-install.sh --mode=dual-agent \
     --server-ip=<UCS-2-OOBI-IP> --token=<JOIN_TOKEN> --data-iface=eth1

# Back on UCS-2 — apply all Kubernetes manifests
sudo ./scripts/k8s-install.sh --mode=dual-apply

Dry run: append --dry-run to any command to preview without making changes.

The rest of this guide explains each step in detail and is useful for customising the setup or troubleshooting the automated installation.


Table of contents

  1. Why dual-node
  2. Architecture overview
  3. OOBI network — mandatory on both UCS
  4. Prerequisites per server
  5. Step 1 — Host preparation (both UCS)
  6. Step 2 — k3s cluster bootstrap
  7. Step 3 — VLAN setup per node
  8. Step 4 — Apply manifests
  9. Step 5 — Verify the deployment
  10. Observability — Grafana and Prometheus in dual-node
  11. Troubleshooting
  12. Reference — overlay file

1. Why dual-node

Single-node is the simplest layout but every workload competes for the same CPU, memory and NIC queues — under load you cannot tell whether the bottleneck is the NGFW or the test bed itself. Multi-node fixes that completely but requires four UCS chassis.

Dual-node is the middle ground:

  • Hardware isolation: load generators (browser-engine + synthetic-load) and webservers (personas + cloned-personas) run on different physical UCS chassis, so the only contention path between them is the NGFW under test — exactly the variable being measured.
  • Two roles instead of four: half the rack space, half the cabling, half the operational overhead of multi-node.
  • Same observability: identical Grafana dashboards and Prometheus alerts as single- and multi-node — node_exporter runs on both UCS via DaemonSet, and the test-bed-bottleneck alert auto-adapts to the two-node topology.

Pick dual-node when you have two chassis and want clean NGFW measurements without standing up a four-server cluster.


2. Architecture overview

Physical topology

                ┌──────────────────────────────────────────────────────┐
                │           OOBI network — eth0 on BOTH UCS            │
                │                10.0.0.0/24 (example)                 │
                │     k3s API :6443 · kubelet · Prometheus scrape      │
                └────────┬──────────────────────┬──────────────────────┘
                         │                      │
                ┌────────┴────────┐    ┌────────┴─────────────────────┐
                │     UCS-1       │    │            UCS-2             │
                │   role=agents   │    │        role=ngfw-dut         │
                │                 │    │                              │
                │  browser-engine │    │   20 Caddy persona pods      │
                │  synthetic-load │    │   10 cloned-persona slots    │
                │                 │    │   Dashboard · Postgres       │
                │                 │    │   PgBouncer · Cloner · NFS   │
                │                 │    │   Grafana · Prometheus       │
                │                 │    │   SNMP exporter              │
                │  eth0 (OOBI)    │    │   eth0 (OOBI)                │
                │  eth1 (trunk)   │    │   eth1 (trunk)               │
                └────────┬────────┘    └────────────┬─────────────────┘
                         │                          │
                 VLAN 20 (172.16/16)         VLAN 99 mgmt (192.168.90/24)
                 VLAN 30 (172.17/16)         VLAN 40 ISP (DHCP via macvlan)
                                             VLANs 101–120 (10.1.x/27)
                                             VLANs 200–209 (10.2.x/27)
                         │                          │
                         └────────────┬─────────────┘
                                      │
                          ┌───────────┴────────────┐
                          │   Cisco Nexus 9000     │
                          │   VLAN trunk · ECMP    │
                          │  DSCP AF41 · MTU 9216  │
                          └───────────┬────────────┘
                                      │
                         ┌────────────┴─────────────┐
                         │ NGFW (Device Under Test) │
                         │ TLS leg-1: agents → NGFW │
                         │ TLS leg-2: NGFW → Caddy  │
                         └──────────────────────────┘

Workload placement

Workload                                     UCS-1 (role=agents)                 UCS-2 (role=ngfw-dut)
browser-engine agents (web-agent)            ✓
synthetic-load agents (k6-agent)             ✓
Synthetic personas (20× Caddy)                                                   ✓
Cloned-persona slots (10× Caddy)                                                 ✓
Dashboard, Postgres, PgBouncer                                                   ✓
Cloner                                                                           ✓
NFS server (cloned-sites)                                                        ✓
SNMP exporter                                                                    ✓
Prometheus + Grafana                                                             ✓
node_exporter (per-host metrics)             ✓ DaemonSet                         ✓ DaemonSet
node-tuning (sysctl + BBR + CPU governor)    ✓ DaemonSet (dut-data-plane=true)   ✓ DaemonSet (dut-data-plane=true)
cni-dhcp (cloner ISP DHCP)                                                       ✓ DaemonSet

Why role=ngfw-dut on UCS-2 (instead of role=infra)

The synthetic personas (personas/_generated/*/deployment.yaml), cloned-personas (k8s/clone-personas/20-slots.yaml) and SNMP exporter (k8s/dut/60-snmp-exporter.yaml) all hardcode nodeSelector: role=ngfw-dut. Reusing that label on UCS-2 avoids touching twenty-something manifests. The dual-node overlay then patches the additional service workloads (Dashboard, Postgres, PgBouncer, Cloner) — which the multi-node overlay sends to role=infra — onto the same role=ngfw-dut node.
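For reference, the pinning that the generated persona manifests carry has this shape (an excerpt sketch — only the scheduling-relevant fields of personas/_generated/*/deployment.yaml, per the paragraph above):

```yaml
# Scheduling excerpt of a persona Deployment — the full manifest
# contains image, ports, probes, etc.
spec:
  template:
    spec:
      nodeSelector:
        role: ngfw-dut   # matches the label applied to UCS-2
```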


3. OOBI network — mandatory on both UCS

The OOBI (out-of-band infrastructure) network is the dedicated control-plane segment carried over eth0 on every UCS, regardless of deployment mode.

What runs on OOBI

  • k3s API server (:6443) — UCS-1 reaches it on UCS-2 to join the cluster
  • kubelet (:10250) — UCS-2 scrapes UCS-1 for pod status
  • flannel (CNI) — pod-to-pod traffic across nodes uses VXLAN over eth0
  • Prometheus scrape — node_exporter, cAdvisor and kube-state-metrics are all reached over eth0
  • cert-manager renewals — webhook + ACME challenge traffic
  • Dashboard / kubectl — operator access

What does NOT run on OOBI

  • NGFW test traffic — that is on eth1 VLAN trunk (data plane)
  • SNMP polling — VLAN 99 (192.168.90.0/24) on eth1.99
  • Cloner internet egress — VLAN 40 on eth1.40

Requirements per UCS

                           UCS-1 (agents)         UCS-2 (services)
eth0 IP                    static, OOBI subnet    static, OOBI subnet
Reachable from peer eth0   ✓                      ✓ (kubelet ↔ control plane)
MTU                        1500 (default)         1500 (default)
Default route via OOBI     ✓                      ✓ — internet egress for image pulls and k3s install

If eth0 is missing on either UCS, k3s flannel cannot establish the control plane, UCS-1 cannot join the cluster, and Prometheus on UCS-2 cannot scrape host metrics from UCS-1. The pre-flight check in k8s-install.sh rejects both dual-server and dual-agent if eth0 is absent.
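The pre-flight logic can be sketched as follows (hypothetical — the real check lives in scripts/k8s-install.sh and may differ; the interface name is a parameter so the logic can be exercised on any host):

```shell
#!/usr/bin/env bash
# Sketch of the eth0 pre-flight: refuse to proceed when the OOBI
# interface is absent.
iface_exists() {
  [ -d "/sys/class/net/$1" ]   # the kernel exposes every NIC here
}

require_oobi() {
  local iface=${1:-eth0}
  if ! iface_exists "$iface"; then
    echo "FATAL: OOBI interface $iface not found — aborting install" >&2
    return 1
  fi
}
# Usage on each UCS:  require_oobi eth0
```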


4. Prerequisites per server

Hardware

       UCS-1                                 UCS-2
CPU    ≥ 16 physical cores                   ≥ 32 physical cores
RAM    ≥ 64 GB                               ≥ 128 GB
NICs   eth0 (OOBI) + eth1 (trunk to Nexus)   eth0 (OOBI) + eth1 (trunk to Nexus)
Disk   ≥ 200 GB free on /                    ≥ 500 GB free on / (NFS + Postgres + Prometheus retention)

UCS-2 is the heavier node — it runs personas, all services, and the observability stack. Plan for 2× the resources of UCS-1.

Operating system

  • Ubuntu 22.04 LTS or 24.04 LTS, amd64
  • root access (or sudo)
  • internet access via OOBI default route during install (k3s + Helm + container images)

Network — Nexus 9000

VLANs that must be configured on the trunk to both UCS:

VLAN      UCS-1   UCS-2   Subnet              Use
20        ✓               172.16.0.0/16       browser-engine agents
30        ✓               172.17.0.0/16       synthetic-load agents
40                ✓       DHCP from ISP       Cloner internet egress (macvlan)
99                ✓       192.168.90.0/24     SNMP polling — Nexus (.2), NGFW (.3)
101–120           ✓       10.1.{1..20}.0/27   Synthetic persona webservers
200–209           ✓       10.2.{1..10}.0/27   Cloned persona slots

Strictly speaking, only UCS-1 consumes VLANs 20 and 30, and only UCS-2 consumes VLANs 40, 99, 101–120 and 200–209. In practice the Nexus trunk can carry all of them on both ports and let each side ignore the VLANs it does not use — the trunk configuration is then identical for the two ports.
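On the Nexus side, an identical trunk on both UCS-facing ports might look like this (a sketch — the interface range is a placeholder, and platform-specific QoS/MTU policy may apply; verify against your switch configuration):

```
interface Ethernet1/1-2        ! ports to UCS-1 eth1 and UCS-2 eth1 (placeholder IDs)
  switchport mode trunk
  switchport trunk allowed vlan 20,30,40,99,101-120,200-209
  mtu 9216                     ! matches the MTU shown in the topology diagram
  no shutdown
```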


5. Step 1 — Host preparation (both UCS)

Identical preparation on UCS-1 and UCS-2:

sudo apt-get update
sudo apt-get install -y --no-install-recommends \
  curl git jq openssl ca-certificates gnupg lsb-release \
  vlan iproute2 ethtool net-tools

# Verify both NICs
ip link show eth0
ip link show eth1

Then run the project's host-tuning script on both UCS — sysctls for high-fan-out QUIC + TCP, BBR + FQ, CPU governor, conntrack, ports:

sudo scripts/host-tuning.sh apply
sudo scripts/host-tuning.sh status

The apply step is idempotent and persists the settings via systemd-sysctl drop-ins. During any measurement run, the dashboard's TestBedSysctlMissing alert fires if either UCS has regressed to kernel defaults.
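As an illustration of what "regressing to kernel defaults" means, here is a hypothetical drift check in the same spirit — the key list below is an assumption; the authoritative set lives in scripts/host-tuning.sh:

```shell
#!/usr/bin/env bash
# Compares a list of expected sysctls (key=value per line on stdin)
# against the running kernel and prints any that differ.
check_sysctls() {
  local drift=0 key want have
  while IFS='=' read -r key want; do
    [ -z "$key" ] && continue
    have=$(sysctl -n "$key" 2>/dev/null || echo "<unset>")
    if [ "$have" != "$want" ]; then
      echo "DRIFT: $key = $have (want $want)"
      drift=1
    fi
  done
  return $drift
}

# Example expected values (assumption — check scripts/host-tuning.sh):
check_sysctls <<'EOF' || echo "host deviates from tuned profile"
net.ipv4.tcp_congestion_control=bbr
net.core.default_qdisc=fq
EOF
```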


6. Step 2 — k3s cluster bootstrap

UCS-2 first (it is the k3s server)

sudo ./scripts/k8s-install.sh --mode=dual-server --data-iface=eth1

The script:

  1. Pre-flights eth0 + eth1 + ISP interface
  2. Creates VLAN subinterfaces 40, 99, 101–120, 200–209 on eth1
  3. Brings up the eth1.40 ISP subinterface and persists it via netplan
  4. Installs k3s server, binds the API to eth0, disables Traefik and ServiceLB
  5. Installs Helm, cert-manager, Multus
  6. Labels the node role=ngfw-dut + dut-data-plane=true
  7. Prints the JOIN_TOKEN and SERVER_IP

Save the printed credentials — UCS-1 needs them.

UCS-1 (agents)

sudo ./scripts/k8s-install.sh --mode=dual-agent \
     --server-ip=<UCS-2-OOBI-IP> --token=<JOIN_TOKEN> --data-iface=eth1

The script:

  1. Pre-flights eth0 + eth1 (both mandatory)
  2. Creates VLAN subinterfaces 20, 30 on eth1
  3. Joins k3s as agent over eth0 → UCS-2:6443
  4. Labels itself role=agents + dut-data-plane=true

Verify from UCS-2:

kubectl get nodes -o wide --show-labels
# Expect (columns abridged):
# NAME    STATUS  ROLES                  VERSION  LABELS
# ucs-1   Ready   <none>                 v1.31.x  role=agents,dut-data-plane=true,...
# ucs-2   Ready   control-plane,master   v1.31.x  role=ngfw-dut,dut-data-plane=true,...

7. Step 3 — VLAN setup per node

The install script performs this automatically. To verify:

UCS-1:

ip -br link | grep eth1\\.
# eth1.20  UP  (172.16.0.1/16)
# eth1.30  UP  (172.17.0.1/16)

UCS-2:

ip -br link | grep eth1\\.
# eth1.40  UP  (no IP — DHCP via cloner pod macvlan)
# eth1.99  UP  (192.168.90.1/24)
# eth1.101 UP  (10.1.1.1/27)
# … through eth1.120 (10.1.20.1/27)
# eth1.200 UP  (10.2.1.1/27)
# … through eth1.209 (10.2.10.1/27)
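The per-VLAN addressing follows a simple pattern, which can be expressed as a pure-bash helper (illustrative only — the install script computes these addresses itself; the mapping is inferred from the listing above):

```shell
#!/usr/bin/env bash
# Persona VLAN N in 101–120 maps to 10.1.(N-100).1/27,
# clone-slot VLAN N in 200–209 maps to 10.2.(N-199).1/27.
vlan_addr() {
  local vlan=$1
  if [ "$vlan" -ge 101 ] && [ "$vlan" -le 120 ]; then
    echo "10.1.$((vlan - 100)).1/27"
  elif [ "$vlan" -ge 200 ] && [ "$vlan" -le 209 ]; then
    echo "10.2.$((vlan - 199)).1/27"
  else
    return 1   # not a persona/clone VLAN
  fi
}

vlan_addr 101   # prints 10.1.1.1/27
vlan_addr 209   # prints 10.2.10.1/27
```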

VLAN persistence across reboots is handled by netplan (script writes /etc/netplan/99-*.yaml).
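The generated netplan file has roughly this shape (a sketch for the UCS-1 VLANs; the file name shown is hypothetical — the script writes /etc/netplan/99-*.yaml, and addresses follow the guide's examples):

```yaml
# /etc/netplan/99-tlsstress-vlans.yaml (hypothetical name)
network:
  version: 2
  vlans:
    eth1.20:
      id: 20
      link: eth1
      addresses: [172.16.0.1/16]
    eth1.30:
      id: 30
      link: eth1
      addresses: [172.17.0.1/16]
```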


8. Step 4 — Apply manifests

From UCS-2:

# Create the NGFW CA configmap (once)
kubectl create configmap ngfw-ca -n web-agents --from-file=ngfw-ca.crt=<path-to-cert>

# Create application secrets from .env
kubectl create secret generic web-agent-secrets -n web-agents --from-env-file=.env

# Apply everything
sudo ./scripts/k8s-install.sh --mode=dual-apply

This applies:

  1. kubectl apply -k overlays/dual-node/ — base k8s manifests with browser-engine/synthetic-load pinned to role=agents and Dashboard/Postgres/PgBouncer/Cloner pinned to role=ngfw-dut
  2. kubectl apply -k k8s/dut/ — DUT overlay (NADs, SNMP probes, NFS server, node-tuning DaemonSet, cAdvisor ServiceMonitor, infra Prometheus rules)
  3. Patches node-tuning DaemonSet nodeSelector to dut-data-plane=true (runs on both UCS)
  4. Patches browser-engine + synthetic-load deployments via 40-playwright-patch.yaml / 50-k6-patch.yaml — Multus net1 macvlan annotation, NGFW CA trust, REJECT_INVALID_CERTS=true
  5. Re-pins browser-engine + synthetic-load to role=agents (the patch files default to role=ngfw-dut for single-node compatibility — dual-node overrides)
  6. Waits for all pods to become Ready

9. Step 5 — Verify the deployment

# Pods on UCS-1 (agents)
kubectl get pods -n web-agents -o wide --field-selector spec.nodeName=ucs-1
# Expect: web-agent-* and k6-agent-* only

# Pods on UCS-2 (services + personas)
kubectl get pods -n web-agents -o wide --field-selector spec.nodeName=ucs-2
# Expect: dashboard, postgres, pgbouncer, cloner, nfs-server,
#         persona-shop-*, persona-news-*, …, clone-persona-1-*, …

# Synthetic + cloned personas in their own namespaces
kubectl get pods --all-namespaces -o wide | grep -E "persona-|clone-persona-"

# Multus net1 attached on agents
kubectl exec -n web-agents deploy/web-agent -- ip -br addr | grep net1
# Expect a 172.16.x.x address

# Check NFS server is reachable from cloner
kubectl exec -n web-agents deploy/cloner -- mount | grep nfs

# End-to-end smoke test — agents hit a persona via the NGFW
kubectl exec -n web-agents deploy/web-agent -- \
  curl -sf https://shop.persona.local/ | head -1

10. Observability — Grafana and Prometheus in dual-node

Observability runs entirely on UCS-2 (role=ngfw-dut):

  • Prometheus — scrapes both UCS over OOBI (eth0) at port :9100 for node_exporter, kubelet :10250 for cAdvisor, and :8080 for kube-state-metrics
  • Grafana — the Test-Bed Infrastructure Health dashboard auto-adapts to two-node topology via per-host filters and count(node_exporter) recording rules
  • Alerts — the composite TestBedInfrastructureBottleneck, host-level (HostUDPBufferOverflow, HostConntrackNearFull, etc.) and pod-level (PodCPUThrottled, OOMKilled) alerts all work out-of-the-box. See MONITORING_TEST_VALIDITY.md for the full alert catalogue

The node_exporter DaemonSet has no nodeSelector and tolerations: operator: Exists, so it covers both UCS without further configuration. The NodeExporterCoverageIncomplete alert fires automatically if either node stops reporting host metrics.
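A coverage alert of that kind typically reduces to a PromQL expression like the following (a sketch — the shipped rule's exact name, labels and guards may differ):

```
# Fires when fewer than 2 node_exporter targets are up, i.e. one UCS has
# stopped reporting host metrics. Real rules usually also guard the
# zero-target case with absent(), since sum() over no series returns nothing.
sum(up{job="node-exporter"}) < 2
```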

To open Grafana from your operator workstation (assuming you have kubectl access to UCS-2):

kubectl port-forward -n web-agents svc/grafana 3000:3000
# then visit http://localhost:3000

11. Troubleshooting

UCS-1 fails to join the cluster

Symptoms: dual-agent install hangs at Joining k3s cluster as agent.

# On UCS-1 — check OOBI reachability to UCS-2
ping <UCS-2-OOBI-IP>
nc -vz <UCS-2-OOBI-IP> 6443

# On UCS-1 — check k3s-agent service status and logs
sudo systemctl status k3s-agent
sudo journalctl -u k3s-agent -n 100

Most common causes: OOBI subnet mismatch, firewall blocking 6443, or wrong JOIN_TOKEN.

browser-engine pods stay in Pending

kubectl describe pod -n web-agents -l app=web-agent | grep -A4 Events

If you see node(s) didn't match Pod's node affinity/selector, the agent re-pinning step did not run. Apply it manually:

kubectl patch deployment web-agent -n web-agents --type=strategic \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"role":"agents"}}}}}'
kubectl patch deployment k6-agent -n web-agents --type=strategic \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"role":"agents"}}}}}'

Personas stay in Pending

kubectl describe pods -n persona-shop | grep -A4 Events

If node(s) didn't match again, UCS-2 is not labeled role=ngfw-dut. Re-label:

kubectl label node ucs-2 role=ngfw-dut dut-data-plane=true --overwrite

Cloner cannot reach the internet

Cloner uses eth1.40 macvlan with DHCP. Verify on UCS-2:

ip link show eth1.40
sudo dhclient -v eth1.40   # only as a manual diagnostic — pod has its own DHCP

If eth1.40 is missing, re-run setup_isp_iface (it is part of dual-server).

node_exporter only reports one node

kubectl get pods -n web-agents -l app.kubernetes.io/name=node-exporter -o wide

Both ucs-1 and ucs-2 should appear. If not, check:

kubectl describe daemonset node-exporter -n web-agents | grep -A3 Tolerations

Tolerations should be operator: Exists (matches every taint). If a node has a custom taint blocking it, add a toleration or remove the taint.


12. Reference — overlay file

The dual-node overlay lives in overlays/dual-node/kustomization.yaml. It inherits from k8s/ and adds six strategic-merge patches:

patches:
  # browser-engine + K6 → UCS-1
  - target: { kind: Deployment, name: web-agent }   # role=agents
  - target: { kind: Deployment, name: k6-agent }    # role=agents
  # Services → UCS-2 (reuses role=ngfw-dut)
  - target: { kind: Deployment,  name: dashboard }   # role=ngfw-dut
  - target: { kind: StatefulSet, name: postgres }    # role=ngfw-dut
  - target: { kind: Deployment,  name: pgbouncer }   # role=ngfw-dut
  - target: { kind: Deployment,  name: cloner }      # role=ngfw-dut
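Each abbreviated target above expands to a full kustomize patch entry of roughly this shape (a sketch assuming inline strategic-merge patches — check the actual overlays/dual-node/kustomization.yaml):

```yaml
patches:
  - target:
      kind: Deployment
      name: dashboard
    patch: |-
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: dashboard
      spec:
        template:
          spec:
            nodeSelector:
              role: ngfw-dut
```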

To apply manually (without the install script):

kubectl apply -k overlays/dual-node/
kubectl apply -k k8s/dut/
kubectl patch daemonset node-tuning -n web-agents \
  --type=strategic \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"dut-data-plane":"true"}}}}}'

See also