
TLSStress.Art — Dual-Node Deployment on 2 UCS Servers

Deployment mode: dual-node — UCS-1 runs the agent fleet (browser-engine + synthetic-load) and UCS-2 runs everything else (personas, services, observability). For a single-server setup see UBUNTU_K3S_SINGLENODE_QUICKSTART_DEPLOY.en.md; for the four-server distributed topology see UBUNTU_K3S_MULTINODE_QUICKSTART_DEPLOY.en.md.

Last verified against shipping code: v3.7.0 (2026-05-12) — see ARCHITECTURE.md for the canonical 37 MÓDULOs + 7 Test Kinds + DOM/CPOS/PIE-PA safety architecture, plus the ZTP-prem 12/12-layer insider-operator posture (25 patent claims, Tier A/B partition, Confidential Computing, sealed audit hash-chain, K8s admission webhook, TPM 2.0 measured boot, DLP egress monitor, behavioural anomaly detector). ADRs 0014 and 0019–0025 cover post-Freeze additions.

Goal: Deploy TLSStress.Art across exactly two Cisco UCS servers — one dedicated to load generation, one dedicated to webservers + services + observability — to provide an intermediate scale point between single-node and four-server multi-node.

Who this is for: Network and systems engineers running an NGFW test-bed who have two UCS chassis available and want hardware isolation between the load generators and the webservers being measured.

Time estimate: 60–90 minutes for initial cluster setup; 15 minutes for teardown + redeploy.


Run the install script once per server in this exact order (UCS-2 first because it hosts the k3s control plane):

# Clone the repository on each server
git clone https://github.com/nollagluiz/AI_forSE.git && cd AI_forSE

# UCS-2 — k3s server + personas + services + observability
sudo ./scripts/k8s-install.sh --mode=dual-server --data-iface=eth1
# ↑ Prints JOIN_TOKEN and SERVER_IP at the end — save them for UCS-1.

# UCS-1 — browser-engine + synthetic-load agents
sudo ./scripts/k8s-install.sh --mode=dual-agent \
     --server-ip=<UCS-2-OOBI-IP> --token=<JOIN_TOKEN> --data-iface=eth1

# Back on UCS-2 — apply all Kubernetes manifests
sudo ./scripts/k8s-install.sh --mode=dual-apply

Dry run: append --dry-run to any command to preview without making changes.

The rest of this guide explains each step in detail and is useful for customising the setup or troubleshooting the automated installation.


Table of contents

  1. Why dual-node
  2. Architecture overview
  3. OOBI network — mandatory on both UCS
  4. Prerequisites per server
  5. Step 1 — Host preparation (both UCS)
  6. Step 2 — k3s cluster bootstrap
  7. Step 3 — VLAN setup per node
  8. Step 4 — Apply manifests
  9. Step 5 — Verify the deployment
  10. Observability — Grafana and Prometheus in dual-node
  11. Troubleshooting
  12. Reference — overlay file

1. Why dual-node

Single-node is the simplest layout but every workload competes for the same CPU, memory and NIC queues — under load you cannot tell whether the bottleneck is the NGFW or the test bed itself. Multi-node fixes that completely but requires four UCS chassis.

Dual-node is the middle ground:

  • Hardware isolation: load generators (browser-engine + synthetic-load) and webservers (personas + cloned-personas) run on different physical UCS chassis, so the only contention path between them is the NGFW under test — exactly the variable being measured.
  • Two roles instead of four: half the rack space, half the cabling, half the operational overhead of multi-node.
  • Same observability: identical Grafana dashboards and Prometheus alerts as single- and multi-node — node_exporter runs on both UCS via DaemonSet, and the test-bed-bottleneck alert auto-adapts to the two-node topology.

Pick dual-node when you have two chassis and want clean NGFW measurements without standing up a four-server cluster.


2. Architecture overview

Physical topology

                ┌──────────────────────────────────────────────────────┐
                │           OOBI network — eth0 on BOTH UCS            │
                │                10.0.0.0/24 (example)                 │
                │     k3s API :6443 · kubelet · Prometheus scrape      │
                └────────┬──────────────────────┬──────────────────────┘
                         │                      │
                ┌────────┴────────┐    ┌────────┴─────────────────────┐
                │     UCS-1       │    │            UCS-2             │
                │   role=agents   │    │        role=ngfw-dut         │
                │                 │    │                              │
                │  browser-engine │    │   20 Caddy persona pods      │
                │  synthetic-load │    │   10 cloned-persona slots    │
                │                 │    │   Dashboard · Postgres       │
                │                 │    │   PgBouncer · Cloner · NFS   │
                │                 │    │   Grafana · Prometheus       │
                │                 │    │   SNMP exporter              │
                │  eth0 (OOBI)    │    │   eth0 (OOBI)                │
                │  eth1 (trunk)   │    │   eth1 (trunk)               │
                └────────┬────────┘    └────────────┬─────────────────┘
                         │                          │
                 VLAN 20 (172.16/16)         VLAN 99 mgmt (192.168.90/24)
                 VLAN 30 (172.17/16)         VLAN 40 ISP (DHCP via macvlan)
                                             VLANs 101–120 (10.1.x/27)
                                             VLANs 200–209 (10.2.x/27)
                         │                          │
                         └────────────┬─────────────┘
                                      │
                          ┌───────────┴────────────┐
                          │   Cisco Nexus 9000     │
                          │   VLAN trunk · ECMP    │
                          │  DSCP AF41 · MTU 9216  │
                          └───────────┬────────────┘
                                      │
                         ┌────────────┴─────────────┐
                         │ NGFW (Device Under Test) │
                         │ TLS leg-1: agents → NGFW │
                         │ TLS leg-2: NGFW → Caddy  │
                         └──────────────────────────┘

Workload placement

Workload                                     UCS-1 (role=agents)                 UCS-2 (role=ngfw-dut)
browser-engine agents (web-agent)            ✓
synthetic-load agents (k6-agent)             ✓
Synthetic personas (20× Caddy)                                                   ✓
Cloned-persona slots (10× Caddy)                                                 ✓
Dashboard, Postgres, PgBouncer                                                   ✓
Cloner                                                                           ✓
NFS server (cloned-sites)                                                        ✓
SNMP exporter                                                                    ✓
Prometheus + Grafana                                                             ✓
node_exporter (per-host metrics)             ✓ DaemonSet                         ✓ DaemonSet
node-tuning (sysctl + BBR + CPU governor)    ✓ DaemonSet (dut-data-plane=true)   ✓ DaemonSet (dut-data-plane=true)
cni-dhcp (cloner ISP DHCP)                                                       ✓ DaemonSet

Why role=ngfw-dut on UCS-2 (instead of role=infra)

The synthetic personas (personas/_generated/*/deployment.yaml), cloned-personas (k8s/clone-personas/20-slots.yaml) and SNMP exporter (k8s/dut/60-snmp-exporter.yaml) all hardcode nodeSelector: role=ngfw-dut. Reusing that label on UCS-2 avoids touching twenty-something manifests. The dual-node overlay then patches the additional service workloads (Dashboard, Postgres, PgBouncer, Cloner) — which the multi-node overlay sends to role=infra — onto the same role=ngfw-dut node.
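For reference, the pinning that the generated persona manifests carry has this shape (an excerpt sketch — only the scheduling-relevant fields of personas/_generated/*/deployment.yaml, per the paragraph above):

```yaml
# Scheduling excerpt of a persona Deployment — the full manifest
# contains image, ports, probes, etc.
spec:
  template:
    spec:
      nodeSelector:
        role: ngfw-dut   # matches the label applied to UCS-2
```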


3. OOBI network — mandatory on both UCS

The OOBI (out-of-band infrastructure) network is the dedicated control-plane segment carried over eth0 on every UCS, regardless of deployment mode.

What runs on OOBI

  • k3s API server (:6443) — UCS-1 reaches it on UCS-2 to join the cluster
  • kubelet (:10250) — UCS-2 scrapes UCS-1 for pod status
  • flannel (CNI) — pod-to-pod traffic across nodes uses VXLAN over eth0
  • Prometheus scrape — node_exporter, cAdvisor and kube-state-metrics are all reached over eth0
  • cert-manager renewals — webhook + ACME challenge traffic
  • Dashboard / kubectl — operator access

What does NOT run on OOBI

  • NGFW test traffic — that is on eth1 VLAN trunk (data plane)
  • SNMP polling — VLAN 99 (192.168.90.0/24) on eth1.99
  • Cloner internet egress — VLAN 40 on eth1.40

Requirements per UCS

                           UCS-1 (agents)         UCS-2 (services)
eth0 IP                    static, OOBI subnet    static, OOBI subnet
Reachable from peer eth0   ✓                      ✓ (kubelet ↔ control plane)
MTU                        1500 (default)         1500 (default)
Default route via OOBI     ✓                      ✓ — internet egress for image pulls and k3s install

If eth0 is missing on either UCS, k3s flannel cannot establish the control plane, UCS-1 cannot join the cluster, and Prometheus on UCS-2 cannot scrape host metrics from UCS-1. The pre-flight check in k8s-install.sh rejects both dual-server and dual-agent if eth0 is absent.
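The pre-flight logic can be sketched as follows (hypothetical — the real check lives in scripts/k8s-install.sh and may differ; the interface name is a parameter so the logic can be exercised on any host):

```shell
#!/usr/bin/env bash
# Sketch of the eth0 pre-flight: refuse to proceed when the OOBI
# interface is absent.
iface_exists() {
  [ -d "/sys/class/net/$1" ]   # the kernel exposes every NIC here
}

require_oobi() {
  local iface=${1:-eth0}
  if ! iface_exists "$iface"; then
    echo "FATAL: OOBI interface $iface not found — aborting install" >&2
    return 1
  fi
}
# Usage on each UCS:  require_oobi eth0
```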


4. Prerequisites per server

Hardware

       UCS-1                                 UCS-2
CPU    ≥ 16 physical cores                   ≥ 32 physical cores
RAM    ≥ 64 GB                               ≥ 128 GB
NICs   eth0 (OOBI) + eth1 (trunk to Nexus)   eth0 (OOBI) + eth1 (trunk to Nexus)
Disk   ≥ 200 GB free on /                    ≥ 500 GB free on / (NFS + Postgres + Prometheus retention)

UCS-2 is the heavier node — it runs personas, all services, and the observability stack. Plan for 2× the resources of UCS-1.

Operating system

  • Ubuntu 22.04 LTS or 24.04 LTS, amd64
  • root access (or sudo)
  • internet access via OOBI default route during install (k3s + Helm + container images)

Network — Nexus 9000

VLANs that must be configured on the trunk to both UCS:

VLAN      UCS-1   UCS-2   Subnet              Use
20        ✓               172.16.0.0/16       browser-engine agents
30        ✓               172.17.0.0/16       synthetic-load agents
40                ✓       DHCP from ISP       Cloner internet egress (macvlan)
99                ✓       192.168.90.0/24     SNMP polling — Nexus (.2), NGFW (.3)
101–120           ✓       10.1.{1..20}.0/27   Synthetic persona webservers
200–209           ✓       10.2.{1..10}.0/27   Cloned persona slots

Strictly speaking, only UCS-1 consumes VLANs 20 and 30, and only UCS-2 consumes VLANs 40, 99, 101–120 and 200–209. In practice the Nexus trunk can carry all of them on both ports and let each side ignore the VLANs it does not use — the trunk configuration is then identical for the two ports.
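On the Nexus side, an identical trunk on both UCS-facing ports might look like this (a sketch — the interface range is a placeholder, and platform-specific QoS/MTU policy may apply; verify against your switch configuration):

```
interface Ethernet1/1-2        ! ports to UCS-1 eth1 and UCS-2 eth1 (placeholder IDs)
  switchport mode trunk
  switchport trunk allowed vlan 20,30,40,99,101-120,200-209
  mtu 9216                     ! matches the MTU shown in the topology diagram
  no shutdown
```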


5. Step 1 — Host preparation (both UCS)

Identical preparation on UCS-1 and UCS-2:

sudo apt-get update
sudo apt-get install -y --no-install-recommends \
  curl git jq openssl ca-certificates gnupg lsb-release \
  vlan iproute2 ethtool net-tools

# Verify both NICs
ip link show eth0
ip link show eth1

Then run the project's host-tuning script on both UCS — sysctls for high-fan-out QUIC + TCP, BBR + FQ, CPU governor, conntrack, ports:

sudo scripts/host-tuning.sh apply
sudo scripts/host-tuning.sh status

The apply step is idempotent and persists the settings via systemd-sysctl drop-ins. During any measurement run, the dashboard's TestBedSysctlMissing alert fires if either UCS has regressed to kernel defaults.
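As an illustration of what "regressing to kernel defaults" means, here is a hypothetical drift check in the same spirit — the key list below is an assumption; the authoritative set lives in scripts/host-tuning.sh:

```shell
#!/usr/bin/env bash
# Compares a list of expected sysctls (key=value per line on stdin)
# against the running kernel and prints any that differ.
check_sysctls() {
  local drift=0 key want have
  while IFS='=' read -r key want; do
    [ -z "$key" ] && continue
    have=$(sysctl -n "$key" 2>/dev/null || echo "<unset>")
    if [ "$have" != "$want" ]; then
      echo "DRIFT: $key = $have (want $want)"
      drift=1
    fi
  done
  return $drift
}

# Example expected values (assumption — check scripts/host-tuning.sh):
check_sysctls <<'EOF' || echo "host deviates from tuned profile"
net.ipv4.tcp_congestion_control=bbr
net.core.default_qdisc=fq
EOF
```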


6. Step 2 — k3s cluster bootstrap

UCS-2 first (it is the k3s server)

sudo ./scripts/k8s-install.sh --mode=dual-server --data-iface=eth1

The script:

  1. Pre-flights eth0 + eth1 + ISP interface
  2. Creates VLAN subinterfaces 40, 99, 101–120, 200–209 on eth1
  3. Brings up the eth1.40 ISP subinterface and persists it via netplan
  4. Installs k3s server, binds the API to eth0, disables Traefik and ServiceLB
  5. Installs Helm, cert-manager, Multus
  6. Labels the node role=ngfw-dut + dut-data-plane=true
  7. Prints the JOIN_TOKEN and SERVER_IP

Save the printed credentials — UCS-1 needs them.

UCS-1 (agents)

sudo ./scripts/k8s-install.sh --mode=dual-agent \
     --server-ip=<UCS-2-OOBI-IP> --token=<JOIN_TOKEN> --data-iface=eth1

The script:

  1. Pre-flights eth0 + eth1 (both mandatory)
  2. Creates VLAN subinterfaces 20, 30 on eth1
  3. Joins k3s as agent over eth0 → UCS-2:6443
  4. Labels itself role=agents + dut-data-plane=true

Verify from UCS-2:

kubectl get nodes -o wide --show-labels
# Expect (columns abridged):
# NAME    STATUS  ROLES                  VERSION  LABELS
# ucs-1   Ready   <none>                 v1.31.x  role=agents,dut-data-plane=true,...
# ucs-2   Ready   control-plane,master   v1.31.x  role=ngfw-dut,dut-data-plane=true,...

7. Step 3 — VLAN setup per node

The install script performs this automatically. To verify:

UCS-1:

ip -br link | grep eth1\\.
# eth1.20  UP  (172.16.0.1/16)
# eth1.30  UP  (172.17.0.1/16)

UCS-2:

ip -br link | grep eth1\\.
# eth1.40  UP  (no IP — DHCP via cloner pod macvlan)
# eth1.99  UP  (192.168.90.1/24)
# eth1.101 UP  (10.1.1.1/27)
# … through eth1.120 (10.1.20.1/27)
# eth1.200 UP  (10.2.1.1/27)
# … through eth1.209 (10.2.10.1/27)
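The per-VLAN addressing follows a simple pattern, which can be expressed as a pure-bash helper (illustrative only — the install script computes these addresses itself; the mapping is inferred from the listing above):

```shell
#!/usr/bin/env bash
# Persona VLAN N in 101–120 maps to 10.1.(N-100).1/27,
# clone-slot VLAN N in 200–209 maps to 10.2.(N-199).1/27.
vlan_addr() {
  local vlan=$1
  if [ "$vlan" -ge 101 ] && [ "$vlan" -le 120 ]; then
    echo "10.1.$((vlan - 100)).1/27"
  elif [ "$vlan" -ge 200 ] && [ "$vlan" -le 209 ]; then
    echo "10.2.$((vlan - 199)).1/27"
  else
    return 1   # not a persona/clone VLAN
  fi
}

vlan_addr 101   # prints 10.1.1.1/27
vlan_addr 209   # prints 10.2.10.1/27
```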

VLAN persistence across reboots is handled by netplan (script writes /etc/netplan/99-*.yaml).
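The generated netplan file has roughly this shape (a sketch for the UCS-1 VLANs; the file name shown is hypothetical — the script writes /etc/netplan/99-*.yaml, and addresses follow the guide's examples):

```yaml
# /etc/netplan/99-tlsstress-vlans.yaml (hypothetical name)
network:
  version: 2
  vlans:
    eth1.20:
      id: 20
      link: eth1
      addresses: [172.16.0.1/16]
    eth1.30:
      id: 30
      link: eth1
      addresses: [172.17.0.1/16]
```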


8. Step 4 — Apply manifests

From UCS-2:

# Create the NGFW CA configmap (once)
kubectl create configmap ngfw-ca -n web-agents --from-file=ngfw-ca.crt=<path-to-cert>

# Create application secrets from .env
kubectl create secret generic web-agent-secrets -n web-agents --from-env-file=.env

# Apply everything
sudo ./scripts/k8s-install.sh --mode=dual-apply

This applies:

  1. kubectl apply -k overlays/dual-node/ — base k8s manifests with browser-engine/synthetic-load pinned to role=agents and Dashboard/Postgres/PgBouncer/Cloner pinned to role=ngfw-dut
  2. kubectl apply -k k8s/dut/ — DUT overlay (NADs, SNMP probes, NFS server, node-tuning DaemonSet, cAdvisor ServiceMonitor, infra Prometheus rules)
  3. Patches node-tuning DaemonSet nodeSelector to dut-data-plane=true (runs on both UCS)
  4. Patches browser-engine + synthetic-load deployments via 40-playwright-patch.yaml / 50-k6-patch.yaml — Multus net1 macvlan annotation, NGFW CA trust, REJECT_INVALID_CERTS=true
  5. Re-pins browser-engine + synthetic-load to role=agents (the patch files default to role=ngfw-dut for single-node compatibility — dual-node overrides)
  6. Waits for all pods to become Ready

9. Step 5 — Verify the deployment

# Pods on UCS-1 (agents)
kubectl get pods -n web-agents -o wide --field-selector spec.nodeName=ucs-1
# Expect: web-agent-* and k6-agent-* only

# Pods on UCS-2 (services + personas)
kubectl get pods -n web-agents -o wide --field-selector spec.nodeName=ucs-2
# Expect: dashboard, postgres, pgbouncer, cloner, nfs-server,
#         persona-shop-*, persona-news-*, …, clone-persona-1-*, …

# Synthetic + cloned personas in their own namespaces
kubectl get pods --all-namespaces -o wide | grep -E "persona-|clone-persona-"

# Multus net1 attached on agents
kubectl exec -n web-agents deploy/web-agent -- ip -br addr | grep net1
# Expect a 172.16.x.x address

# Check NFS server is reachable from cloner
kubectl exec -n web-agents deploy/cloner -- mount | grep nfs

# End-to-end smoke test — agents hit a persona via the NGFW
kubectl exec -n web-agents deploy/web-agent -- \
  curl -sf https://shop.persona.local/ | head -1

10. Observability — Grafana and Prometheus in dual-node

Observability runs entirely on UCS-2 (role=ngfw-dut):

  • Prometheus — scrapes both UCS over OOBI (eth0) at port :9100 for node_exporter, kubelet :10250 for cAdvisor, and :8080 for kube-state-metrics
  • Grafana — the Test-Bed Infrastructure Health dashboard auto-adapts to two-node topology via per-host filters and count(node_exporter) recording rules
  • Alerts — the composite TestBedInfrastructureBottleneck, host-level (HostUDPBufferOverflow, HostConntrackNearFull, etc.) and pod-level (PodCPUThrottled, OOMKilled) alerts all work out-of-the-box. See MONITORING_TEST_VALIDITY.md for the full alert catalogue

The node_exporter DaemonSet has no nodeSelector and tolerations: operator: Exists, so it covers both UCS without further configuration. The NodeExporterCoverageIncomplete alert fires automatically if either node stops reporting host metrics.
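A coverage alert of that kind typically reduces to a PromQL expression like the following (a sketch — the shipped rule's exact name, labels and guards may differ):

```
# Fires when fewer than 2 node_exporter targets are up, i.e. one UCS has
# stopped reporting host metrics. Real rules usually also guard the
# zero-target case with absent(), since sum() over no series returns nothing.
sum(up{job="node-exporter"}) < 2
```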

To open Grafana from your operator workstation (assuming you have kubectl access to UCS-2):

kubectl port-forward -n web-agents svc/grafana 3000:3000
# then visit http://localhost:3000

11. Troubleshooting

UCS-1 fails to join the cluster

Symptoms: dual-agent install hangs at Joining k3s cluster as agent.

# On UCS-1 — check OOBI reachability to UCS-2
ping <UCS-2-OOBI-IP>
nc -vz <UCS-2-OOBI-IP> 6443

# On UCS-1 — check k3s-agent service status and logs
sudo systemctl status k3s-agent
sudo journalctl -u k3s-agent -n 100

Most common causes: OOBI subnet mismatch, firewall blocking 6443, or wrong JOIN_TOKEN.

browser-engine pods stay in Pending

kubectl describe pod -n web-agents -l app=web-agent | grep -A4 Events

If you see node(s) didn't match Pod's node affinity/selector, the agent re-pinning step did not run. Apply it manually:

kubectl patch deployment web-agent -n web-agents --type=strategic \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"role":"agents"}}}}}'
kubectl patch deployment k6-agent -n web-agents --type=strategic \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"role":"agents"}}}}}'

Personas stay in Pending

kubectl describe pods -n persona-shop | grep -A4 Events

If node(s) didn't match again, UCS-2 is not labeled role=ngfw-dut. Re-label:

kubectl label node ucs-2 role=ngfw-dut dut-data-plane=true --overwrite

Cloner cannot reach the internet

Cloner uses eth1.40 macvlan with DHCP. Verify on UCS-2:

ip link show eth1.40
sudo dhclient -v eth1.40   # only as a manual diagnostic — pod has its own DHCP

If eth1.40 is missing, re-run setup_isp_iface (it is part of dual-server).

node_exporter only reports one node

kubectl get pods -n web-agents -l app.kubernetes.io/name=node-exporter -o wide

Both ucs-1 and ucs-2 should appear. If not, check:

kubectl describe daemonset node-exporter -n web-agents | grep -A3 Tolerations

Tolerations should be operator: Exists (matches every taint). If a node has a custom taint blocking it, add a toleration or remove the taint.


12. Reference — overlay file

The dual-node overlay lives in overlays/dual-node/kustomization.yaml. It inherits from k8s/ and adds six strategic-merge patches:

patches:
  # browser-engine + K6 → UCS-1
  - target: { kind: Deployment, name: web-agent }   # role=agents
  - target: { kind: Deployment, name: k6-agent }    # role=agents
  # Services → UCS-2 (reuses role=ngfw-dut)
  - target: { kind: Deployment,  name: dashboard }   # role=ngfw-dut
  - target: { kind: StatefulSet, name: postgres }    # role=ngfw-dut
  - target: { kind: Deployment,  name: pgbouncer }   # role=ngfw-dut
  - target: { kind: Deployment,  name: cloner }      # role=ngfw-dut
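Each abbreviated target above expands to a full kustomize patch entry of roughly this shape (a sketch assuming inline strategic-merge patches — check the actual overlays/dual-node/kustomization.yaml):

```yaml
patches:
  - target:
      kind: Deployment
      name: dashboard
    patch: |-
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: dashboard
      spec:
        template:
          spec:
            nodeSelector:
              role: ngfw-dut
```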

To apply manually (without the install script):

kubectl apply -k overlays/dual-node/
kubectl apply -k k8s/dut/
kubectl patch daemonset node-tuning -n web-agents \
  --type=strategic \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"dut-data-plane":"true"}}}}}'

See also