
TLSStress.Art — Multi-Node Deployment on 4 UCS Servers

Deployment mode: multi-node — each component tier (personas, browser-engine agents, synthetic-load agents, infrastructure) runs on a dedicated physical server. This maximises throughput and isolates workloads. For a simpler setup where everything runs on one machine, see UBUNTU_K3S_SINGLENODE_QUICKSTART_DEPLOY.en.md.

Goal: Deploy TLSStress.Art across four dedicated Cisco UCS servers to achieve maximum throughput for NGFW TLS inspection testing. Each server is dedicated to one role: webservers (personas), browser-engine agents, synthetic-load agents, and shared infrastructure.

Last verified against shipping code: v3.7.0 (2026-05-12) — see ARCHITECTURE.md for the canonical 37 MÓDULOs + 7 Test Kinds + DOM/CPOS/PIE-PA safety architecture and the ZTP-prem 12/12-layer insider-operator posture (25 patent claims, Tier A/B partition, Confidential Computing, sealed audit hash-chain, K8s admission webhook, TPM 2.0 measured-boot, DLP egress monitor, behavioural anomaly detector). ADRs 0014 and 0019–0025 cover post-Freeze additions.

Who this is for: Network and systems engineers deploying a physical NGFW test-bed with 30 persona webservers (20 Synthetic + 10 Cloned slots), up to 300 browser-engine agents, and up to 1,000 synthetic-load agents.

Time estimate: 2–3 hours for initial cluster setup; 30 minutes for subsequent teardown and re-deploy.

Author: André Luiz Gallon — agallon@Cisco.com


Use the automated install script to set up each server. The script handles k3s installation, VLAN configuration and Kubernetes dependency setup — no prior Kubernetes knowledge required.

Run once per server, in order:

# Clone the repo on each server
git clone https://github.com/nollagluiz/AI_forSE.git && cd AI_forSE

# UCS-1 — k3s server + persona webservers (VLANs 101–120)
sudo ./scripts/k8s-install.sh --mode=multi-server --role=ngfw-dut --data-iface=eth1
# ↑ Prints JOIN_TOKEN and SERVER_IP at the end — save them for the next steps.

# UCS-2 — browser-engine agents (VLAN 20)
sudo ./scripts/k8s-install.sh --mode=multi-agent --role=playwright \
     --server-ip=<UCS-1-IP> --token=<JOIN_TOKEN> --data-iface=eth1

# UCS-3 — synthetic-load agents (VLAN 30)
sudo ./scripts/k8s-install.sh --mode=multi-agent --role=k6 \
     --server-ip=<UCS-1-IP> --token=<JOIN_TOKEN> --data-iface=eth1

# UCS-4 — Dashboard, Postgres, Grafana (eth0 only, no data-plane NIC needed)
sudo ./scripts/k8s-install.sh --mode=multi-agent --role=infra \
     --server-ip=<UCS-1-IP> --token=<JOIN_TOKEN>

# Back on UCS-1 — apply all Kubernetes manifests
sudo ./scripts/k8s-install.sh --mode=multi-apply

Dry run: add --dry-run to any command to preview what would be executed without making changes to the system.

The rest of this guide explains each step in detail, which is useful for customising the setup or troubleshooting the automated installation.


Table of Contents

  1. Architecture Overview
  2. OOBI Network — What It Is and Why It Is Separate
  3. Prerequisites Per Server
  4. Step 1 — Host Preparation (All Nodes)
  5. Step 2 — k3s Cluster Bootstrap
  6. Step 3 — Install Cluster Dependencies
  7. Step 4 — VLAN Interface Setup
  8. Step 5 — Create Secrets
  9. Step 6 — Apply Multi-Node Overlay
  10. Step 7 — Apply DUT Overlay
  11. Step 8 — Platform, DNS, and Personas
  12. Verification Per Tier
  13. Troubleshooting
  14. Day-to-Day Operations
  15. Reference — Overlay Files

1. Architecture Overview

Physical topology

                        ┌─────────────────────────────────────────────────────┐
                        │            OOBI network — eth0 on all nodes         │
                        │               10.0.0.0/24 (example)                 │
                        │  k3s API :6443 · kubelet · Prometheus :9091 · DNS   │
                        └───┬───────────────┬────────────────┬────────────────┘
                            │               │                │
             ┌──────────────┴──┐  ┌─────────┴──────┐  ┌────┴──────────────────┐
             │    UCS-1         │  │    UCS-2        │  │    UCS-3             │
             │  role=ngfw-dut   │  │  role=playwright│  │  role=k6             │
             │                  │  │                 │  │                      │
             │  20 Caddy pods   │  │  browser-engine │  │  synthetic-load      │
             │  (personas)      │  │  agent pods     │  │  agent pods (1–1000) │
             │  eth0 (OOBI)     │  │  eth0 (OOBI)   │  │  eth0 (OOBI)         │
             │  eth1 (trunk)    │  │  eth1 (trunk)   │  │  eth1 (trunk)        │
             └──────────┬───────┘  └────────┬────────┘  └──────────┬───────────┘
                        │                   │                       │
              VLANs 101-120, 99    VLAN 20 (172.16.0.0/16)  VLAN 30 (172.17.0.0/16)
              10.1.x.0/27                   │                       │
                        │                   │                       │
                        │          ┌────────┴───────────────────────┴────────┐
                        │          │          Nexus 9000 switch               │
                        │          │  VLAN trunk · ECMP · DSCP AF41 · MTU 9216│
                        │          └────────────────────┬────────────────────┘
                        │                               │
                        │          ┌────────────────────┴────────────────────┐
                        └──────────┤  NGFW (Device Under Test)               │
                                   │  TLS leg-1: agents → NGFW               │
                                   │  TLS leg-2: NGFW → Caddy webservers     │
                                   │  NGFW CA trusted by agents              │
                                   │  persona-ca trusted by NGFW             │
                                   └─────────────────────────────────────────┘

             ┌──────────────────────────────────────────────────────────────────┐
             │                    UCS-4  role=infra                             │
             │                                                                  │
             │  Dashboard · Postgres · PgBouncer · Grafana · cert-manager       │
             │  eth0 only (OOBI) — no data-plane traffic                        │
             └──────────────────────────────────────────────────────────────────┘

Traffic flow (data plane — eth1 only)

browser-engine agent (UCS-2, VLAN 20)  ──HTTPS──►  NGFW (TLS leg-1 decrypt)
synthetic-load agent (UCS-3, VLAN 30)  ──HTTPS──►  NGFW (TLS leg-1 decrypt)
                                                 │
                                                 │  TLS leg-2 (re-encrypt)
                                                 ▼
                                   Caddy webserver (UCS-1, VLAN 101-120)
                                   ◄──── Prometheus metrics via eth0 ──────

Node role summary

| Server | Role label | Workloads | NICs | VLANs |
|---|---|---|---|---|
| UCS-1 | role=ngfw-dut | 20 Synthetic Persona Caddy webservers + 10 Cloned Persona slots | eth0 (OOBI/k8s) + eth1 (VLAN trunk) | 101–120 (10.1.x.0/27) + 200–209 (10.2.{1..10}.0/27) + VLAN 99 (SNMP 192.168.90.0/24) |
| UCS-2 | role=playwright | browser-engine agents (1–300) | eth0 (OOBI/k8s) + eth1 (VLAN trunk) | VLAN 20 (172.16.0.0/16) |
| UCS-3 | role=k6 | synthetic-load agents (1–1000) | eth0 (OOBI/k8s) + eth1 (VLAN trunk) | VLAN 30 (172.17.0.0/16) |
| UCS-4 | role=infra | Dashboard, Postgres, Grafana, cert-manager | eth0 (OOBI/k8s) only | None |

2. OOBI Network

OOBI stands for Out-Of-Band Infrastructure. It is the eth0 interface on every UCS server and carries all Kubernetes control plane and management traffic. All four UCS servers must be on the same OOBI L2 segment (e.g., 10.0.0.0/24) so that k3s, etcd, kubelet, and Prometheus can communicate without traversing the NGFW.

What the OOBI network carries

| Traffic type | Port | Direction |
|---|---|---|
| k3s API server | TCP 6443 | All agents → UCS-1 |
| kubelet heartbeats | TCP 10250 | All nodes → UCS-1 |
| etcd | TCP 2379–2380 | UCS-1 internal (single-server k3s) |
| Flannel overlay (VXLAN) | UDP 8472 | All nodes ↔ all nodes |
| Prometheus scrape (Caddy) | TCP 9091 | UCS-4 → UCS-1 (eth0 on Caddy pods) |
| Dashboard → Postgres | TCP 5432 | UCS-4 internal (cluster DNS) |
| cert-manager → API server | TCP 6443 | UCS-4 → UCS-1 |

What the OOBI network does NOT carry

The data plane — HTTPS test traffic between agents and Caddy webservers — flows exclusively over eth1 through the Nexus 9000 switch and NGFW. The macvlan interfaces on eth1 bypass iptables, kube-proxy, and Kubernetes NetworkPolicy entirely. This is intentional: it ensures measured latency and throughput reflect only the NGFW and not kernel forwarding.

OOBI isolation requirement

Do not route OOBI through the NGFW. If the NGFW reboots or is misconfigured, the k3s cluster must remain operational so you can investigate pod logs and metrics.
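
A quick way to confirm this on each node is to check which interface the kernel uses to reach UCS-1's OOBI address:

# From any agent node — the route to the k3s server must leave via eth0:
ip route get <UCS1_OOBI_IP>
# Expected output includes "dev eth0"; if it shows eth1 or a VLAN subinterface,
# fix the host routing before continuing.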


3. Prerequisites Per Server

Common to all nodes

| Item | Value |
|---|---|
| OS | Ubuntu 22.04 LTS or 24.04 LTS (x86_64) |
| RAM | 64 GB minimum (128 GB recommended for UCS-2 and UCS-3) |
| vCPUs | 16+ (UCS-1/2/3); 8+ (UCS-4) |
| Disk | 200 GB+ free on /var |
| User | sudo access |
| Connectivity | All 4 nodes reachable on the OOBI L2 segment before install |
| Internet access | Required for bootstrap (k3s binary + GHCR container images); can be offline after the initial pull |

Per-server specifics

| Server | Additional requirements |
|---|---|
| UCS-1 | eth1 connected to a Nexus 9000 trunk port (VLANs 99, 101–120 allowed); NGFW connected on the VLAN 99 side |
| UCS-2 | eth1 connected to a Nexus 9000 trunk port (VLAN 20 allowed) |
| UCS-3 | eth1 connected to a Nexus 9000 trunk port (VLAN 30 allowed) |
| UCS-4 | eth0 only; the local-path StorageClass (default in k3s) is sufficient for the Postgres PVC |

Software installed before you begin

# On every node:
sudo apt update && sudo apt install -y \
  curl ca-certificates iptables openssl jq git \
  python3 iproute2 vlan net-tools

4. Step 1 — Host Preparation (All Nodes)

Run the following on every UCS server before joining the k3s cluster.

1.1 Load 8021q kernel module and make it persistent

sudo modprobe 8021q
echo "8021q" | sudo tee /etc/modules-load.d/8021q.conf

1.2 Disable swap

sudo swapoff -a
sudo sed -i '/\bswap\b/d' /etc/fstab

1.3 Increase inotify limits (required for Kubernetes)

cat <<'EOF' | sudo tee /etc/sysctl.d/99-k8s.conf
fs.inotify.max_user_watches   = 524288
fs.inotify.max_user_instances = 512
EOF
sudo sysctl --system

1.4 Verify OOBI connectivity

From each node, confirm that all other nodes are reachable on their OOBI IPs before proceeding. Replace the example IPs with your actual OOBI addresses.

# Example: from UCS-2
ping -c 2 10.0.0.1   # UCS-1 (future k3s server)
ping -c 2 10.0.0.3   # UCS-3
ping -c 2 10.0.0.4   # UCS-4

5. Step 2 — k3s Cluster Bootstrap

UCS-1 is the k3s server (runs the control plane). UCS-2, UCS-3, and UCS-4 are k3s agents that join the cluster. All cluster traffic uses eth0 (the OOBI interface) via --flannel-iface=eth0.

2.1 UCS-1 — Install k3s server

curl -sfL https://get.k3s.io | \
  INSTALL_K3S_EXEC="--disable=traefik \
    --write-kubeconfig-mode=644 \
    --flannel-iface=eth0 \
    --node-label=role=ngfw-dut \
    --node-label=dut-data-plane=true" sh -

# Wait for the node to become Ready:
sudo k3s kubectl get nodes --watch

Expected output:

NAME    STATUS   ROLES                  AGE   VERSION
ucs-1   Ready    control-plane,master   60s   v1.32.x+k3s1

2.2 Retrieve the join token

sudo cat /var/lib/rancher/k3s/server/node-token
# Copy this value — you will need it for UCS-2, 3, and 4.

2.3 Configure kubectl on UCS-1 (no sudo)

mkdir -p ~/.kube
cp /etc/rancher/k3s/k3s.yaml ~/.kube/config
chmod 600 ~/.kube/config
export KUBECONFIG=~/.kube/config
echo 'export KUBECONFIG=~/.kube/config' >> ~/.bashrc

2.4 UCS-2 — Join as browser-engine node

# Replace <UCS1_OOBI_IP> and <TOKEN> with actual values.
curl -sfL https://get.k3s.io | \
  K3S_URL=https://<UCS1_OOBI_IP>:6443 \
  K3S_TOKEN=<TOKEN> \
  INSTALL_K3S_EXEC="--node-label=role=playwright \
    --node-label=dut-data-plane=true \
    --flannel-iface=eth0" sh -

2.5 UCS-3 — Join as synthetic-load node

curl -sfL https://get.k3s.io | \
  K3S_URL=https://<UCS1_OOBI_IP>:6443 \
  K3S_TOKEN=<TOKEN> \
  INSTALL_K3S_EXEC="--node-label=role=k6 \
    --node-label=dut-data-plane=true \
    --flannel-iface=eth0" sh -

2.6 UCS-4 — Join as Infra node

curl -sfL https://get.k3s.io | \
  K3S_URL=https://<UCS1_OOBI_IP>:6443 \
  K3S_TOKEN=<TOKEN> \
  INSTALL_K3S_EXEC="--node-label=role=infra \
    --flannel-iface=eth0" sh -

2.7 Verify all four nodes (run from UCS-1)

kubectl get nodes -o wide

Expected output:

NAME    STATUS   ROLES                  LABELS (abbreviated)
ucs-1   Ready    control-plane,master   role=ngfw-dut,dut-data-plane=true
ucs-2   Ready    <none>                 role=playwright,dut-data-plane=true
ucs-3   Ready    <none>                 role=k6,dut-data-plane=true
ucs-4   Ready    <none>                 role=infra

If a node shows NotReady, check its k3s agent service:

# On the problematic node:
sudo journalctl -u k3s-agent --no-pager -n 30

6. Step 3 — Install Cluster Dependencies

Run all commands from UCS-1 with KUBECONFIG set.

3.1 Multus CNI

Required on UCS-1, UCS-2, and UCS-3 so that Caddy, browser engine, and synthetic-load pods can attach macvlan interfaces.

kubectl apply -f \
  https://raw.githubusercontent.com/k8snetworkplumbingwg/multus-cni/master/deployments/multus-daemonset-thick.yml

kubectl -n kube-system wait \
  --for=condition=ready pod -l app=multus --timeout=180s
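
Optionally, confirm that the NetworkAttachmentDefinition CRD Multus provides is registered — the DUT overlay's NADs depend on it:

kubectl get crd network-attachment-definitions.k8s.cni.cncf.io
# Expected: the CRD is listed; if not, re-check the Multus DaemonSet logs.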

3.2 cert-manager

Required for the DUT PKI (persona-ca-issuer ClusterIssuer in platform/pki/ signs TLS certificates for all 20 Synthetic Personas and 10 Cloned Persona slots).

kubectl apply -f \
  https://github.com/cert-manager/cert-manager/releases/download/v1.14.5/cert-manager.yaml

kubectl -n cert-manager wait \
  --for=condition=Available deploy --all --timeout=180s

3.3 Helm (for kube-prometheus-stack)

curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

3.4 kube-prometheus-stack (Grafana + Prometheus)

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm upgrade --install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set grafana.service.type=NodePort \
  --set grafana.service.nodePort=30300 \
  --set prometheus.prometheusSpec.scrapeInterval=15s \
  --wait

3.5 Stakater Reloader

Stakater Reloader automatically restarts pods when their ConfigMaps or Secrets change — used by the webserver and SNMP exporter.

helm repo add stakater https://stakater.github.io/stakater-charts
helm repo update
helm upgrade --install reloader stakater/reloader \
  --namespace reloader --create-namespace \
  --wait
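
Reloader only acts on workloads that opt in through its annotation. The repo manifests are expected to carry it already; for reference, a Deployment you add yourself can be opted in like this (the deployment name is a placeholder):

kubectl -n web-agents annotate deployment <my-deployment> \
  reloader.stakater.com/auto="true" --overwrite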

7. Step 4 — VLAN Interface Setup

Each data-plane node (UCS-1, UCS-2, UCS-3) needs its VLAN subinterfaces created before applying Multus NetworkAttachmentDefinitions. VLAN interfaces must be persistent across reboots.

4.1 UCS-1 — Persona VLANs (101–120) and SNMP VLAN 99

The repo ships scripts/netsetup-personas.sh and scripts/netsetup-dut.sh for this purpose.

# On UCS-1:
sudo DUT_DATA_IFACE=eth1 bash scripts/netsetup-personas.sh setup
# Creates eth1.101 through eth1.120 with 10.1.x.1 addresses.

sudo DUT_DATA_IFACE=eth1 bash scripts/netsetup-dut.sh setup
# Creates eth1.99 (SNMP management VLAN).

Verify:

ip -d link show | grep -E 'eth1\.(99|10[0-9]|1[012][0-9])'
# Each subinterface should show: link/ether ... vlan id <N>
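
If you need to reproduce the persona VLANs by hand (for example on a lab host without the repo), the following sketch approximates what netsetup-personas.sh produces — it assumes the documented 10.1.<n>.1/27 addressing on VLANs 101–120 and is not a substitute for the script:

# Manual equivalent (assumption: VLAN 100+n carries 10.1.<n>.1/27, n = 1..20)
for n in $(seq 1 20); do
  vlan=$((100 + n))
  sudo ip link add link eth1 name "eth1.${vlan}" type vlan id "${vlan}"
  sudo ip addr add "10.1.${n}.1/27" dev "eth1.${vlan}"
  sudo ip link set "eth1.${vlan}" up
done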

4.2 UCS-2 — browser-engine VLAN 20

# On UCS-2 (run as root or with sudo):
ip link add link eth1 name eth1.20 type vlan id 20
ip addr add 172.16.0.1/16 dev eth1.20
ip link set eth1.20 up

To make it persistent with Netplan:

cat <<'EOF' | sudo tee /etc/netplan/60-vlan20.yaml
network:
  version: 2
  vlans:
    eth1.20:
      id: 20
      link: eth1
      addresses:
        - 172.16.0.1/16
EOF
sudo netplan apply

4.3 UCS-3 — synthetic-load VLAN 30

# On UCS-3:
ip link add link eth1 name eth1.30 type vlan id 30
ip addr add 172.17.0.1/16 dev eth1.30
ip link set eth1.30 up

Netplan persistence:

cat <<'EOF' | sudo tee /etc/netplan/60-vlan30.yaml
network:
  version: 2
  vlans:
    eth1.30:
      id: 30
      link: eth1
      addresses:
        - 172.17.0.1/16
EOF
sudo netplan apply

4.4 Verify VLAN interfaces are up before continuing

From UCS-1:

ip -brief addr show | grep -E 'eth1\.(99|10[0-9]|1[012][0-9])'
# Expected: each interface shows UP and an IP address.

8. Step 5 — Create Secrets

All secrets must be created in the web-agents namespace before applying any workloads.
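
The placeholder tokens and session secret used below can be generated with openssl; any 64-character hex string works:

# 32 random bytes → 64 hex characters
openssl rand -hex 32   # run once for AGENT_API_TOKEN, once for SESSION_SECRET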

5.1 Create the namespace first

kubectl apply -f k8s/00-namespace.yaml

5.2 browser-engine agent secrets

kubectl -n web-agents create secret generic web-agent-secrets \
  --from-literal=AGENT_API_TOKEN="<64-char-hex-token>" \
  --from-literal=DASHBOARD_URL="http://dashboard.web-agents.svc.cluster.local:3000"

5.3 synthetic-load agent secrets

kubectl -n web-agents create secret generic k6-agent-secrets \
  --from-literal=AGENT_API_TOKEN="<same-64-char-hex-token>" \
  --from-literal=DASHBOARD_URL="http://dashboard.web-agents.svc.cluster.local:3000"

5.4 Postgres credentials

kubectl -n web-agents create secret generic postgres-credentials \
  --from-literal=POSTGRES_USER="agent_dashboard" \
  --from-literal=POSTGRES_PASSWORD="<strong-db-password>" \
  --from-literal=POSTGRES_DB="agent_dashboard"

5.5 Dashboard secrets

kubectl -n web-agents create secret generic dashboard-secrets \
  --from-literal=DATABASE_URL="postgresql://agent_dashboard:<password>@postgres.web-agents.svc.cluster.local:5432/agent_dashboard?sslmode=require" \
  --from-literal=AGENT_API_TOKEN="<same-token>" \
  --from-literal=ADMIN_BASIC_AUTH="admin:<strong-admin-password>" \
  --from-literal=SESSION_SECRET="<64-char-random-hex>"

5.6 NGFW CA ConfigMap (DUT mode)

Export the CA certificate from your NGFW and create the ConfigMap:

# Replace ngfw-ca.crt with the actual path to your NGFW's CA certificate.
kubectl -n web-agents create configmap ngfw-ca \
  --from-file=ngfw-ca.crt=./ngfw-ca.crt

5.7 SNMP secrets and ConfigMap (DUT mode)

kubectl -n web-agents create secret generic dut-snmp-secrets \
  --from-literal=SNMP_COMMUNITY="<your-snmp-community>"

kubectl -n web-agents create configmap snmp-config \
  --from-file=snmp.yml=./k8s/dut/snmp.yml

9. Step 6 — Apply Multi-Node Overlay

The multi-node Kustomize overlay at k8s/overlays/multi-node/ patches nodeSelector fields on each Deployment and StatefulSet so that workloads land on their designated UCS server.

# From the repo root on UCS-1:
kubectl apply -k k8s/overlays/multi-node/

This single command replaces the single-node kubectl apply -f k8s/ and applies every resource in k8s/ with the appropriate nodeSelector patches:

  • web-agent Deployment → role: playwright (UCS-2)
  • k6-agent Deployment → role: k6 (UCS-3)
  • dashboard Deployment → role: infra (UCS-4)
  • postgres StatefulSet → role: infra (UCS-4)
  • pgbouncer Deployment → role: infra (UCS-4)
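
To spot-check that a nodeSelector patch actually landed, inspect the rendered spec of one of the patched workloads:

kubectl -n web-agents get deployment web-agent \
  -o jsonpath='{.spec.template.spec.nodeSelector}{"\n"}'
# Expected: {"role":"playwright"}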

Wait for base workloads to start:

kubectl -n web-agents get pods -o wide --watch
# Ctrl+C once all pods show Running.

10. Step 7 — Apply DUT Overlay

The DUT overlay adds the macvlan NetworkAttachmentDefinitions, persona webservers, SNMP exporter, TLS PKI, and performance tuning.

7.1 Apply the DUT kustomization

kubectl apply -k k8s/dut/

7.2 Patch the agent deployments (add macvlan + CA trust)

The strategic merge patches add the net1 macvlan annotation, CA trust environment variables, and REJECT_INVALID_CERTS=true to the existing agent deployments.

kubectl patch deployment web-agent -n web-agents \
  --type=strategic \
  --patch-file=k8s/dut/40-playwright-patch.yaml

kubectl patch deployment k6-agent -n web-agents \
  --type=strategic \
  --patch-file=k8s/dut/50-k6-patch.yaml
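
To verify the patches, check for the standard Multus networks annotation on a patched pod template (the NAD name it references is defined in the patch file itself):

kubectl -n web-agents get deployment web-agent \
  -o jsonpath='{.spec.template.metadata.annotations.k8s\.v1\.cni\.cncf\.io/networks}{"\n"}'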

7.3 Apply host-level tuning on every node — REQUIRED for persona stacks

The Synthetic Persona and Cloned Persona Caddy webservers REQUIRE this tuning to deliver their target throughput. Without it, kernel UDP buffers cap QUIC at ~30 Mbps per replica and the kernel resets cwnd on every HTTP/2 idle window, adding ~1 ms RTT to every cycle.

The in-cluster node-tuning DaemonSet applies the same kernel knobs at pod start, but after a host reboot the settings are absent until the pod comes back up. The scripts/host-tuning.sh script writes the values into /etc/sysctl.d/, installs a systemd unit for the CPU governor, and (optionally) enables cpuManagerPolicy: static for exclusive-core allocation — making the DaemonSet a belt-and-braces re-applier rather than the only line of defence.

Run on EVERY UCS host that schedules a Synthetic Persona, Cloned-Persona slot, or the Cloner. In a 4-node deploy that is UCS-1 (where personas + slots run) and UCS-4 (where the Cloner + NFS server run):

# UCS-1 (role=ngfw-dut) — 20 personas + 10 cloned-persona slots
sudo scripts/host-tuning.sh apply --enable-cpu-pinning

# UCS-4 (role=infra) — Cloner + NFS server
sudo scripts/host-tuning.sh apply --enable-cpu-pinning

# Verify
sudo scripts/host-tuning.sh status

--enable-cpu-pinning toggles cpuManagerPolicy: static on the kubelet (vanilla and k3s both supported, auto-detected) and restarts the kubelet, so plan a brief maintenance window. Without the flag the script still applies sysctls, modules, governor, and THP — pinning is the only step that touches running workloads.

UCS-2 (browser engine) and UCS-3 (synthetic-load engine) only run agent pods. They benefit from the sysctls but do not need pinning. Run the script there too, without --enable-cpu-pinning:

sudo scripts/host-tuning.sh apply   # sysctls + modules + governor only
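
A few of the knobs the script sets can be spot-checked directly (scripts/host-tuning.sh status remains the authoritative report):

sysctl net.ipv4.tcp_congestion_control net.core.default_qdisc   # expect bbr / fq once applied
sysctl net.core.rmem_max net.core.wmem_max                      # enlarged socket buffers
cat /sys/kernel/mm/transparent_hugepage/enabled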

7.4 Expand node-tuning DaemonSet to data-plane nodes

The node-tuning DaemonSet (k8s/dut/85-node-tuning.yaml) defaults to nodeSelector: role=ngfw-dut. In multi-node mode it must also run on UCS-2 and UCS-3, because those nodes also carry macvlan data-plane traffic and need the same UDP buffer and TCP tuning at runtime.

kubectl -n web-agents patch daemonset node-tuning \
  --type=merge -p \
  '{"spec":{"template":{"spec":{"nodeSelector":{"role":null,"dut-data-plane":"true"}}}}}'

This changes the selector from role=ngfw-dut to dut-data-plane=true (the null entry removes the old key), which matches UCS-1, UCS-2, and UCS-3 (all labeled dut-data-plane=true during k3s install).

7.5 Remove base NetworkPolicies that conflict with macvlan

kubectl delete networkpolicy web-agent-egress k6-agent-egress \
  -n web-agents --ignore-not-found

11. Step 8 — Platform, DNS, and Personas

This corresponds to Phase 3 and Phase 4 of the scripts/k8s-dut-up.sh bring-up script.

8.1 Apply platform (PKI ClusterIssuer + Grafana dashboards)

kubectl apply -k platform/

8.2 Patch CoreDNS for persona.internal zone

The CoreDNS patch script does a safe merge so it does not overwrite the existing cluster DNS configuration.

bash platform/dns/patch-coredns.sh
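
To confirm the zone resolves from inside the cluster, a throwaway lookup pod works well (replace <persona-hostname> with one of your persona hostnames):

kubectl -n web-agents run dns-check --rm -it --restart=Never \
  --image=busybox:1.36 -- nslookup <persona-hostname>.persona.internal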

8.3 Apply persona namespaces and deployments

kubectl apply -k personas/

Personas are pinned to UCS-1 via the role=ngfw-dut nodeSelector already set in the persona manifests under personas/_generated/.

8.4 Wait for all pods to be ready

kubectl -n web-agents get pods -o wide
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

12. Verification Per Tier

Cluster-wide node placement

kubectl get pods -n web-agents -o wide --show-labels

UCS-1 — Persona webservers

# Confirm all persona pods are on UCS-1:
kubectl get pods -n web-agents -o wide \
  --field-selector spec.nodeName=ucs-1

# Confirm persona VLANs are attached:
kubectl -n web-agents exec -it \
  $(kubectl -n web-agents get pod -l app.kubernetes.io/name=caddy-dut \
    -o name | head -1) -- ip addr show net1

UCS-2 — browser-engine agents

# Confirm browser-engine pods are on UCS-2:
kubectl get pods -n web-agents -l app.kubernetes.io/name=web-agent -o wide
# NODE column should show ucs-2 for all rows.

# Confirm net1 macvlan interface is present in a pod:
kubectl -n web-agents exec -it \
  $(kubectl -n web-agents get pod -l app.kubernetes.io/name=web-agent \
    -o name | head -1) -- ip addr show net1
# Expected: 172.16.x.x/16

UCS-3 — synthetic-load agents

# Confirm synthetic-load pods are on UCS-3:
kubectl get pods -n web-agents -l app.kubernetes.io/name=k6-agent -o wide
# NODE column should show ucs-3 for all rows.

kubectl -n web-agents exec -it \
  $(kubectl -n web-agents get pod -l app.kubernetes.io/name=k6-agent \
    -o name | head -1) -- ip addr show net1
# Expected: 172.17.x.x/16

UCS-4 — Infra tier

# Dashboard, Postgres, PgBouncer must be on UCS-4:
kubectl get pods -n web-agents \
  -l 'app.kubernetes.io/name in (dashboard,postgres,pgbouncer)' -o wide
# NODE column should show ucs-4 for all rows.

# Postgres PVC stays on UCS-4 local disk:
kubectl get pvc -n web-agents
# STATUS should be Bound.

node-tuning DaemonSet

# Should run on UCS-1, UCS-2, UCS-3 (all dut-data-plane=true nodes):
kubectl -n web-agents get pods -l app.kubernetes.io/name=node-tuning -o wide
# Expected: 3 pods, one per data-plane node.

cert-manager PKI

kubectl get secret persona-ca-bundle -n web-agents
# Expected: Secret present with ca.crt key.

kubectl get clusterissuers
# Expected: persona-ca-issuer — READY=True.

Prometheus scraping

# Port-forward Prometheus UI to verify targets:
kubectl -n monitoring port-forward svc/kube-prometheus-stack-prometheus 9090:9090 &
# Open http://localhost:9090/targets
# Confirm caddy_metrics targets at :9091 show UP.

13. Troubleshooting

Pod scheduled on the wrong node

Symptom: kubectl get pods -o wide shows a pod on a different UCS server than expected.

Cause: The nodeSelector in the overlay was not applied, or the node label is missing.

# Check node labels:
kubectl get nodes --show-labels | grep -E 'role='

# Re-label if missing:
kubectl label node ucs-2 role=playwright --overwrite
kubectl label node ucs-3 role=k6 --overwrite
kubectl label node ucs-4 role=infra --overwrite

# Force pod rescheduling after label fix:
kubectl -n web-agents rollout restart deployment web-agent

VLAN interface missing on a node — pod stuck in Init or CrashLoopBackOff

Symptom: Multus annotation fails; pod logs show failed to find master interface eth1.20.

Cause: The VLAN subinterface was not created before the pod was scheduled.

# On the affected node (e.g., UCS-2):
ip link show eth1.20
# If missing:
ip link add link eth1 name eth1.20 type vlan id 20
ip addr add 172.16.0.1/16 dev eth1.20
ip link set eth1.20 up

# Delete the stuck pod so it re-creates:
kubectl -n web-agents delete pod <pod-name>

OOBI routing broken — nodes NotReady

Symptom: kubectl get nodes shows one or more nodes as NotReady.

Diagnosis:

# On UCS-1:
sudo k3s kubectl get nodes

# On the NotReady node:
sudo journalctl -u k3s-agent --no-pager -n 50
# Look for: "connection refused :6443" or "x509: certificate"

# Verify OOBI reachability:
ping -c 2 <UCS1_OOBI_IP>
curl -k https://<UCS1_OOBI_IP>:6443/healthz

Common fix: The k3s agent joined with the wrong K3S_URL (using the data-plane IP instead of the OOBI IP). Uninstall and rejoin:

# On the affected node:
/usr/local/bin/k3s-agent-uninstall.sh

# Re-join with the correct OOBI IP:
curl -sfL https://get.k3s.io | \
  K3S_URL=https://<UCS1_OOBI_IP>:6443 \
  K3S_TOKEN=<TOKEN> \
  INSTALL_K3S_EXEC="--node-label=role=<role> --flannel-iface=eth0" sh -

PVC stuck in Pending

Symptom: kubectl get pvc -n web-agents shows STATUS=Pending.

Cause: k3s local-path StorageClass creates PVs on the node where the pod is first scheduled. If Postgres has no nodeSelector pointing to UCS-4, the PVC may be claimed on a different node and become stuck when Postgres moves.

# Confirm the multi-node overlay patch is applied:
kubectl get statefulset postgres -n web-agents \
  -o jsonpath='{.spec.template.spec.nodeSelector}'
# Expected: {"role":"infra"}

# If not, re-apply the overlay:
kubectl apply -k k8s/overlays/multi-node/

# If PVC is already bound to the wrong node, delete and recreate:
kubectl -n web-agents delete pvc <pvc-name>
kubectl apply -k k8s/overlays/multi-node/

Macvlan pod cannot reach NGFW (no HTTPS responses)

Symptom: browser engine or synthetic-load agents report connection timeouts on net1.

Diagnosis checklist:

# 1. Confirm net1 IP is assigned:
kubectl -n web-agents exec -it <agent-pod> -- ip addr show net1

# 2. Confirm NGFW gateway is reachable from the pod:
kubectl -n web-agents exec -it <agent-pod> -- ping -c 2 172.16.0.1  # browser-engine
kubectl -n web-agents exec -it <agent-pod> -- ping -c 2 172.17.0.1  # synthetic-load

# 3. Confirm the VLAN subinterface is UP on the host:
ip link show eth1.20    # On UCS-2
ip link show eth1.30    # On UCS-3

# 4. Confirm the Nexus switch has the VLAN allowed on the trunk port:
# On Nexus: show interfaces trunk | include eth1/X

node-tuning DaemonSet not running on UCS-2 or UCS-3

Symptom: kubectl -n web-agents get pods -l app.kubernetes.io/name=node-tuning -o wide shows the DaemonSet only on UCS-1.

Cause: The DaemonSet nodeSelector was not patched after kubectl apply -k k8s/dut/.

# Check current nodeSelector:
kubectl -n web-agents get daemonset node-tuning \
  -o jsonpath='{.spec.template.spec.nodeSelector}'

# If it shows {"role":"ngfw-dut"}, apply the multi-node patch:
kubectl -n web-agents patch daemonset node-tuning \
  --type=merge -p \
  '{"spec":{"template":{"spec":{"nodeSelector":{"role":null,"dut-data-plane":"true"}}}}}'

14. Day-to-Day Operations

Scale browser-engine agents

kubectl -n web-agents scale deployment web-agent --replicas=<N>
# Maximum: 300 replicas (hardware-limited by UCS-2 CPU/RAM and VLAN 20 /16 space)

Scale synthetic-load agents

kubectl -n web-agents scale deployment k6-agent --replicas=<N>
# Maximum: 1000 replicas (hardware-limited by UCS-3 and HPA maxReplicas)

Roll all pods on a tier

kubectl -n web-agents rollout restart deployment web-agent  # UCS-2
kubectl -n web-agents rollout restart deployment k6-agent   # UCS-3
kubectl -n web-agents rollout restart deployment dashboard  # UCS-4

Check resource usage per node

kubectl top nodes
kubectl top pods -n web-agents --sort-by=cpu

Tear down the DUT overlay (keep base stack)

# Remove DUT resources:
kubectl delete -k k8s/dut/ --ignore-not-found

# Restore base NetworkPolicies:
kubectl apply -f k8s/30-agent-network-policy.yaml
kubectl apply -f k8s/31-k6-agent-network-policy.yaml

# The macvlan patches remain in the live Deployment specs until the deployments
# are re-created from the base manifests; at minimum, restart them:
kubectl -n web-agents rollout restart deployment web-agent k6-agent

Full teardown

kubectl delete namespace web-agents
kubectl delete clusterissuer persona-ca-issuer --ignore-not-found
kubectl delete -k platform/ --ignore-not-found
kubectl delete -k personas/ --ignore-not-found

Rotate Postgres password

kubectl -n web-agents create secret generic postgres-credentials \
  --from-literal=POSTGRES_USER="agent_dashboard" \
  --from-literal=POSTGRES_PASSWORD="<new-password>" \
  --from-literal=POSTGRES_DB="agent_dashboard" \
  --dry-run=client -o yaml | kubectl apply -f -

kubectl -n web-agents create secret generic dashboard-secrets \
  --from-literal=DATABASE_URL="postgresql://agent_dashboard:<new-password>@postgres.web-agents.svc.cluster.local:5432/agent_dashboard?sslmode=require" \
  --from-literal=AGENT_API_TOKEN="<token>" \
  --from-literal=ADMIN_BASIC_AUTH="admin:<admin-password>" \
  --from-literal=SESSION_SECRET="<session-secret>" \
  --dry-run=client -o yaml | kubectl apply -f -

kubectl -n web-agents rollout restart deployment dashboard pgbouncer
kubectl -n web-agents rollout restart statefulset postgres
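
The official postgres image only applies POSTGRES_PASSWORD when it initialises an empty data directory, so on an existing database the role password may also need to be changed in place. A minimal sketch, assuming the default role and database names from step 5.4 (adjust if yours differ):

kubectl -n web-agents exec statefulset/postgres -- \
  psql -U agent_dashboard -d agent_dashboard \
  -c "ALTER USER agent_dashboard WITH PASSWORD '<new-password>';"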

Renew NGFW CA certificate

# Export new CA from NGFW, then:
kubectl -n web-agents create configmap ngfw-ca \
  --from-file=ngfw-ca.crt=./ngfw-ca.crt \
  --dry-run=client -o yaml | kubectl apply -f -

# Stakater Reloader triggers an automatic rolling restart of pods that
# reference the ngfw-ca ConfigMap. Verify:
kubectl -n web-agents get pods -w

View Grafana dashboards

# Grafana is exposed on NodePort 30300 on UCS-4:
# http://<UCS4_OOBI_IP>:30300
# Default credentials: admin / prom-operator (change immediately)

15. Reference — Overlay Files

k8s/overlays/multi-node/kustomization.yaml

The root overlay file. Inherits the k8s/ base (all resources) and applies node placement patches for each workload tier.

k8s/overlays/multi-node/
├── kustomization.yaml            # bases: ../../  (k8s/), lists all patches
├── patch-playwright-nodesel.yaml # nodeSelector: role=playwright → UCS-2
├── patch-k6-nodesel.yaml         # nodeSelector: role=k6        → UCS-3
└── patch-infra-nodesel.yaml      # nodeSelector: role=infra     → UCS-4
                                  # reused for dashboard, postgres, pgbouncer

Patch files

patch-playwright-nodesel.yaml — applied to Deployment/web-agent:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-agent
  namespace: web-agents
spec:
  template:
    spec:
      nodeSelector:
        role: playwright

patch-k6-nodesel.yaml — applied to Deployment/k6-agent:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: k6-agent
  namespace: web-agents
spec:
  template:
    spec:
      nodeSelector:
        role: k6

patch-infra-nodesel.yaml — applied to Deployment/dashboard, StatefulSet/postgres, Deployment/pgbouncer:

spec:
  template:
    spec:
      nodeSelector:
        role: infra

Node labels applied at join time

| Label | Nodes | Purpose |
|---|---|---|
| role=ngfw-dut | UCS-1 | Targets persona webserver pods and node-tuning (original selector) |
| role=playwright | UCS-2 | Targets browser-engine agent pods |
| role=k6 | UCS-3 | Targets synthetic-load agent pods |
| role=infra | UCS-4 | Targets Dashboard, Postgres, PgBouncer |
| dut-data-plane=true | UCS-1, UCS-2, UCS-3 | Targets the node-tuning DaemonSet after the multi-node patch |

DUT overlay files (unchanged from single-node)

| File | Purpose |
|---|---|
| k8s/dut/10-ngfw-ca.yaml | ConfigMap with the NGFW CA (trusted by agents via NODE_EXTRA_CA_CERTS) |
| k8s/dut/20-network-attachments.yaml | Multus NADs — dut-pw, dut-k6, dut-mgmt (routes 10.1.0.0/16 + 10.2.0.0/16 via the NGFW) |
| k8s/dut/40-playwright-patch.yaml | Strategic merge patch — adds net1 + CA trust to the browser engine |
| k8s/dut/50-k6-patch.yaml | Strategic merge patch — adds net1 + CA trust to the synthetic-load engine |
| k8s/dut/60-snmp-exporter.yaml | SNMP exporter for NGFW metrics |
| k8s/dut/85-node-tuning.yaml | DaemonSet — sysctls, BBR+FQ, CPU governor, THP |
| platform/pki/ | persona-ca-issuer ClusterIssuer — signs certs for all 20 Synthetic + 10 Cloned Persona slots |
| personas/ | 20 Synthetic Persona namespaces — Caddy + cert (VLANs 101–120) |
| k8s/clone-personas/ | 10 Cloned Persona slots — Caddy file_server (VLANs 200–209) |

TLSStress.Art v3.7.0 — André Luiz Gallon — agallon@Cisco.com


Public Website Cloner + cross-node storage (multi-node)

In a 4-server multi-node cluster the cloner runs on UCS-4 (role=infra), co-located with the Dashboard. The 10 cloned-persona slots run on UCS-1 (role=ngfw-dut) alongside the synthetic personas. These two tiers live on different physical machines, which means the cloned content has to traverse a real network path between writer and readers.

Storage architecture (read this first if you are deploying multi-node)

The shared cloned-sites volume is backed by an in-cluster NFS server (k8s/dut/35-nfs-server.yaml) that is scheduled on role=infra (UCS-4) — the same node as the cloner. Eleven static PV/PVC pairs (k8s/dut/36-cloned-sites-pvs.yaml) — one read-write writer in web-agents, ten read-only readers in clone-persona-1..10 — all bind to the same NFS export.

UCS-4 (role=infra):    [cloner pod]   [nfs-server pod]
                            │              │
                            │ writes       │ /var/lib/agent-cluster/cloned-sites
                            │ via PVC      │ (hostPath, persistent)
                            ▼              ▼
                       NFSv4 over OOBI (eth0, ClusterIP Service:2049)
                            ▲
                            │ reads, read-only PVC
                            │
UCS-1 (role=ngfw-dut): [clone-persona-1..10]

Critical points:

  • All NFS traffic uses the OOBI control plane (eth0 / pod network → ClusterIP nfs-server.web-agents.svc.cluster.local). The data plane (net1 macvlan VLAN 40 ISP and VLANs 200–209 slot subnets) is never used for storage I/O — it stays exclusive for agent ↔ NGFW ↔ persona traffic so timing measurements remain uncontaminated.
  • The NFS server's underlying disk is the hostPath on UCS-4. Sizing depends on UCS-4's available storage, not on PVC capacity (the PVCs request 50 GiB but the server enforces nothing).
  • The cloner's nodeAffinity (role=infra) and the NFS server's preferredAffinity (role=infra) MUST match. If the cloner is moved to a different node, move the NFS server as well — otherwise every write crosses the OOBI network unnecessarily and competes with reads for bandwidth.
  • The CNI DHCP daemon DaemonSet (k8s/dut/30-cni-dhcp-daemon.yaml) runs on every node — it is required on UCS-4 (where the cloner is) so that VLAN 40 IPAM can call the dhcp helper via /run/cni/dhcp.sock.
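
A quick way to confirm the static bindings described above (the exact PV/PVC names live in k8s/dut/36-cloned-sites-pvs.yaml; the grep pattern below is an assumption about their naming):

kubectl get pv | grep -i cloned
kubectl get pvc -A | grep -i cloned
# Expected: 11 Bound claims — 1 writer in web-agents, 10 readers in clone-persona-1..10.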

Goal

The cloner downloads real public sites via VLAN 40 (direct internet, no NGFW in the path) and writes them to NFS. The 10 cloned-persona slot pods on UCS-1 then mount the same NFS export read-only and serve the captured content through Caddy. The agent fleet hits those slots through the NGFW, exercising true TLS inspection with content from real-world sites.

Network — VLAN 40 on UCS-4

The cloner uses VLAN 40 on the eth1 trunk of UCS-4. Prerequisites on the Nexus 9000:

# On the Nexus 9000 — allow VLAN 40 on the UCS-4 trunk
conf t
  vlan 40
    name cloner-isp-egress
  interface Ethernet1/<N>       ! trunk port to UCS-4
    switchport trunk allowed vlan add 40

Install on UCS-4

k8s-install.sh with role=infra creates eth1.40 automatically:

# On UCS-4
sudo ./scripts/k8s-install.sh --mode=multi-agent --role=infra \
  --server-ip=<UCS-1-IP> --token=<JOIN_TOKEN>
# If your trunk is not eth1, use: --data-iface=<name>
# To use a different VLAN, use: --isp-iface=<iface.VLAN>

Verify the cloner on UCS-4

kubectl get pod -n web-agents -l app.kubernetes.io/name=cloner -o wide
# NODE should show the UCS-4 hostname

# Verify VLAN 40 and internet access
kubectl exec -n web-agents deploy/cloner -- ip addr show net1
kubectl exec -n web-agents deploy/cloner -- ping -c 3 8.8.8.8

HTTPS — no additional configuration

The cloner validates public sites' TLS certificates using Debian's Mozilla/NSS bundle. No additional CA needs to be configured for sites whose certificates are issued by public CAs (Let's Encrypt, DigiCert, etc.).

Internet health in Grafana

The TLSStress.Art dashboard shows:

  • Stat panel "Acesso Internet (ISP)" (green/red)
  • Stat panel "Gateway ISP" (green = the DHCP gateway answers ICMP)
  • Time series "Ping RTT Gateway ISP" with the 3 targets

Alerts fire if both pings fail for more than 2 minutes (ClonerInternetDown — critical).

Create a clone job

curl -s -X POST https://<dashboard-hostname>/api/clone/jobs \
  -H "Authorization: Basic $(echo -n 'admin:<ADMIN_PASS>' | base64)" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com","personaName":"shop"}'

Quick troubleshooting

| Symptom | Check |
|---|---|
| Cloner pod on the wrong node | kubectl get pod -o wide → NODE must be UCS-4 |
| net1 has no IP | VLAN 40 allowed on the Nexus trunk; eth1.40 present on UCS-4; CNI DHCP daemon Ready: kubectl get ds cni-dhcp-daemon -n web-agents |
| No internet | kubectl exec deploy/cloner -- ping -c 3 -I net1 8.8.8.8 |
| DNS not resolving | kubectl exec deploy/cloner -- cat /etc/resolv.conf |
| Slot pods stuck in MountVolume.SetUp failed | NFS server not Ready or not on the cloner's node — kubectl -n web-agents get pod -l app.kubernetes.io/name=nfs-server -o wide |
| Slot pods Ready but serving empty content | The cloner wrote to an NFS server on a DIFFERENT node than where the cloner now runs (rare race; restart the NFS server first, then the cloner) |
| cloned-sites PVC Pending in any namespace | kubectl describe pv — confirm static binding; the PVs are in k8s/dut/36-cloned-sites-pvs.yaml |

See also

  • docs/CLONER.md — full architecture reference (storage, network, lifecycle)
  • docs/CLONER_OPERATIONS.md — operations and troubleshooting playbook
  • k8s/80-cloner-nad.yaml — NAD ISP (master: eth1.40, VLAN 40)
  • k8s/dut/30-cni-dhcp-daemon.yaml — CNI DHCP daemon DaemonSet (required for VLAN 40 IPAM)
  • k8s/dut/35-nfs-server.yaml — in-cluster NFS server (cross-node cloned-sites over OOBI)
  • k8s/dut/36-cloned-sites-pvs.yaml — 11 static PV/PVC pairs (1 writer + 10 slot readers)
  • overlays/multi-node/kustomization.yaml — pins the cloner to role=infra