
Runbook — First real installation of TLSStress.Art


Scope status (post-Scope-Freeze 2026-05-10) — See ARCHITECTURE.md for the canonical 37 MÓDULOs + 7 Test Kinds + DOM/CPOS/PIE-PA safety architecture. ADRs 0014, 0019-0025 cover post-Freeze additions.

Audience: Operator who has been granted access via the Access Broker and is about to install TLSStress.Art on real hardware for the first time.

Onboarding sequence: Access → Clone → Install (you are here) · alternate: Air-gap install · maintainer setup: Private Repo Setup

Time budget: ~2 hours total. Each step is timed.

Why this runbook exists

The TLSStress.Art codebase has green CI, passing unit tests, and clean linting — but no installation has been validated end-to-end against a real Cisco Nexus 9000 + NGFW DUT before the one you are about to perform. This runbook is the protocol that takes you from "fresh UCS + Nexus" to "first measured p99 in a Grafana panel" in a controlled, reproducible way.

If any step fails, stop and investigate rather than skipping ahead. Failures at each step have specific recovery procedures linked at the end.

Prerequisites — before step 1

  • Repository access granted via Access Broker (see ACCESS_REQUEST.md)
  • License accepted through the Dashboard's License Acceptance Modal (will appear on first login)
  • One Ubuntu 22.04+ UCS host with at least 32 GB RAM, 16 vCPU, 200 GB disk
  • One Cisco Nexus 9000 with management access via SSH and trunk port available
  • One NGFW DUT (Cisco FTD/ASA/Firepower, Palo Alto, Fortinet, or similar) with management IP reachable from the UCS, and trunk-attached to the Nexus
  • Network IPs reserved:
    • UCS OOBI (eth0): your management network (e.g. 192.168.90.10/24)
    • Nexus management IP for SNMP scrape: e.g. 192.168.90.2
    • NGFW management IP: e.g. 192.168.90.3
    • VLAN 20 (browser-engine agents): 172.16.0.0/16
    • VLAN 30 (synthetic-load agents): 172.17.0.0/16
    • VLANs 101–120 (Synthetic personas): 10.1.{1..20}.0/27, gateway = NGFW
    • VLANs 200–209 (Cloned personas): 10.2.{1..10}.0/27, gateway = NGFW
  • NGFW CA certificate in PEM format, available locally (you will inject it as a ConfigMap)
  • A .env file at the repo root with SNMP_NEXUS_HOST, SNMP_NGFW_HOST, SNMP_COMMUNITY, SNMP_DUT_MODULE (one of cisco_asa, cisco_firepower, palo_alto, fortinet, generic)
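
A minimal .env sketch for that last item (the values below are illustrative placeholders; substitute your lab addressing and community string):

# .env at the repo root (illustrative values only)
SNMP_NEXUS_HOST=192.168.90.2
SNMP_NGFW_HOST=192.168.90.3
SNMP_COMMUNITY=public              # replace with your lab's read-only community string
SNMP_DUT_MODULE=cisco_firepower    # one of: cisco_asa, cisco_firepower, palo_alto, fortinet, generic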

If any of these are missing, do not proceed. The runbook depends on all of them.


Step 1 — Lab pre-flight (60–90 minutes)

Goal: confirm the cluster comes up healthy with all 20 Synthetic personas running, before any DUT involvement.

1.1 Clone the repository onto the UCS

Authenticated clone — see docs/CLONE_FOR_INSTALL.md for the four authentication options. Recommended for permanent installs: deploy key SSH (option B).

ssh ucs-1.example.com
mkdir -p ~/tlsstress && cd ~/tlsstress
# Option A — gh CLI (recommended for first install)
sudo apt update && sudo apt install -y gh
gh auth login   # follow prompts
gh repo clone nollagluiz/AI_forSE .
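
If this install is permanent, option B (deploy key SSH) keeps the clone independent of any personal gh login. A rough sketch, assuming the public key is registered as a read-only deploy key on the repository; the exact procedure is in docs/CLONE_FOR_INSTALL.md:

# Option B: deploy key SSH (sketch; follow docs/CLONE_FOR_INSTALL.md for the authoritative steps)
ssh-keygen -t ed25519 -f ~/.ssh/tlsstress_deploy -N ""
cat ~/.ssh/tlsstress_deploy.pub    # register this as a read-only deploy key on the repo
GIT_SSH_COMMAND="ssh -i ~/.ssh/tlsstress_deploy -o IdentitiesOnly=yes" \
  git clone git@github.com:nollagluiz/AI_forSE.git .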

Expected: directory contains dashboard/, agent/, personas/, k8s/, platform/, docs/, etc. Total clone size around 80 MB.

1.2 Install K3s + Multus + cert-manager

sudo bash scripts/k8s-install.sh single-node

The script:

  • Installs K3s
  • Adds Multus CNI
  • Adds cert-manager
  • Labels the node role=ngfw-dut and dut-data-plane=true
  • Sets up persistent volume hostpath if needed

Expected (last few lines):

✅ K3s up
✅ Multus installed
✅ cert-manager ready
✅ node labels applied

Verify:

kubectl get nodes
# expect: 1 node, status Ready, with labels visible via:
kubectl get nodes -o wide --show-labels | grep dut-data-plane

1.3 Apply the base stack

kubectl apply -f k8s/
kubectl apply -k platform/        # cert-manager Issuers + persona-ca-issuer
kubectl apply -k personas/        # 20 Synthetic personas

Wait for personas to become Ready (this takes 2–5 minutes the first time as the seeder generates content):

watch -n 5 'kubectl get pods -A | grep persona-'
# proceed when 20 pods are 1/1 Running with status Ready
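
If you prefer a command that blocks instead of watching, kubectl wait does the same check. The label selector below is an assumption; substitute whatever labels the persona pods actually carry:

# blocks until every matching pod is Ready, or fails after 10 minutes
# NOTE: app=persona is an assumed label selector, not confirmed by the manifests
kubectl wait pod -A -l app=persona --for=condition=Ready --timeout=10m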

If a persona stays in Init:0/1 for more than 5 minutes, see Step 1 troubleshooting below.

1.4 Confirm the Dashboard is reachable

kubectl get svc -n web-agents dashboard
# note the ClusterIP or NodePort

# from the UCS:
curl -sI http://<dashboard-ip>:3000 | head -5
# expect: HTTP/1.1 302 Found  (redirect to /login)

# or via NodePort if you exposed one:
curl -sI http://<UCS-IP>:<NodePort>/login

Open the dashboard in a browser. Expected on first load:

  • License Acceptance Modal appears (intercept). Fill role + Cisco email + acknowledgements. Accept.
  • Dashboard renders with empty state — no agents yet.
  • TLSStress.Art wordmark in the top-left, license footer at the bottom of every page.

1.5 Confirm Prometheus + Grafana stack

kubectl get pods -n observability
# expect: prometheus-0, grafana-XXX, loki-XXX, alertmanager-XXX all Running

Open Grafana (default :3001). Log in with admin / admin.

Expected:

  • Welcome screen says "TLSStress.Art" (after PR #190 merges)
  • Dashboards list shows all 10 dashboards prefixed with "TLSStress.Art —"
  • "TLSStress.Art — Personas — Overview" shows up to 30 personas alive (20 Synthetic always; 10 Cloned slots when active)

1.6 Step 1 checklist before proceeding

  • All 20 Synthetic personas in 1/1 Running
  • Dashboard reachable, License Acceptance accepted, locale rendering correctly
  • Grafana shows 10 dashboards, none "No data"
  • Prometheus is scraping all targets (Status → Targets, all UP; see the sketch after this list)
  • No CrashLoopBackOff in kubectl get pods -A
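
The Prometheus target check can also be scripted against its HTTP API (a sketch; substitute your Prometheus service address):

# print any scrape target that is not healthy; empty output means all targets are UP
curl -s http://<prometheus-ip>:9090/api/v1/targets | \
  jq -r '.data.activeTargets[] | select(.health != "up") | "\(.scrapePool) \(.scrapeUrl) \(.health)"'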

If any item fails — do not proceed.


Step 2 — DUT connection (30 minutes)

Goal: route data-plane traffic through the NGFW, with TLS decrypt active.

2.1 Apply Nexus 9000 tuning

# From a machine with reachability to the Nexus management IP:
ssh admin@<nexus-mgmt-ip>
# In NX-OS exec mode:
configure terminal
# Then paste the contents of scripts/nexus/01-apply-tuning.nxos
# Expected outputs vary; the script disables EEE, sets MTU 9216, applies QoS DSCP AF41,
# and configures ECMP hash with UDP port. Confirm with:
show running-config interface Eth1/1
# (Eth1/1 is the trunk port to the UCS — replace with your interface)

Expected: trunk interface shows mtu 9216, flowcontrol receive off, flowcontrol send off.

Verify VLANs are trunked:

show vlan brief
# expect VLANs 20, 30, 99, 101–120, 200–209 in active state

2.2 Inject NGFW CA into the cluster

# On the UCS (where you cloned the repo):
kubectl create configmap ngfw-ca \
  --from-file=ca.crt=/path/to/ngfw-ca.pem \
  -n web-agents \
  --dry-run=client -o yaml | kubectl apply -f -

Expected: configmap/ngfw-ca created (or configured).
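
Optionally, sanity-check that the file really is a parseable CA certificate (assuming openssl is available on the UCS):

# print the subject and expiry of the NGFW CA; an error here means the PEM is malformed
openssl x509 -in /path/to/ngfw-ca.pem -noout -subject -enddate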

2.3 Apply the DUT overlay

kubectl apply -k k8s/dut/

This applies:

  • 10-ngfw-ca.yaml — confirms the ConfigMap is in place
  • 20-network-attachments.yaml — Multus NAD definitions (VLANs 20, 30, 99) with routes pointing at NGFW
  • 60-snmp-exporter.yaml — SNMP scrape job for the NGFW
  • 70-network-policy-dut.yaml — NetworkPolicy DUT mode
  • 85-node-tuning.yaml — host sysctls (UDP/TCP buffers, BBR, FQ)
  • 15-tls-decrypt-probe.yaml — independent TLS decrypt verification probe (PR #180)
  • 95-deployment-mode.yaml — deployment-mode ConfigMap (set to single-node)
  • 97-deployment-mode-rules.yaml + 98-topology-correlation.yaml + 99-slo-and-anomaly-rules.yaml — recording rules + alerts
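
To confirm that every resource listed above actually landed, kubectl can read them back straight from the kustomization (a quick sketch):

# read back each resource defined by the overlay and show its current state
kubectl get -k k8s/dut/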

2.4 Patch the agent deployments

kubectl patch deployment web-agent -n web-agents \
  --type=strategic-merge-patch \
  --patch-file=k8s/dut/40-playwright-patch.yaml

kubectl patch deployment k6-agent -n web-agents \
  --type=strategic-merge-patch \
  --patch-file=k8s/dut/50-k6-patch.yaml

These patches add the net1 macvlan interface, the NGFW CA trust, and REJECT_INVALID_CERTS=true. Pods restart after the patch.

Wait for both deployments to be Ready:

kubectl rollout status deployment web-agent -n web-agents
kubectl rollout status deployment k6-agent -n web-agents
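
After the rollout, it is worth confirming that the macvlan interface really exists inside the patched pods (a sketch; it assumes the agent images ship iproute2):

# expect a net1 entry on each agent, addressed from the agent VLANs reserved earlier
kubectl exec -n web-agents deploy/web-agent -- ip -brief addr show net1
kubectl exec -n web-agents deploy/k6-agent -- ip -brief addr show net1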

2.5 Verify the TLS Decrypt Probe agrees

kubectl logs -n web-agents -l app=tls-decrypt-probe --tail=20
# expect: "decrypt mode: on" or "decrypt mode: off" — never "unknown"

Open Grafana → "TLSStress.Art — TLS Decrypt Mode — NGFW Inspection Verification". The metric web_agent_tls_decrypt_active should be 1 (active) — meaning the NGFW is decrypting TLS successfully.
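
If Grafana is not handy, the same signal can be read straight from Prometheus (a sketch, reusing the metric name from the dashboard above):

# 1 = NGFW is decrypting, 0 = not decrypting; no result means the probe is not being scraped
curl -s 'http://<prometheus-ip>:9090/api/v1/query?query=web_agent_tls_decrypt_active' | \
  jq '.data.result[] | {instance: .metric.instance, value: .value[1]}'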

If the probe says 0 or "unknown", do not proceed — your NGFW is not decrypting (or the CA is wrong, or the route is missing). See Step 2 troubleshooting.

2.6 Step 2 checklist

  • Nexus shows VLANs trunked, MTU 9216, flow-control off
  • NGFW CA ConfigMap applied
  • DUT overlay applied — all k8s/dut/ resources are Created
  • Both agent patches applied, deployments rolled out
  • TLS Decrypt Probe reports active (Grafana shows 1)
  • No CrashLoopBackOff after the DUT overlay

Step 3 — Smoke test (15 minutes)

Goal: run the lightest plan (BASELINE-SMOKE-5M) and confirm the data flows through the entire pipeline.

3.1 Identify available plans

curl -sS http://<dashboard-ip>:3000/api/test-plans/catalog | jq -r '.plans[]|.identifier'

Expected output: 15 lines starting with BASELINE-SMOKE-5M, BASELINE-SLO-30M, etc.

3.2 Trigger the smoke test

For Phase 1 (current state), the trigger is via Dashboard UI. For automation, see the API documented in TEST_PLANS.md.

Browse to Dashboard → Test Plans → BASELINE-SMOKE-5M → Run. Confirm the run starts.

3.3 Watch the run live

In Grafana, open "TLSStress.Art — Test Plan Execution — Current Run" (after PR #190 merges) or "Test Plan Execution — Current Run". Watch:

  • Active phase counter advances through the plan phases
  • Fleet target vs actual matches the plan
  • p99 latency starts populating

In the Dashboard, Runs page should show entries appearing.

3.4 Verify the runs table is populating

kubectl exec -n web-agents <postgres-pod> -- psql -U postgres -d dashboard \
  -c "SELECT COUNT(*) FROM runs WHERE started_at > now() - interval '5 minutes';"
# expect: > 0 (depends on cycle interval; expect ~50–500 for a 5-min smoke)

3.5 Wait for plan completion

BASELINE-SMOKE-5M runs for 5 minutes. After completion, the test_run_executions row gets endedAt and outcome set; see the check below.
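
A quick CLI check for completion (a sketch; only the table name, endedAt, and outcome come from this runbook, and the quoted camelCase column assumes the schema stores it that way):

# expect the most recent row to have a non-null endedAt and outcome = completed
kubectl exec -n web-agents <postgres-pod> -- psql -U postgres -d dashboard \
  -c 'SELECT outcome, "endedAt" FROM test_run_executions ORDER BY "endedAt" DESC NULLS LAST LIMIT 3;'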

3.6 Step 3 checklist

  • Plan started without errors
  • Grafana panels show live data during the run
  • runs table populated with > 50 rows
  • No agent in CrashLoopBackOff during the run
  • Plan completes (status completed in the executions table)
  • TLS Decrypt Probe stayed active throughout

Step 4 — First real measurement (35 minutes)

Goal: a 30-minute SLO-target run that produces a usable result.

4.1 Run BASELINE-SLO-30M

Same trigger pattern as the smoke test. Plan: BASELINE-SLO-30M. Duration: 30 minutes. SLO target: p99 ≤ 500 ms.

4.2 During the run — monitor

Open Grafana side-by-side:

  • "TLSStress.Art — Fleet Status — Deployment Aware" — confirm fleet at target
  • "TLSStress.Art — SLO + Burn-Rate + Anomaly Detection" — confirm no fast-burn alerts firing
  • "TLSStress.Art — Test-Bed Infrastructure Health" — confirm UCS is not the bottleneck (CPU < 70%, mem stable)
  • "TLSStress.Art — Topology Correlation — Where Is the Bottleneck" — should show the NGFW (DUT) as the constrained component, not your agents or personas

If your agents or personas saturate before the NGFW does, the test bed itself is the bottleneck and the result is meaningless. Stop and revisit Step 1 (size up the UCS or split topology to dual/tri/multi-node).

4.3 After the run — collect

The test_run_executions row now has endedAt and outcome. Hit GET /api/test-runs/<exec-id>/report.json and you have the structured report data (after PR #185 merges). For Phase 1 you can also browse /runs/<exec-id>/report for a print-styled HTML page.
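
For example (a sketch; the endpoint path is the one above, and jq just pretty-prints whatever the report contains):

# fetch the structured report for a finished execution and pretty-print it
curl -sS http://<dashboard-ip>:3000/api/test-runs/<exec-id>/report.json | jq .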

4.4 Compare to vendor expectation

Take the observed p99 and compare to:

  • The vendor's published throughput specs for the NGFW model
  • Any prior benchmarks the customer has on file
  • The plan's slo_target_p99_seconds (the catalog declares the target)

A good first run lands within ±30% of the vendor's spec for matching traffic shape. Larger drift means: NGFW config issue, network bottleneck, or a real performance gap worth investigating.

4.5 Step 4 checklist

  • Plan completed successfully (outcome: completed, not aborted)
  • p99 result available and reasonable (not absurd values)
  • No alerts fired indicating test-bed self-bottleneck
  • TLS Decrypt Probe stayed active throughout
  • Report data accessible via API + UI
  • First measurement documented — capture the run-id, plan-snapshot SHA-256, and observed p99 in your engagement notes

Step 1 troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Personas stuck Init:0/1 for > 5 min | persona-seeder image not pulled or seeder crashes | kubectl describe pod <persona-pod> — check init-container logs |
| Personas Pending forever | Multus NAD not ready or PV not bound | kubectl get networkattachmentdefinitions -A — verify NADs created |
| cert-manager-webhook timeouts | webhook slow on first install | Wait 90 s, retry. If persistent: kubectl rollout restart deployment cert-manager-webhook -n cert-manager |
| Dashboard 502 from ingress | Postgres not ready | kubectl logs -n web-agents postgres-0 — wait for "database system is ready to accept connections" |

Step 2 troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| TLS Decrypt Probe says unknown | NGFW unreachable on data plane, or CA wrong | Check kubectl logs -l app=tls-decrypt-probe, verify routes via kubectl exec into a persona and ip route, confirm NGFW management is up |
| TLS Decrypt Probe says 0 (off) | NGFW present but not decrypting | Check NGFW decrypt-policy is enabled, points to the right cert, applies to the relevant interfaces |
| Macvlan not coming up | Host network MTU mismatch or VLAN not trunked | nmcli / ip a on host + show interface on Nexus — MTU 9216 both sides |
| Persistent connection refused from agent | Persona DNS resolves to NGFW IP but route not via macvlan | Verify route in agent: kubectl exec <agent> -- ip route get 10.1.5.10 should show net1 |

Step 3 / Step 4 troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| runs table empty during a run | Agents are not running cycles | kubectl logs -l app=web-agent — look for register failures or controller-token mismatch |
| Grafana panel "No data" mid-run | scrape interval > query window, or query mistake | Wait 30 s; if persistent, re-import dashboard JSON |
| p99 spike to absurd value (e.g. 30 s) | Some persona DNS unresolvable, agent times out | Check kubectl logs <persona-pod> and the persona Service DNS |
| Plan status aborted | Operator stopped run, or fleet readiness alert fired | Read execution row's outcome field; check Alertmanager |

After your first successful run

  1. Save the engagement notes — run-id, plan-snapshot SHA-256, observed p99, NGFW model + version, observed throughput.
  2. Tag the cluster if this was a significant baseline:
    kubectl annotate cluster <cluster> \
      tlsstress.art/baseline-run="<run-id>" \
      tlsstress.art/baseline-plan="BASELINE-SLO-30M" \
      tlsstress.art/baseline-p99-ms="<value>"
    
  3. Choose the next plan based on engagement objectives:
    • Find capacity → CAP-FIND-KNEE-30M
    • Sustained at 90% → CAP-MAX-1H
    • Endurance → SOAK-ENDURANCE-24H
    • Failure modes → STR-OVERLOAD-15M

See TEST_PLANS.md for plan selection guidance.