
Runbook — First real installation of TLSStress.Art


Scope status (post-Scope-Freeze 2026-05-10) — See ARCHITECTURE.md for the canonical 37 MÓDULOs + 7 Test Kinds + DOM/CPOS/PIE-PA safety architecture. ADRs 0014, 0019-0025 cover post-Freeze additions.

Audience: Operator who has been granted access via the Access Broker and is about to install TLSStress.Art on real hardware for the first time.

Onboarding sequence: Access → Clone → Install (you are here) · alternate: Air-gap install · maintainer setup: Private Repo Setup

Time budget: ~2 hours total. Each step is timed.

Why this runbook exists

The TLSStress.Art codebase has green CI, passing unit tests, and clean linting — but no installation has been validated end-to-end against a real Cisco Nexus 9000 + NGFW DUT before the one you are about to perform. This runbook is the protocol that takes you from "fresh UCS + Nexus" to "first measured p99 in a Grafana panel" in a controlled, reproducible way.

If any step fails, stop and investigate rather than skipping ahead. Failures at each step have specific recovery procedures linked at the end.

Prerequisites — before step 1

  • Repository access granted via Access Broker (see ACCESS_REQUEST.md)
  • License accepted through the Dashboard's License Acceptance Modal (will appear on first login)
  • One Ubuntu 22.04+ UCS host with at least 32 GB RAM, 16 vCPU, 200 GB disk
  • One Cisco Nexus 9000 with management access via SSH and trunk port available
  • One NGFW DUT (Cisco FTD/ASA/Firepower, Palo Alto, Fortinet, or similar) with management IP reachable from the UCS, and trunk-attached to the Nexus
  • Network IPs reserved:
    • UCS OOBI (eth0): your management network (e.g. 192.168.90.10/24)
    • Nexus management IP for SNMP scrape: e.g. 192.168.90.2
    • NGFW management IP: e.g. 192.168.90.3
    • VLAN 20 (browser-engine agents): 172.16.0.0/16
    • VLAN 30 (synthetic-load agents): 172.17.0.0/16
    • VLANs 101–120 (Synthetic personas): 10.1.{1..20}.0/27, gateway = NGFW
    • VLANs 200–209 (Cloned personas): 10.2.{1..10}.0/27, gateway = NGFW
  • NGFW CA certificate in PEM format, available locally (you will inject it as a ConfigMap)
  • A .env file at the repo root with SNMP_NEXUS_HOST, SNMP_NGFW_HOST, SNMP_COMMUNITY, SNMP_DUT_MODULE (one of cisco_asa, cisco_firepower, palo_alto, fortinet, generic)
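
A minimal .env sketch for that last item (the values below are illustrative placeholders; substitute your lab addressing and community string):

# .env at the repo root (illustrative values only)
SNMP_NEXUS_HOST=192.168.90.2
SNMP_NGFW_HOST=192.168.90.3
SNMP_COMMUNITY=public              # replace with your lab's read-only community string
SNMP_DUT_MODULE=cisco_firepower    # one of: cisco_asa, cisco_firepower, palo_alto, fortinet, generic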

If any of these are missing, do not proceed. The runbook depends on all of them.


Step 1 — Lab pre-flight (60–90 minutes)

Goal: confirm the cluster comes up healthy with all 20 Synthetic personas running, before any DUT involvement.

1.1 Clone the repository onto the UCS

Authenticated clone — see docs/CLONE_FOR_INSTALL.md for the four authentication options. Recommended for permanent installs: deploy key SSH (option B).

ssh ucs-1.example.com
mkdir -p ~/tlsstress && cd ~/tlsstress
# Option A — gh CLI (recommended for first install)
sudo apt update && sudo apt install -y gh
gh auth login   # follow prompts
gh repo clone nollagluiz/AI_forSE .
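
If this install is permanent, option B (deploy key SSH) keeps the clone independent of any personal gh login. A rough sketch, assuming the public key is registered as a read-only deploy key on the repository; the exact procedure is in docs/CLONE_FOR_INSTALL.md:

# Option B: deploy key SSH (sketch; follow docs/CLONE_FOR_INSTALL.md for the authoritative steps)
ssh-keygen -t ed25519 -f ~/.ssh/tlsstress_deploy -N ""
cat ~/.ssh/tlsstress_deploy.pub    # register this as a read-only deploy key on the repo
GIT_SSH_COMMAND="ssh -i ~/.ssh/tlsstress_deploy -o IdentitiesOnly=yes" \
  git clone git@github.com:nollagluiz/AI_forSE.git .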

Expected: directory contains dashboard/, agent/, personas/, k8s/, platform/, docs/, etc. Total clone size around 80 MB.

1.2 Install K3s + Multus + cert-manager

sudo bash scripts/k8s-install.sh single-node

The script:

  • Installs K3s
  • Adds Multus CNI
  • Adds cert-manager
  • Labels the node role=ngfw-dut and dut-data-plane=true
  • Sets up persistent volume hostpath if needed

Expected (last few lines):

✅ K3s up
✅ Multus installed
✅ cert-manager ready
✅ node labels applied

Verify:

kubectl get nodes
# expect: 1 node, status Ready, with labels visible via:
kubectl get nodes -o wide --show-labels | grep dut-data-plane

1.3 Apply the base stack

kubectl apply -f k8s/
kubectl apply -k platform/        # cert-manager Issuers + persona-ca-issuer
kubectl apply -k personas/        # 20 Synthetic personas

Wait for personas to become Ready (this takes 2–5 minutes the first time as the seeder generates content):

watch -n 5 'kubectl get pods -A | grep persona-'
# proceed when 20 pods are 1/1 Running with status Ready
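
If you prefer a command that blocks instead of watching, kubectl wait does the same check. The label selector below is an assumption; substitute whatever labels the persona pods actually carry:

# blocks until every matching pod is Ready, or fails after 10 minutes
# NOTE: app=persona is an assumed label selector, not confirmed by the manifests
kubectl wait pod -A -l app=persona --for=condition=Ready --timeout=10m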

If a persona stays in Init:0/1 for more than 5 minutes, see Step 1 troubleshooting below.

1.4 Confirm the Dashboard is reachable

kubectl get svc -n web-agents dashboard
# note the ClusterIP or NodePort

# from the UCS:
curl -sI http://<dashboard-ip>:3000 | head -5
# expect: HTTP/1.1 302 Found  (redirect to /login)

# or via NodePort if you exposed one:
curl -sI http://<UCS-IP>:<NodePort>/login

Open the dashboard in a browser. Expected on first load:

  • License Acceptance Modal appears (intercept). Fill role + Cisco email + acknowledgements. Accept.
  • Dashboard renders with empty state — no agents yet.
  • TLSStress.Art wordmark in the top-left, license footer at the bottom of every page.

1.5 Confirm Prometheus + Grafana stack

kubectl get pods -n observability
# expect: prometheus-0, grafana-XXX, loki-XXX, alertmanager-XXX all Running

Open Grafana (default :3001). Log in with admin / admin.

Expected:

  • Welcome screen says "TLSStress.Art" (after PR #190 merges)
  • Dashboards list shows all 10 dashboards prefixed with "TLSStress.Art —"
  • "TLSStress.Art — Personas — Overview" shows up to 30 personas alive (20 Synthetic always; 10 Cloned slots when active)

1.6 Step 1 checklist before proceeding

  • All 20 Synthetic personas in 1/1 Running
  • Dashboard reachable, License Acceptance accepted, locale rendering correctly
  • Grafana shows 10 dashboards, none "No data"
  • Prometheus is scraping all targets (Status → Targets, all UP; see the sketch after this list)
  • No CrashLoopBackOff in kubectl get pods -A
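
The Prometheus target check can also be scripted against its HTTP API (a sketch; substitute your Prometheus service address):

# print any scrape target that is not healthy; empty output means all targets are UP
curl -s http://<prometheus-ip>:9090/api/v1/targets | \
  jq -r '.data.activeTargets[] | select(.health != "up") | "\(.scrapePool) \(.scrapeUrl) \(.health)"'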

If any item fails — do not proceed.


Step 2 — DUT connection (30 minutes)

Goal: route data-plane traffic through the NGFW, with TLS decrypt active.

2.1 Apply Nexus 9000 tuning

# From a machine with reachability to the Nexus management IP:
ssh admin@<nexus-mgmt-ip>
# In NX-OS exec mode:
configure terminal
# Then paste the contents of scripts/nexus/01-apply-tuning.nxos
# Expected outputs vary; the script disables EEE, sets MTU 9216, applies QoS DSCP AF41,
# and configures ECMP hash with UDP port. Confirm with:
show running-config interface Eth1/1
# (Eth1/1 is the trunk port to the UCS — replace with your interface)

Expected: trunk interface shows mtu 9216, flowcontrol receive off, flowcontrol send off.

Verify VLANs are trunked:

show vlan brief
# expect VLANs 20, 30, 99, 101–120, 200–209 in active state

2.2 Inject NGFW CA into the cluster

# On the UCS (where you cloned the repo):
kubectl create configmap ngfw-ca \
  --from-file=ca.crt=/path/to/ngfw-ca.pem \
  -n web-agents \
  --dry-run=client -o yaml | kubectl apply -f -

Expected: configmap/ngfw-ca created (or configured).
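
Optionally, sanity-check that the file really is a parseable CA certificate (assuming openssl is available on the UCS):

# print the subject and expiry of the NGFW CA; an error here means the PEM is malformed
openssl x509 -in /path/to/ngfw-ca.pem -noout -subject -enddate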

2.3 Apply the DUT overlay

kubectl apply -k k8s/dut/

This applies:

  • 10-ngfw-ca.yaml — confirms the ConfigMap is in place
  • 20-network-attachments.yaml — Multus NAD definitions (VLANs 20, 30, 99) with routes pointing at NGFW
  • 60-snmp-exporter.yaml — SNMP scrape job for the NGFW
  • 70-network-policy-dut.yaml — NetworkPolicy DUT mode
  • 85-node-tuning.yaml — host sysctls (UDP/TCP buffers, BBR, FQ)
  • 15-tls-decrypt-probe.yaml — independent TLS decrypt verification probe (PR #180)
  • 95-deployment-mode.yaml — deployment-mode ConfigMap (set to single-node)
  • 97-deployment-mode-rules.yaml + 98-topology-correlation.yaml + 99-slo-and-anomaly-rules.yaml — recording rules + alerts
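
To confirm that every resource listed above actually landed, kubectl can read them back straight from the kustomization (a quick sketch):

# read back each resource defined by the overlay and show its current state
kubectl get -k k8s/dut/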

2.4 Patch the agent deployments

kubectl patch deployment web-agent -n web-agents \
  --type=strategic-merge-patch \
  --patch-file=k8s/dut/40-playwright-patch.yaml

kubectl patch deployment k6-agent -n web-agents \
  --type=strategic-merge-patch \
  --patch-file=k8s/dut/50-k6-patch.yaml

These patches add the net1 macvlan interface, the NGFW CA trust, and REJECT_INVALID_CERTS=true. Pods restart after the patch.

Wait for both deployments to be Ready:

kubectl rollout status deployment web-agent -n web-agents
kubectl rollout status deployment k6-agent -n web-agents
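
After the rollout, it is worth confirming that the macvlan interface really exists inside the patched pods (a sketch; it assumes the agent images ship iproute2):

# expect a net1 entry on each agent, addressed from the agent VLANs reserved earlier
kubectl exec -n web-agents deploy/web-agent -- ip -brief addr show net1
kubectl exec -n web-agents deploy/k6-agent -- ip -brief addr show net1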

2.5 Verify the TLS Decrypt Probe agrees

kubectl logs -n web-agents -l app=tls-decrypt-probe --tail=20
# expect: "decrypt mode: on" or "decrypt mode: off" — never "unknown"

Open Grafana → "TLSStress.Art — TLS Decrypt Mode — NGFW Inspection Verification". The metric web_agent_tls_decrypt_active should be 1 (active) — meaning the NGFW is decrypting TLS successfully.
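
If Grafana is not handy, the same signal can be read straight from Prometheus (a sketch, reusing the metric name from the dashboard above):

# 1 = NGFW is decrypting, 0 = not decrypting; no result means the probe is not being scraped
curl -s 'http://<prometheus-ip>:9090/api/v1/query?query=web_agent_tls_decrypt_active' | \
  jq '.data.result[] | {instance: .metric.instance, value: .value[1]}'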

If the probe says 0 or "unknown", do not proceed — your NGFW is not decrypting (or the CA is wrong, or the route is missing). See Step 2 troubleshooting.

2.6 Step 2 checklist

  • Nexus shows VLANs trunked, MTU 9216, flow-control off
  • NGFW CA ConfigMap applied
  • DUT overlay applied — all k8s/dut/ resources are Created
  • Both agent patches applied, deployments rolled out
  • TLS Decrypt Probe reports active (Grafana shows 1)
  • No CrashLoopBackOff after the DUT overlay

Step 3 — Smoke test (15 minutes)

Goal: run the lightest plan (BASELINE-SMOKE-5M) and confirm the data flows through the entire pipeline.

3.1 Identify available plans

curl -sS http://<dashboard-ip>:3000/api/test-plans/catalog | jq -r '.plans[]|.identifier'

Expected output: 15 lines starting with BASELINE-SMOKE-5M, BASELINE-SLO-30M, etc.

3.2 Trigger the smoke test

For Phase 1 (current state), the trigger is via Dashboard UI. For automation, see the API documented in TEST_PLANS.md.

Browse to Dashboard → Test Plans → BASELINE-SMOKE-5M → Run. Confirm the run starts.

3.3 Watch the run live

In Grafana, open "TLSStress.Art — Test Plan Execution — Current Run" (after PR #190 merges) or "Test Plan Execution — Current Run". Watch:

  • Active phase counter advances through the plan phases
  • Fleet target vs actual matches the plan
  • p99 latency starts populating

In the Dashboard, Runs page should show entries appearing.

3.4 Verify the runs table is populating

kubectl exec -n web-agents <postgres-pod> -- psql -U postgres -d dashboard \
  -c "SELECT COUNT(*) FROM runs WHERE started_at > now() - interval '5 minutes';"
# expect: > 0 (depends on cycle interval; expect ~50–500 for a 5-min smoke)

3.5 Wait for plan completion

BASELINE-SMOKE-5M runs for 5 minutes. After completion, the test_run_executions row gets endedAt and outcome set; see the check below.
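
A quick CLI check for completion (a sketch; only the table name, endedAt, and outcome come from this runbook, and the quoted camelCase column assumes the schema stores it that way):

# expect the most recent row to have a non-null endedAt and outcome = completed
kubectl exec -n web-agents <postgres-pod> -- psql -U postgres -d dashboard \
  -c 'SELECT outcome, "endedAt" FROM test_run_executions ORDER BY "endedAt" DESC NULLS LAST LIMIT 3;'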

3.6 Step 3 checklist

  • Plan started without errors
  • Grafana panels show live data during the run
  • runs table populated with > 50 rows
  • No agent in CrashLoopBackOff during the run
  • Plan completes (status completed in the executions table)
  • TLS Decrypt Probe stayed active throughout

Step 4 — First real measurement (35 minutes)

Goal: a 30-minute SLO-target run that produces a usable result.

4.1 Run BASELINE-SLO-30M

Same trigger pattern as the smoke test. Plan: BASELINE-SLO-30M. Duration: 30 minutes. SLO target: p99 ≤ 500 ms.

4.2 During the run — monitor

Open Grafana side-by-side:

  • "TLSStress.Art — Fleet Status — Deployment Aware" — confirm fleet at target
  • "TLSStress.Art — SLO + Burn-Rate + Anomaly Detection" — confirm no fast-burn alerts firing
  • "TLSStress.Art — Test-Bed Infrastructure Health" — confirm UCS is not the bottleneck (CPU < 70%, mem stable)
  • "TLSStress.Art — Topology Correlation — Where Is the Bottleneck" — should show the NGFW (DUT) as the constrained component, not your agents or personas

If your agents or personas saturate before the NGFW does, the test bed itself is the bottleneck and the result is meaningless. Stop and revisit Step 1 (size up the UCS or split topology to dual/tri/multi-node).

4.3 After the run — collect

The test_run_executions row now has endedAt and outcome. Hit GET /api/test-runs/<exec-id>/report.json and you have the structured report data (after PR #185 merges). For Phase 1 you can also browse /runs/<exec-id>/report for a print-styled HTML page.
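
For example (a sketch; the endpoint path is the one above, and jq just pretty-prints whatever the report contains):

# fetch the structured report for a finished execution and pretty-print it
curl -sS http://<dashboard-ip>:3000/api/test-runs/<exec-id>/report.json | jq .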

4.4 Compare to vendor expectation

Take the observed p99 and compare to:

  • The vendor's published throughput specs for the NGFW model
  • Any prior benchmarks the customer has on file
  • The plan's slo_target_p99_seconds (the catalog declares the target)

A good first run lands within ±30% of the vendor's spec for matching traffic shape. Larger drift means: NGFW config issue, network bottleneck, or a real performance gap worth investigating.

4.5 Step 4 checklist

  • Plan completed successfully (outcome: completed, not aborted)
  • p99 result available and reasonable (not absurd values)
  • No alerts fired indicating test-bed self-bottleneck
  • TLS Decrypt Probe stayed active throughout
  • Report data accessible via API + UI
  • First measurement documented — capture the run-id, plan-snapshot SHA-256, and observed p99 in your engagement notes

Step 1 troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Personas stuck Init:0/1 for > 5 min | persona-seeder image not pulled or seeder crashes | kubectl describe pod <persona-pod> — check init-container logs |
| Personas Pending forever | Multus NAD not ready or PV not bound | kubectl get networkattachmentdefinitions -A — verify NADs created |
| cert-manager-webhook timeouts | webhook slow on first install | Wait 90 s, retry. If persistent: kubectl rollout restart deployment cert-manager-webhook -n cert-manager |
| Dashboard 502 from ingress | Postgres not ready | kubectl logs -n web-agents postgres-0 — wait for "database system is ready to accept connections" |

Step 2 troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| TLS Decrypt Probe says unknown | NGFW unreachable on data plane, or CA wrong | Check kubectl logs -l app=tls-decrypt-probe, verify routes via kubectl exec into a persona and ip route, confirm NGFW management is up |
| TLS Decrypt Probe says 0 (off) | NGFW present but not decrypting | Check NGFW decrypt-policy is enabled, points to the right cert, applies to the relevant interfaces |
| Macvlan not coming up | Host network MTU mismatch or VLAN not trunked | nmcli / ip a on host + show interface on Nexus — MTU 9216 both sides |
| Persistent connection refused from agent | Persona DNS resolves to NGFW IP but route not via macvlan | Verify route in agent: kubectl exec <agent> -- ip route get 10.1.5.10 should show net1 |

Step 3 / Step 4 troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| runs table empty during a run | Agents are not running cycles | kubectl logs -l app=web-agent — look for register failures or controller-token mismatch |
| Grafana panel "No data" mid-run | scrape interval > query window, or query mistake | Wait 30 s; if persistent, re-import dashboard JSON |
| p99 spike to absurd value (e.g. 30 s) | Some persona DNS unresolvable, agent times out | Check kubectl logs <persona-pod> and the persona Service DNS |
| Plan status aborted | Operator stopped run, or fleet readiness alert fired | Read execution row's outcome field; check Alertmanager |

After your first successful run

  1. Save the engagement notes — run-id, plan-snapshot SHA-256, observed p99, NGFW model + version, observed throughput.
  2. Tag the cluster if this was a significant baseline:
    kubectl annotate cluster <cluster> \
      tlsstress.art/baseline-run="<run-id>" \
      tlsstress.art/baseline-plan="BASELINE-SLO-30M" \
      tlsstress.art/baseline-p99-ms="<value>"
    
  3. Choose the next plan based on engagement objectives:
    • Find capacity → CAP-FIND-KNEE-30M
    • Sustained at 90% → CAP-MAX-1H
    • Endurance → SOAK-ENDURANCE-24H
    • Failure modes → STR-OVERLOAD-15M

See TEST_PLANS.md for plan selection guidance.