Monitoring test validity¶
Scope status (post-Scope-Freeze 2026-05-10) — See ARCHITECTURE.md for the canonical 37 MÓDULOs + 7 Test Kinds + DOM/CPOS/PIE-PA safety architecture. ADRs 0014, 0019-0025 cover post-Freeze additions.
Goal of this document: an operator should be able to answer, in seconds rather than hours, the question "is the NGFW measurement I just collected actually meaningful, or was the test bed itself the bottleneck?"
The cluster's purpose is to load a Next-Generation Firewall and measure how much of the request-rate / latency degradation is the NGFW's TLS-decryption cost. That measurement is only meaningful if the load-generating infrastructure (UCS hosts, Linux kernel, Caddy webservers, browser-engine + synthetic-load fleets, Nexus 9000 trunk, in-cluster NFS) keeps headroom while the test runs. If a host or fleet saturates first, every NGFW number printed by the dashboard is wrong.
This page describes the metrics, alerts, and Grafana dashboard that exist to make that condition obvious.
The four signals¶
| Signal | "Test invalidated" threshold | Where it shows up |
|---|---|---|
| TestBedInfrastructureBottleneck | persona p99 > 1 s and any host CPU > 85 %, or any HPA at maxReplicas | Grafana "Test bed status" stat panel turns red; a Prometheus alert fires with severity=critical. |
| HostUDPBufferOverflow | UDP RcvbufErrors > 0 for 2 min on any host | Any HTTP/3 timing measurement involving that host is invalid (kernel dropped QUIC packets). |
| HostConntrackNearFull | Conntrack table > 80 % full | All measurements involving that host are invalid (kernel will drop packets imminently). |
| HostNetworkPacketDrops | > 100 packets/s dropped on a real NIC | RTT and throughput numbers from that host are wrong. |
Any of these alerts firing invalidates the in-flight test run. Stop the run, fix the underlying issue, and re-run.
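For orientation, a composite rule of roughly this shape can express the TestBedInfrastructureBottleneck condition. The shipped expression lives in `k8s/dut/92-infra-prometheus-rules.yaml`; the persona histogram name below (`persona_http_request_duration_seconds_bucket`) is an illustrative assumption, while the `node_*` and `kube_horizontalpodautoscaler_*` series are the standard node_exporter / kube-state-metrics names:

```yaml
# Sketch only — not the shipped rule; the persona metric name is assumed.
groups:
  - name: test-validity
    rules:
      - alert: TestBedInfrastructureBottleneck
        expr: |
          (
            histogram_quantile(0.99,
              sum by (le) (rate(persona_http_request_duration_seconds_bucket[5m]))) > 1
            and on ()
            max(1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.85
          )
          or on ()
          count(
            kube_horizontalpodautoscaler_status_current_replicas
              >= kube_horizontalpodautoscaler_spec_max_replicas
          ) > 0
        for: 2m
        labels:
          severity: critical
```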
What you look at first¶
Open the Grafana dashboard Test-Bed Infrastructure Health (UID ai-forse-infra-health). Top row, four big stat panels:
| Panel | Green means | Red means |
|---|---|---|
| Test bed status | persona latency stable AND host CPU < 85 % AND no HPA pinned | Run is suspect. Investigate the next three panels. |
| Hosts with tuning missing | All hosts ran host-tuning.sh apply | At least one host is at default sysctls (conntrack 262 K, ports 28 K) — that host will saturate first. |
| HPAs at maxReplicas | None pinned | Fleet is at full capacity; test bed is the limit, not the NGFW. |
| OOMKills (15 m) | Zero | At least one container hit its memory limit. If that's a persona Caddy, the run is invalid. |
If everything is green and the alerts panel is empty, the NGFW measurements are honest — keep running.
If anything is red, the rows below the stat panels show where to look:

- Per-host CPU saturation — line chart per node, threshold red at 90 %
- Memory + conntrack pressure — same per-node breakdown
- Network drops / UDP overflow / TCP retransmits — three side-by-side charts; non-zero on any of them is bad
- Pod CPU throttle + restarts — confirms which container is the actual constraint
- Sysctl effective values — table; lets you spot the host that didn't get tuned
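The OOMKills stat panel can also be reproduced ad hoc. A query of roughly this shape works, assuming the standard kube-state-metrics series names (the exact panel query may differ):

```promql
# Containers OOMKilled recently: last-terminated reason is OOMKilled
# and the container restarted inside the 15-minute window.
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}
  and on (namespace, pod, container)
increase(kube_pod_container_status_restarts_total[15m]) > 0
```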
What gets monitored, and how¶
Three new components ship with this stack:
- `node_exporter` (`k8s/dut/90-node-exporter.yaml`) — a DaemonSet that runs on every UCS node and exports kernel-level metrics on `:9100` over the OOBI control plane. Without this DaemonSet there is no host-level visibility at all.
- `kube-state-metrics` (`k8s/dut/91-kube-state-metrics.yaml`) — a Deployment that exposes Kubernetes object state (pod restart counts, OOMKill reasons, HPA replicas, etc.) as Prometheus metrics. Required for half of the existing alerts to fire at all.
- PrometheusRule `web-agent-cluster-infra-health` (`k8s/dut/92-infra-prometheus-rules.yaml`) — 14 alerts split across host, pod, and "test validity" groups, plus 7 recording rules that pre-aggregate the heavy joins so evaluation stays fast at a 30 s interval.
The dashboard config map is at platform/observability/dashboards/infra-health-cm.yaml — Grafana picks it up via the grafana_dashboard: "1" label as long as the standard dashboard sidecar is running.
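For reference, the sidecar convention means any ConfigMap of this shape is picked up automatically; the name below is illustrative, and the dashboard JSON is elided (see `platform/observability/dashboards/infra-health-cm.yaml` for the real one):

```yaml
# Shape only — the actual manifest is infra-health-cm.yaml.
apiVersion: v1
kind: ConfigMap
metadata:
  name: infra-health-dashboard      # illustrative name
  labels:
    grafana_dashboard: "1"          # the label the Grafana sidecar watches for
data:
  infra-health.json: |
    { ...dashboard JSON... }
```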
Confirming tuning is live¶
Two complementary checks:
```shell
# Per-host, from the operator workstation (k3s example)
sudo scripts/host-tuning.sh status
```

```promql
# Cluster-wide, from Prometheus
sum(node_nf_conntrack_entries_limit) by (instance) > 1000000
```
Both should show every host at the tuned value (2 097 152 for conntrack). If a host is at the kernel default (~262 K), the TestBedSysctlMissing alert will fire automatically — no need to remember to look.
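As a quick off-cluster sanity check, the same 80 % threshold that HostConntrackNearFull alerts on can be computed by hand from `/proc` values. This helper is a sketch (not part of `host-tuning.sh`); it takes count and limit as arguments so it can be exercised without a live host:

```shell
#!/bin/sh
# Sketch, not part of host-tuning.sh: flag a conntrack table that is
# >= 80 % full — the same threshold HostConntrackNearFull alerts on.
# On a live host the two inputs come from:
#   /proc/sys/net/netfilter/nf_conntrack_count
#   /proc/sys/net/netfilter/nf_conntrack_max
conntrack_fill() {
  count=$1
  limit=$2
  pct=$(( count * 100 / limit ))   # integer percentage of table fill
  if [ "$pct" -ge 80 ]; then
    echo "ALERT ${pct}"            # any in-flight run is invalid
  else
    echo "OK ${pct}"
  fi
}

# Tuned host (limit 2097152) under moderate load:
conntrack_fill 600000 2097152      # prints: OK 28
```

Note how the untuned kernel default (~262 K) leaves almost no margin: the same connection count that is harmless on a tuned host trips the threshold on a default one.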
Interpreting alerts during a run¶
A typical NGFW capacity test produces the following signals when the NGFW is the bottleneck (the goal):
- Persona p99 latency rises smoothly with VU count
- Persona CPU rises but stays below 70 %
- Host CPU < 50 %, conntrack < 30 % full
- No HPA at max
- Zero TestBedInfrastructureBottleneck firings
The same signals when the test bed is the bottleneck (failure mode):
- Persona p99 latency rises sharply at low VU count
- A single host (or fleet) hits its ceiling before persona Caddy CPU does
- TestBedInfrastructureBottleneck fires within 3 minutes of starting
- HPAMaxedOut fires
- Often coincides with HostUDPBufferOverflow or HostConntrackNearFull
The alerts are designed so the second case is impossible to miss.
Coverage extended in PR-V¶
PR-U (#167) shipped the host + pod baseline. PR-V closes the gaps that survived:
| New ServiceMonitor / Probe / Rule | What you can now see / alert on |
|---|---|
| Probe `snmp-nexus9000` + `snmp-ngfw-dut` (`k8s/dut/61-snmp-probes.yaml`) | NGFW (SUT) and Nexus 9000 trunk metrics actually scraped — without these the project was blind to the very devices it exists to measure. |
| ServiceMonitor `kubelet` + cAdvisor (`k8s/dut/93-kubelet-cadvisor.yaml`) | `container_cpu_*`, `container_memory_*`, `container_network_*`, `container_fs_*` — real container resource usage. The dashboard's "Top 20 by CPU/RSS" panels need this. |
| ServiceMonitor `reloader` (`k8s/87-stakater-reloader.yaml`) | `reloader_reload_executed_total` — surfaces silent failures of cert rotation. |
| PrometheusRule `extended-health` (`k8s/dut/94-extended-prometheus-rules.yaml`) | 16 new alerts in 7 groups (NIC, disk, system pressure, coverage, NFS, SNMP, reloader). |
The 16 new alerts (extended set)¶
| Group | Alerts |
|---|---|
| NIC | NICLinkDown, NICCarrierFlapping, NICRingBufferOverflow |
| Disk | DiskIOLatencyHigh, DiskNearFull |
| System | LoadAverageHigh, FileDescriptorExhaustion, NTPClockSkew |
| Coverage | NodeTuningDaemonSetIncomplete, NodeExporterCoverageIncomplete, CNIDHCPDaemonIncomplete |
| NFS | NFSClientOperationFailures, NFSClientHighLatency |
| SNMP | SNMPProbeFailure, SNMPProbeSlow |
| Reloader | ReloaderRolloutFailure |
Total alert + panel count after PR-V¶
| | Before PR-U | After PR-U | After PR-V |
|---|---|---|---|
| Host + pod alerts | 0 | 15 | 31 |
| Composite test-validity alerts | 0 | 3 | 3 |
| Grafana dashboard rows | 5 (persona-only) | 6 | 10 |
| ServiceMonitors | 1 | 3 | 6 |
| Probes (SNMP) | 0 | 0 | 2 |
New dashboard panels (Test-Bed Infrastructure Health)¶
- NGFW + Nexus SNMP probes up — single stat that turns red when the SUT or trunk is invisible
- SNMP probe duration — per-device timeseries; warns at 25 s (timeout = 30 s)
- NIC link state — 1/0 per real NIC, drops to 0 instantly on cable / SFP failure
- Disk I/O busy % — per device, > 50 % sustained = NFS server bottleneck
- File descriptors used % — flags FD exhaustion before it crashes pods
- Top 20 containers by CPU / RSS — cAdvisor; spot the heaviest container vs its limit
- NFS RPC error rate / latency — slot-side view of cloned-sites delivery
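The NIC link-state and file-descriptor panels map onto standard node_exporter series. Two illustrative query shapes (the device regex is an assumption about this cluster's interface naming, not taken from the shipped dashboard):

```promql
# NIC link state — physical NICs only; veth/lo excluded by an assumed regex
node_network_up{device!~"veth.*|lo"}

# File descriptors used, as a percentage of the kernel maximum
100 * node_filefd_allocated / node_filefd_maximum
```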
See also¶
- `PERFORMANCE_TUNING_HOST.md` — the host knobs the alerts validate
- `scripts/host-tuning.sh` — one-shot apply / status / remove
- `k8s/70-prometheus-rules.yaml` — the existing agent + cloner alerts
- `platform/observability/rules/personas-rules.yaml` — the existing persona Caddy alerts
The infra-health rules complement those — none of them overlap, and all of them are needed.