
Monitoring test validity

Scope status (post-Scope-Freeze 2026-05-10) — See ARCHITECTURE.md for the canonical 37 MÓDULOs + 7 Test Kinds + DOM/CPOS/PIE-PA safety architecture. ADRs 0014, 0019-0025 cover post-Freeze additions.

Goal of this document: an operator should be able to answer, in seconds rather than hours, the question "is the NGFW measurement I just collected actually meaningful, or was the test bed itself the bottleneck?"

The cluster's purpose is to load a Next-Generation Firewall and measure how much of the request-rate / latency degradation is the NGFW's TLS-decryption cost. That measurement is only meaningful if the load-generating infrastructure (UCS hosts, Linux kernel, Caddy webservers, browser-engine + synthetic-load fleets, Nexus 9000 trunk, in-cluster NFS) keeps headroom while the test runs. If a host or fleet saturates first, every NGFW number printed by the dashboard is wrong.

This page describes the metrics, alerts, and Grafana dashboard that exist to make that condition obvious.


The four signals

Signal "Test invalidated" threshold Where it shows up
TestBedInfrastructureBottleneck persona p99 > 1 s and any host CPU > 85 %, or any HPA at maxReplicas Grafana "Test bed status" stat panel turns red. PrometheusAlert fires severity=critical.
HostUDPBufferOverflow UDP RcvbufErrors > 0 for 2 min on any host Any HTTP/3 timing measurement involving that host is invalid (kernel dropped QUIC packets).
HostConntrackNearFull Conntrack table > 80 % full All measurements involving that host are invalid (kernel will drop packets imminently).
HostNetworkPacketDrops > 100 packets/s dropped on a real NIC RTT and throughput numbers from that host are wrong.
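
For orientation, the UDP-overflow condition maps directly onto node_exporter's netstat counters. A minimal sketch of what a HostUDPBufferOverflow rule could look like (the group name, windows, and metadata here are illustrative; the shipped rule in 92-infra-prometheus-rules.yaml is authoritative):

groups:
  - name: infra-health.host
    rules:
      - alert: HostUDPBufferOverflow
        # node_netstat_Udp_RcvbufErrors counts datagrams the kernel dropped
        # because a socket receive buffer was full; any increase poisons
        # QUIC/HTTP/3 timing numbers from that host.
        expr: increase(node_netstat_Udp_RcvbufErrors[2m]) > 0
        for: 2m
        labels:
          severity: critical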

Any of these alerts firing invalidates the in-flight test run. Stop the run, fix the underlying issue, then re-run.

What you look at first

Open the Grafana dashboard Test-Bed Infrastructure Health (UID ai-forse-infra-health). Top row, four big stat panels:

| Panel | Green means | Red means |
| --- | --- | --- |
| Test bed status | persona latency stable AND host CPU < 85 % AND no HPA pinned | Run is suspect; investigate the next three panels. |
| Hosts with tuning missing | All hosts ran host-tuning.sh apply | At least one host is at default sysctls (conntrack 262 K, ports 28 K); that host will saturate first. |
| HPAs at maxReplicas | None pinned | Fleet is at full capacity; the test bed is the limit, not the NGFW. |
| OOMKills (15 m) | Zero | At least one container hit its memory limit. If that's a persona Caddy, the run is invalid. |
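
The "HPAs at maxReplicas" panel needs only kube-state-metrics. A sketch of the kind of query behind it (metric names follow kube-state-metrics v2; the shipped panel may phrase it differently):

# HPAs currently pinned at their configured maximum.
# An empty result means none are pinned (panel shows 0 / green).
count(
  kube_horizontalpodautoscaler_status_current_replicas
    >= kube_horizontalpodautoscaler_spec_max_replicas
)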

If everything is green and the alerts panel is empty, the NGFW measurements are honest — keep running.

If anything is red, the rows below the stat panels show where to look:

  • Per-host CPU saturation: line chart per node, threshold red at 90 % (see the query sketch below)
  • Memory + conntrack pressure: same per-node breakdown
  • Network drops / UDP overflow / TCP retransmits: three side-by-side charts; non-zero on any of them is bad
  • Pod CPU throttle + restarts: confirms which container is the actual constraint
  • Sysctl effective values: a table that lets you spot the host that didn't get tuned
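
The per-host CPU panel needs nothing beyond node_exporter. A query of roughly this shape (the dashboard's exact expression may differ) yields one 0-100 % series per node:

# CPU saturation per host: 100 % minus the idle share over 5 minutes
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))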

What gets monitored, and how

Three new components ship with this stack:

  1. node_exporter (k8s/dut/90-node-exporter.yaml) — a DaemonSet that runs on every UCS node and exports kernel-level metrics on :9100 over the OOBI control plane. Without this DaemonSet there is no host-level visibility at all.
  2. kube-state-metrics (k8s/dut/91-kube-state-metrics.yaml) — a Deployment that exposes Kubernetes object state (pod restart counts, OOMKill reasons, HPA replicas, etc.) as Prometheus metrics. Required for half of the existing alerts to fire at all.
  3. PrometheusRule web-agent-cluster-infra-health (k8s/dut/92-infra-prometheus-rules.yaml) — 14 alerts split across host, pod, and "test validity" groups, plus 7 recording rules that pre-aggregate the heavy joins so evaluation stays fast at a 30 s interval.
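
The recording-rule pattern looks roughly like this (a sketch: the rule name below follows the common node-exporter-mixin convention and is not necessarily what the shipped file uses):

groups:
  - name: infra-health.recording
    interval: 30s
    rules:
      # Pre-aggregate per-host CPU utilisation once per cycle so alert
      # expressions can reference a cheap gauge instead of re-running
      # the rate()+avg() join every evaluation.
      - record: instance:node_cpu_utilisation:rate5m
        expr: |
          1 - avg by (instance) (
            rate(node_cpu_seconds_total{mode="idle"}[5m])
          )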

The dashboard config map is at platform/observability/dashboards/infra-health-cm.yaml — Grafana picks it up via the grafana_dashboard: "1" label as long as the standard dashboard sidecar is running.
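
The shape of that ConfigMap, for reference (the name and JSON payload below are placeholders; only the grafana_dashboard: "1" label is load-bearing for the sidecar):

apiVersion: v1
kind: ConfigMap
metadata:
  name: infra-health-dashboard
  labels:
    grafana_dashboard: "1"   # the dashboard sidecar selects on this label
data:
  infra-health.json: |
    { "uid": "ai-forse-infra-health", "title": "Test-Bed Infrastructure Health" }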

Confirming tuning is live

Two complementary checks:

# Per-host, from the operator workstation (k3s example)
sudo scripts/host-tuning.sh status

# Cluster-wide, from Prometheus
sum(node_nf_conntrack_entries_limit) by (instance) > 1000000

Both should show every host at the tuned value (2 097 152 for conntrack). If a host is at the kernel default (~262 K), the TestBedSysctlMissing alert will fire automatically — no need to remember to look.
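
Because node_exporter already exports the effective limit, the automatic check reduces to a one-line expression. A sketch (the shipped TestBedSysctlMissing rule may carry different metadata):

- alert: TestBedSysctlMissing
  # Fires for any host still below the tuned conntrack limit of
  # 2 097 152, i.e. at or near the kernel default.
  expr: node_nf_conntrack_entries_limit < 2097152
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.instance }} is running untuned sysctls"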

Interpreting alerts during a run

A typical NGFW capacity test produces the following signals when the NGFW is the bottleneck (the goal):

  • Persona p99 latency rises smoothly with VU count
  • Persona CPU rises but stays below 70 %
  • Host CPU < 50 %, conntrack < 30 % full
  • No HPA at max
  • Zero TestBedInfrastructureBottleneck firings

The same signals when the test bed is the bottleneck (failure mode):

  • Persona p99 latency rises sharply at low VU count
  • A single host (or fleet) hits its ceiling before persona Caddy CPU does
  • TestBedInfrastructureBottleneck fires within 3 minutes of starting
  • HPAMaxedOut fires
  • Often coincides with HostUDPBufferOverflow or HostConntrackNearFull

The alerts are designed so the second case is impossible to miss.
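
To make the composite condition concrete, here is a hedged sketch of the shape of TestBedInfrastructureBottleneck. The persona latency histogram (persona_request_duration_seconds_bucket) and the recording rule it leans on (from the sketch above) are placeholders for whatever the fleet actually exports; the thresholds are the ones from the table in "The four signals":

- alert: TestBedInfrastructureBottleneck
  expr: |
    (
        histogram_quantile(0.99,
          sum by (le) (rate(persona_request_duration_seconds_bucket[5m]))) > 1
      and on ()
        max(instance:node_cpu_utilisation:rate5m) > 0.85
    )
    or on ()
    (
      count(
        kube_horizontalpodautoscaler_status_current_replicas
          >= kube_horizontalpodautoscaler_spec_max_replicas
      ) > 0
    )
  for: 3m
  labels:
    severity: critical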

Coverage extended in PR-V

PR-U (#167) shipped the host + pod baseline. PR-V closes the gaps that survived:

| New ServiceMonitor / Probe / Rule | What you can now see / alert on |
| --- | --- |
| Probe snmp-nexus9000 + snmp-ngfw-dut (k8s/dut/61-snmp-probes.yaml) | NGFW (SUT) and Nexus 9000 trunk metrics actually scraped; without these the project was blind to the very devices it exists to measure. |
| ServiceMonitor kubelet + cAdvisor (k8s/dut/93-kubelet-cadvisor.yaml) | container_cpu_*, container_memory_*, container_network_*, container_fs_*: real container resource usage. The dashboard's "Top 20 by CPU/RSS" panels need this. |
| ServiceMonitor reloader (k8s/87-stakater-reloader.yaml) | reloader_reload_executed_total: lets you catch silent failures of cert rotation. |
| PrometheusRule extended-health (k8s/dut/94-extended-prometheus-rules.yaml) | 16 new alerts in 7 groups (NIC, disk, system pressure, coverage, NFS, SNMP, reloader). |
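
For orientation, a Prometheus Operator Probe for an SNMP target generally takes this shape (the module name, exporter Service, and device address below are placeholders, not the shipped values):

apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: snmp-nexus9000
spec:
  module: if_mib              # SNMP exporter module (placeholder)
  prober:
    url: snmp-exporter:9116   # snmp_exporter Service (placeholder)
  interval: 30s
  scrapeTimeout: 30s
  targets:
    staticConfig:
      static:
        - 10.0.0.1            # Nexus 9000 mgmt address (placeholder)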

The 16 new alerts (extended set)

| Group | Alerts |
| --- | --- |
| NIC | NICLinkDown, NICCarrierFlapping, NICRingBufferOverflow |
| Disk | DiskIOLatencyHigh, DiskNearFull |
| System | LoadAverageHigh, FileDescriptorExhaustion, NTPClockSkew |
| Coverage | NodeTuningDaemonSetIncomplete, NodeExporterCoverageIncomplete, CNIDHCPDaemonIncomplete |
| NFS | NFSClientOperationFailures, NFSClientHighLatency |
| SNMP | SNMPProbeFailure, SNMPProbeSlow |
| Reloader | ReloaderRolloutFailure |
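
One concrete instance of the pattern, for flavour (the device regex is illustrative; the shipped rule decides which interfaces count as "real"):

- alert: NICLinkDown
  # node_network_up is 1 while operstate is "up"; exclude loopback and
  # virtual veth/CNI interfaces so only physical NICs can fire this.
  expr: node_network_up{device!~"lo|veth.*|cali.*|flannel.*|cni.*"} == 0
  for: 1m
  labels:
    severity: critical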

Total alert + panel count after PR-V

|  | Before PR-U | After PR-U | After PR-V |
| --- | --- | --- | --- |
| Host + pod alerts | 0 | 15 | 31 |
| Composite test-validity alerts | 0 | 3 | 3 |
| Grafana dashboard rows | 5 (persona-only) | 6 | 10 |
| ServiceMonitors | 1 | 3 | 6 |
| Probes (SNMP) | 0 | 0 | 2 |

New dashboard panels (Test-Bed Infrastructure Health)

  • NGFW + Nexus SNMP probes up — single stat that turns red when the SUT or trunk is invisible
  • SNMP probe duration — per-device timeseries; warns at 25 s (timeout = 30 s)
  • NIC link state — 1/0 per real NIC, drops to 0 instantly on cable / SFP failure
  • Disk I/O busy % — per device, > 50 % sustained = NFS server bottleneck
  • File descriptors used % — flags FD exhaustion before it crashes pods
  • Top 20 containers by CPU / RSS — cAdvisor; spot the heaviest container vs its limit
  • NFS RPC error rate / latency — slot-side view of cloned-sites delivery
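
The two SNMP panels need nothing exotic; both can be driven by the scrape metadata Prometheus records for every target. A sketch, assuming the Probe scrapes end up under jobs matching "snmp" (the actual job labels depend on how the Probe objects are named):

# Red when the SUT or trunk is invisible (probe target down)
min(up{job=~".*snmp.*"})

# Per-device walk duration; warn at 25 s against the 30 s scrapeTimeout
max by (instance) (scrape_duration_seconds{job=~".*snmp.*"}) > 25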

See also

The infra-health rule complements those: none of them duplicate one another, and all of them are needed.