
Monitoring test validity

Scope status (post-Scope-Freeze 2026-05-10) — See ARCHITECTURE.md for the canonical 37 MÓDULOs + 7 Test Kinds + DOM/CPOS/PIE-PA safety architecture. ADRs 0014, 0019-0025 cover post-Freeze additions.

Goal of this document: an operator should be able to answer, in seconds rather than hours, the question "is the NGFW measurement I just collected actually meaningful, or was the test bed itself the bottleneck?"

The cluster's purpose is to load a Next-Generation Firewall and measure how much of the request-rate / latency degradation is the NGFW's TLS-decryption cost. That measurement is only meaningful if the load-generating infrastructure (UCS hosts, Linux kernel, Caddy webservers, browser-engine + synthetic-load fleets, Nexus 9000 trunk, in-cluster NFS) keeps headroom while the test runs. If a host or fleet saturates first, every NGFW number printed by the dashboard is wrong.

This page describes the metrics, alerts, and Grafana dashboard that exist to make that condition obvious.


The four signals

Signal "Test invalidated" threshold Where it shows up
TestBedInfrastructureBottleneck persona p99 > 1 s and any host CPU > 85 %, or any HPA at maxReplicas Grafana "Test bed status" stat panel turns red. PrometheusAlert fires severity=critical.
HostUDPBufferOverflow UDP RcvbufErrors > 0 for 2 min on any host Any HTTP/3 timing measurement involving that host is invalid (kernel dropped QUIC packets).
HostConntrackNearFull Conntrack table > 80 % full All measurements involving that host are invalid (kernel will drop packets imminently).
HostNetworkPacketDrops > 100 packets/s dropped on a real NIC RTT and throughput numbers from that host are wrong.
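
For orientation, the UDP-overflow condition maps directly onto node_exporter's netstat counters. A minimal sketch of what a HostUDPBufferOverflow rule could look like (the group name, windows, and metadata here are illustrative; the shipped rule in 92-infra-prometheus-rules.yaml is authoritative):

groups:
  - name: infra-health.host
    rules:
      - alert: HostUDPBufferOverflow
        # node_netstat_Udp_RcvbufErrors counts datagrams the kernel dropped
        # because a socket receive buffer was full; any increase poisons
        # QUIC/HTTP/3 timing numbers from that host.
        expr: increase(node_netstat_Udp_RcvbufErrors[2m]) > 0
        for: 2m
        labels:
          severity: critical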

Any of these alerts firing invalidates the in-flight test run. Stop the run, fix the underlying issue, then re-run.

What you look at first

Open the Grafana dashboard Test-Bed Infrastructure Health (UID ai-forse-infra-health). Top row, four big stat panels:

| Panel | Green means | Red means |
| --- | --- | --- |
| Test bed status | persona latency stable AND host CPU < 85 % AND no HPA pinned | Run is suspect; investigate the next three panels. |
| Hosts with tuning missing | All hosts ran host-tuning.sh apply | At least one host is at default sysctls (conntrack 262 K, ports 28 K); that host will saturate first. |
| HPAs at maxReplicas | None pinned | Fleet is at full capacity; the test bed is the limit, not the NGFW. |
| OOMKills (15 m) | Zero | At least one container hit its memory limit. If that's a persona Caddy, the run is invalid. |
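
The "HPAs at maxReplicas" panel needs only kube-state-metrics. A sketch of the kind of query behind it (metric names follow kube-state-metrics v2; the shipped panel may phrase it differently):

# HPAs currently pinned at their configured maximum.
# An empty result means none are pinned (panel shows 0 / green).
count(
  kube_horizontalpodautoscaler_status_current_replicas
    >= kube_horizontalpodautoscaler_spec_max_replicas
)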

If everything is green and the alerts panel is empty, the NGFW measurements are honest — keep running.

If anything is red, the rows below the stat panels show where to look:

  • Per-host CPU saturation: line chart per node, threshold red at 90 % (see the query sketch below)
  • Memory + conntrack pressure: same per-node breakdown
  • Network drops / UDP overflow / TCP retransmits: three side-by-side charts; non-zero on any of them is bad
  • Pod CPU throttle + restarts: confirms which container is the actual constraint
  • Sysctl effective values: a table that lets you spot the host that didn't get tuned
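
The per-host CPU panel needs nothing beyond node_exporter. A query of roughly this shape (the dashboard's exact expression may differ) yields one 0-100 % series per node:

# CPU saturation per host: 100 % minus the idle share over 5 minutes
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))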

What gets monitored, and how

Three new components ship with this stack:

  1. node_exporter (k8s/dut/90-node-exporter.yaml) — a DaemonSet that runs on every UCS node and exports kernel-level metrics on :9100 over the OOBI control plane. Without this DaemonSet there is no host-level visibility at all.
  2. kube-state-metrics (k8s/dut/91-kube-state-metrics.yaml) — a Deployment that exposes Kubernetes object state (pod restart counts, OOMKill reasons, HPA replicas, etc.) as Prometheus metrics. Required for half of the existing alerts to fire at all.
  3. PrometheusRule web-agent-cluster-infra-health (k8s/dut/92-infra-prometheus-rules.yaml) — 14 alerts split across host, pod, and "test validity" groups, plus 7 recording rules that pre-aggregate the heavy joins so evaluation stays fast at a 30 s interval.
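
The recording-rule pattern looks roughly like this (a sketch: the rule name below follows the common node-exporter-mixin convention and is not necessarily what the shipped file uses):

groups:
  - name: infra-health.recording
    interval: 30s
    rules:
      # Pre-aggregate per-host CPU utilisation once per cycle so alert
      # expressions can reference a cheap gauge instead of re-running
      # the rate()+avg() join every evaluation.
      - record: instance:node_cpu_utilisation:rate5m
        expr: |
          1 - avg by (instance) (
            rate(node_cpu_seconds_total{mode="idle"}[5m])
          )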

The dashboard config map is at platform/observability/dashboards/infra-health-cm.yaml — Grafana picks it up via the grafana_dashboard: "1" label as long as the standard dashboard sidecar is running.
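
The shape of that ConfigMap, for reference (the name and JSON payload below are placeholders; only the grafana_dashboard: "1" label is load-bearing for the sidecar):

apiVersion: v1
kind: ConfigMap
metadata:
  name: infra-health-dashboard
  labels:
    grafana_dashboard: "1"   # the dashboard sidecar selects on this label
data:
  infra-health.json: |
    { "uid": "ai-forse-infra-health", "title": "Test-Bed Infrastructure Health" }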

Confirming tuning is live

Two complementary checks:

# Per-host, from the operator workstation (k3s example)
sudo scripts/host-tuning.sh status

# Cluster-wide, from Prometheus
sum(node_nf_conntrack_entries_limit) by (instance) > 1000000

Both should show every host at the tuned value (2 097 152 for conntrack). If a host is at the kernel default (~262 K), the TestBedSysctlMissing alert will fire automatically — no need to remember to look.
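
Because node_exporter already exports the effective limit, the automatic check reduces to a one-line expression. A sketch (the shipped TestBedSysctlMissing rule may carry different metadata):

- alert: TestBedSysctlMissing
  # Fires for any host still below the tuned conntrack limit of
  # 2 097 152, i.e. at or near the kernel default.
  expr: node_nf_conntrack_entries_limit < 2097152
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.instance }} is running untuned sysctls"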

Interpreting alerts during a run

A typical NGFW capacity test produces the following signals when the NGFW is the bottleneck (the goal):

  • Persona p99 latency rises smoothly with VU count
  • Persona CPU rises but stays below 70 %
  • Host CPU < 50 %, conntrack < 30 % full
  • No HPA at max
  • Zero TestBedInfrastructureBottleneck firings

The same signals when the test bed is the bottleneck (failure mode):

  • Persona p99 latency rises sharply at low VU count
  • A single host (or fleet) hits its ceiling before persona Caddy CPU does
  • TestBedInfrastructureBottleneck fires within 3 minutes of starting
  • HPAMaxedOut fires
  • Often coincides with HostUDPBufferOverflow or HostConntrackNearFull

The alerts are designed so the second case is impossible to miss.
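
To make the composite condition concrete, here is a hedged sketch of the shape of TestBedInfrastructureBottleneck. The persona latency histogram (persona_request_duration_seconds_bucket) and the recording rule it leans on (from the sketch above) are placeholders for whatever the fleet actually exports; the thresholds are the ones from the table in "The four signals":

- alert: TestBedInfrastructureBottleneck
  expr: |
    (
        histogram_quantile(0.99,
          sum by (le) (rate(persona_request_duration_seconds_bucket[5m]))) > 1
      and on ()
        max(instance:node_cpu_utilisation:rate5m) > 0.85
    )
    or on ()
    (
      count(
        kube_horizontalpodautoscaler_status_current_replicas
          >= kube_horizontalpodautoscaler_spec_max_replicas
      ) > 0
    )
  for: 3m
  labels:
    severity: critical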

Coverage extended in PR-V

PR-U (#167) shipped the host + pod baseline. PR-V closes the gaps that survived:

| New ServiceMonitor / Probe / Rule | What you can now see / alert on |
| --- | --- |
| Probe snmp-nexus9000 + snmp-ngfw-dut (k8s/dut/61-snmp-probes.yaml) | NGFW (SUT) and Nexus 9000 trunk metrics actually scraped; without these the project was blind to the very devices it exists to measure. |
| ServiceMonitor kubelet + cAdvisor (k8s/dut/93-kubelet-cadvisor.yaml) | container_cpu_*, container_memory_*, container_network_*, container_fs_*: real container resource usage. The dashboard's "Top 20 by CPU/RSS" panels need this. |
| ServiceMonitor reloader (k8s/87-stakater-reloader.yaml) | reloader_reload_executed_total: lets you catch silent failures of cert rotation. |
| PrometheusRule extended-health (k8s/dut/94-extended-prometheus-rules.yaml) | 16 new alerts in 7 groups (NIC, disk, system pressure, coverage, NFS, SNMP, reloader). |
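
For orientation, a Prometheus Operator Probe for an SNMP target generally takes this shape (the module name, exporter Service, and device address below are placeholders, not the shipped values):

apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: snmp-nexus9000
spec:
  module: if_mib              # SNMP exporter module (placeholder)
  prober:
    url: snmp-exporter:9116   # snmp_exporter Service (placeholder)
  interval: 30s
  scrapeTimeout: 30s
  targets:
    staticConfig:
      static:
        - 10.0.0.1            # Nexus 9000 mgmt address (placeholder)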

The 16 new alerts (extended set)

| Group | Alerts |
| --- | --- |
| NIC | NICLinkDown, NICCarrierFlapping, NICRingBufferOverflow |
| Disk | DiskIOLatencyHigh, DiskNearFull |
| System | LoadAverageHigh, FileDescriptorExhaustion, NTPClockSkew |
| Coverage | NodeTuningDaemonSetIncomplete, NodeExporterCoverageIncomplete, CNIDHCPDaemonIncomplete |
| NFS | NFSClientOperationFailures, NFSClientHighLatency |
| SNMP | SNMPProbeFailure, SNMPProbeSlow |
| Reloader | ReloaderRolloutFailure |
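
One concrete instance of the pattern, for flavour (the device regex is illustrative; the shipped rule decides which interfaces count as "real"):

- alert: NICLinkDown
  # node_network_up is 1 while operstate is "up"; exclude loopback and
  # virtual veth/CNI interfaces so only physical NICs can fire this.
  expr: node_network_up{device!~"lo|veth.*|cali.*|flannel.*|cni.*"} == 0
  for: 1m
  labels:
    severity: critical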

Total alert + panel count after PR-V

|  | Before PR-U | After PR-U | After PR-V |
| --- | --- | --- | --- |
| Host + pod alerts | 0 | 15 | 31 |
| Composite test-validity alerts | 0 | 3 | 3 |
| Grafana dashboard rows | 5 (persona-only) | 6 | 10 |
| ServiceMonitors | 1 | 3 | 6 |
| Probes (SNMP) | 0 | 0 | 2 |

New dashboard panels (Test-Bed Infrastructure Health)

  • NGFW + Nexus SNMP probes up — single stat that turns red when the SUT or trunk is invisible
  • SNMP probe duration — per-device timeseries; warns at 25 s (timeout = 30 s)
  • NIC link state — 1/0 per real NIC, drops to 0 instantly on cable / SFP failure
  • Disk I/O busy % — per device, > 50 % sustained = NFS server bottleneck
  • File descriptors used % — flags FD exhaustion before it crashes pods
  • Top 20 containers by CPU / RSS — cAdvisor; spot the heaviest container vs its limit
  • NFS RPC error rate / latency — slot-side view of cloned-sites delivery
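
The two SNMP panels need nothing exotic; both can be driven by the scrape metadata Prometheus records for every target. A sketch, assuming the Probe scrapes end up under jobs matching "snmp" (the actual job labels depend on how the Probe objects are named):

# Red when the SUT or trunk is invisible (probe target down)
min(up{job=~".*snmp.*"})

# Per-device walk duration; warn at 25 s against the 30 s scrapeTimeout
max by (instance) (scrape_duration_seconds{job=~".*snmp.*"}) > 25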

See also

The infra-health rule complements those: none of them duplicate one another, and all of them are needed.