Distributed tracing — Tempo + OpenTelemetry¶
Scope status (post-Scope-Freeze 2026-05-10) — See ARCHITECTURE.md for the canonical 37 MÓDULOs + 7 Test Kinds + DOM/CPOS/PIE-PA safety architecture. ADRs 0014, 0019-0025 cover post-Freeze additions.
Status: Tempo OTLP receiver enabled (gRPC :4317 + HTTP :4318). browser-engine + synthetic-load agents ship OpenTelemetry SDK code (agent/src/telemetry.ts), opt-in via overlays/tracing/. Caddy persona instrumentation is pending a follow-up (requires an xcaddy rebuild with the caddy-otel module).
Why distributed tracing here¶
The headline measurement of this test bed is the per-request latency experienced when an agent (browser engine or synthetic-load engine) talks to a persona webserver through the NGFW under test. A single request crosses four hops:
browser-engine/synthetic-load agent → NGFW (DUT, TLS leg 1) → NGFW (TLS leg 2) → Caddy persona
              hop 1                        hop 2                  hop 3             hop 4
Without distributed tracing, the operator only sees the black-box end-to-end latency from the agent's perspective. A regression of 50 ms could be in the NGFW, in the Caddy persona, or in the Linux kernel between them — Prometheus metrics alone cannot tell you which.
Distributed tracing surfaces the per-hop breakdown, which is exactly what commercial appliances (Spirent CyberFlood, Ixia BreakingPoint) do not expose. This is a strategic differentiator for this project — and the reason we ship the SDK code at the agent level even though we keep it OPT-IN by default.
Opt-in by default — performance footprint considerations¶
Distributed tracing is disabled by default. The agents run with their normal memory + CPU footprint until the operator explicitly enables it.
Why opt-in: the agent fleet is designed for high agent counts (browser engine 1–300, synthetic-load engine 1–1000) on lab hardware that may have memory/CPU constraints. We don't want every operator to pay the SDK cost for every test run; we want them to choose to pay it during baseline measurement and deep-dive runs.
Footprint when ENABLED (measured on Ubuntu 22.04 + Node 20):
| Resource | Per browser-engine agent | Per synthetic-load agent |
|---|---|---|
| RSS memory | +10–15 MB | +5–10 MB |
| CPU steady state | <1% | <1% |
| Per-request latency added | ~50–100 µs | ~50–100 µs |
| Per-request latency overhead vs 200 ms baseline | ~0.05% | ~0.05% |
Mitigations baked into the overlay:
- Probabilistic sampling at 10% (OTEL_TRACES_SAMPLER_ARG=0.1). At peak fleet (300 browser engine × 1 cycle/5 s) that is ~6 traces/s reaching Tempo
- @opentelemetry/instrumentation-fs disabled — fs is noisy and irrelevant
- HTTP exporter (not gRPC) — keep-alive, no extra binary deps
- Bounded BatchSpanProcessor queue (OTEL_BSP_MAX_QUEUE_SIZE=2048) so memory is predictable
- OTEL_SDK_DISABLED=true kill-switch always honoured
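The env-var wiring the overlay applies can be sketched as a small helper. This is a hypothetical illustration, not the overlay's actual code: buildTracingEnv and the tempo.web-agents.svc default hostname are assumptions, while OTEL_TRACES_SAMPLER_ARG, OTEL_BSP_MAX_QUEUE_SIZE, and OTEL_SDK_DISABLED are the variables named in the list above and OTEL_EXPORTER_OTLP_ENDPOINT is the standard OTLP endpoint variable:

```typescript
// Hypothetical helper: assemble the OTEL_* environment injected into the
// agent Deployments. Variable names follow the OpenTelemetry SDK
// environment-variable conventions.
function buildTracingEnv(opts: {
  sampleRate?: number; // fraction of traces kept (default 0.1 = 10%)
  tempoHost?: string;  // Tempo service hostname (assumed default below)
  enabled?: boolean;   // kill-switch: false sets OTEL_SDK_DISABLED=true
}): Record<string, string> {
  const sampleRate = opts.sampleRate ?? 0.1;
  const tempoHost = opts.tempoHost ?? "tempo.web-agents.svc";
  return {
    OTEL_SDK_DISABLED: opts.enabled === false ? "true" : "false",
    OTEL_TRACES_SAMPLER: "parentbased_traceidratio",
    OTEL_TRACES_SAMPLER_ARG: String(sampleRate),
    // HTTP (not gRPC) exporter: port 4318, matching the receiver config above.
    OTEL_EXPORTER_OTLP_ENDPOINT: `http://${tempoHost}:4318`,
    // Bounded batch queue -> predictable memory ceiling.
    OTEL_BSP_MAX_QUEUE_SIZE: "2048",
  };
}

console.log(buildTracingEnv({ sampleRate: 0.1 }).OTEL_TRACES_SAMPLER_ARG); // "0.1"
```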
How to enable¶
After bringing up the cluster in any deployment mode:
sudo ./scripts/k8s-install.sh --mode=<single|dual|tri|multi> ...
# Toggle tracing on with the helper script. The script patches the
# running web-agent + k6-agent Deployments with OTEL_* env vars and
# rotates the pods automatically.
./scripts/tracing-toggle.sh on
Override defaults via environment:
TRACING_SAMPLE_RATE=1.0 ./scripts/tracing-toggle.sh on # 100% sampling
TEMPO_HOST=tempo.observability.svc ./scripts/tracing-toggle.sh on # custom Tempo
Verify the SDK booted (look for the [telemetry] OpenTelemetry SDK started line):
kubectl logs -n web-agents deploy/web-agent | grep telemetry
kubectl logs -n web-agents deploy/k6-agent | grep telemetry
Open Tempo UI to see traces:
kubectl port-forward -n web-agents svc/tempo 3200:3200
# then visit http://localhost:3200
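Besides the UI, traces can be located programmatically. A small sketch that builds a search URL against the port-forward above, assuming Tempo 2.x's /api/search endpoint with a TraceQL q parameter (tempoSearchUrl is a hypothetical helper):

```typescript
// Hypothetical helper: build a Tempo search URL for traces from a given
// agent service, assuming the Tempo 2.x /api/search endpoint and TraceQL.
function tempoSearchUrl(base: string, serviceName: string, limit = 20): string {
  const q = `{resource.service.name="${serviceName}"}`; // TraceQL selector
  return `${base}/api/search?q=${encodeURIComponent(q)}&limit=${limit}`;
}

console.log(tempoSearchUrl("http://localhost:3200", "web-agent"));
```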
How to disable¶
./scripts/tracing-toggle.sh off
This sets OTEL_SDK_DISABLED=true on both Deployments and restarts the pods. The SDK init in agent/src/telemetry.ts returns early when it sees this flag, so no overhead is paid until you toggle back on.
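The early-return guard might look roughly like this. A sketch only: the real agent/src/telemetry.ts wires up the full OpenTelemetry NodeSDK after the guard, which is elided here:

```typescript
// Sketch of the kill-switch guard: when OTEL_SDK_DISABLED=true, init bails
// out before any SDK object is constructed, so a disabled agent pays no
// tracing overhead at all.
function initTelemetry(
  env: Record<string, string | undefined> = typeof process !== "undefined" ? process.env : {},
): boolean {
  if (env.OTEL_SDK_DISABLED === "true") {
    return false; // tracing off: nothing allocated, nothing exported
  }
  // ...the real module constructs and starts the OpenTelemetry NodeSDK here...
  console.log("[telemetry] OpenTelemetry SDK started");
  return true;
}
```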
To check current state:
./scripts/tracing-toggle.sh status
What the trace looks like¶
When enabled, each browser engine cycle becomes a parent span:
[span] web-agent: cycle (1.2s)
├─ [span] http: GET https://shop.persona.internal/ (340 ms)
│ ├─ [span] dns: lookup shop.persona.internal (2 ms)
│ ├─ [span] tcp: connect 10.1.1.2:443 (8 ms)
│ ├─ [span] tls: handshake (45 ms) ← agent → NGFW (TLS leg 1)
│ └─ [span] http: response (285 ms) ← NGFW → persona (TLS leg 2 + content)
├─ [span] http: GET .../static/main.css (120 ms)
└─ ... per-resource child spans
Trace IDs propagate via standard W3C traceparent / tracestate headers. The NGFW must preserve these headers — most modern firewalls do; some QUIC inspection paths reset headers, so for HTTP/3 traces use HTTP/1.1 fallback or run with TLS decrypt ON so the NGFW reconstructs the request.
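To confirm the header survived the NGFW, you can decode it by hand. A minimal parser for the W3C version-00 traceparent format (version-traceid-parentid-flags), with parseTraceparent as an illustrative helper:

```typescript
// Parse a W3C trace-context `traceparent` header (version 00):
//   version(2 hex)-trace-id(32 hex)-parent-id(16 hex)-flags(2 hex)
interface TraceParent {
  version: string;
  traceId: string;
  parentId: string;
  sampled: boolean; // lowest flag bit = sampling decision
}

function parseTraceparent(header: string): TraceParent | null {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(
    header.trim(),
  );
  if (!m) return null;
  const [, version, traceId, parentId, flags] = m;
  // All-zero trace-id or parent-id is invalid per the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

const tp = parseTraceparent(
  "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
);
console.log(tp?.sampled); // true
```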
What is NOT traced (yet)¶
- Caddy personas: Caddy 2.7+ has a built-in tracing directive, but the caddy-otel module must be compiled into the image via xcaddy. Scheduled for a follow-up PR. Until that ships, the trace ends at the agent's HTTP response: you will see the request leave the agent and complete, but not the breakdown of what the persona did internally
- NGFW spans: NGFW vendors don't typically emit OTLP. The NGFW shows up implicitly as time spent between the agent's HTTP send and the persona's HTTP response. To attribute that time precisely, either configure the NGFW's mirror port to a span-aware collector OR use the SNMP exporter's CPU + queue-depth metrics as a coarse proxy
Tempo storage planning¶
With 10% sampling and a 1000-agent synthetic-load fleet running 100 RPS each:
- Spans/s ingested: 100,000 × 0.1 ≈ 10,000 spans/s
- Average span size: ~500 B
- Bytes/s: ~5 MB/s
- Storage for 24 h: ~430 GB
Defaults in observability/tempo/tempo.yml:
- block_retention: 24h
- max_block_bytes: 100 MiB
- Storage backend: local (PVC)
Increase block_retention to 7d for longer retention; provision the PVC accordingly.
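A sketch of the 7-day retention change, assuming the standard Tempo config layout in which block_retention lives under the compactor block (verify against the shipped observability/tempo/tempo.yml before applying):

```yaml
# observability/tempo/tempo.yml (fragment) — 168h = 7 days
compactor:
  compaction:
    block_retention: 168h
```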
Roadmap¶
| Phase | Status | What it covers |
|---|---|---|
| Tempo OTLP receiver enabled | ✅ shipped (PR-4 #179) | gRPC :4317 + HTTP :4318 |
| Agent OpenTelemetry SDK code | ✅ shipped (this PR) | browser-engine + synthetic-load wrappers |
| Opt-in tracing overlay | ✅ shipped (this PR) | overlays/tracing/ |
| Caddy persona OTel module | ⏳ follow-up | Image rebuild via xcaddy + Caddyfile tracing directive |
| NGFW span correlation | ⏳ research | Vendor-specific; potentially via SNMP queue-depth as a proxy |
| Trace-driven dashboards | ⏳ follow-up | Service map auto-built from spans, Tempo metrics_generator already wired |
See also¶
- MONITORING_TEST_VALIDITY.md — broader observability context
- SYSTEM_OVERVIEW.md — architecture reference
- Grafana Tempo docs
- OpenTelemetry SDK reference
- agent/src/telemetry.ts — actual SDK initialisation code
- scripts/tracing-toggle.sh — opt-in helper (on/off/status)