Distributed tracing — Tempo + OpenTelemetry

Scope status (post-Scope-Freeze 2026-05-10) — See ARCHITECTURE.md for the canonical 37 MÓDULOs + 7 Test Kinds + DOM/CPOS/PIE-PA safety architecture. ADRs 0014, 0019-0025 cover post-Freeze additions.

Status: Tempo OTLP receiver enabled (gRPC :4317 + HTTP :4318). The browser-engine and synthetic-load agents ship OpenTelemetry SDK code (agent/src/telemetry.ts), opt-in via overlays/tracing/. Caddy persona instrumentation is pending a follow-up (requires an xcaddy rebuild with the caddy-otel module).

Why distributed tracing here

The headline measurement of this test bed is the per-request latency experienced when an agent (browser engine or synthetic-load engine) talks to a persona webserver through the NGFW under test. A single request crosses four hops:

browser-engine/synthetic-load agent  →  NGFW (DUT, TLS leg 1)  →  NGFW (TLS leg 2)  →  Caddy persona
               hop 1                            hop 2                    hop 3             hop 4

Without distributed tracing, the operator only sees the black-box end-to-end latency from the agent's perspective. A regression of 50 ms could be in the NGFW, in the Caddy persona, or in the Linux kernel between them — Prometheus metrics alone cannot tell you which.

Distributed tracing surfaces the per-hop breakdown, which is exactly what commercial appliances (Spirent CyberFlood, Ixia BreakingPoint) do not expose. This is a strategic differentiator for this project — and the reason we ship the SDK code at the agent level even though we keep it OPT-IN by default.

Opt-in by default — performance footprint considerations

Distributed tracing is disabled by default. The agents run with their normal memory + CPU footprint until the operator explicitly enables it.

Why opt-in: the agent fleet is designed for high agent counts (browser engine 1–300, synthetic-load engine 1–1000) on lab hardware that may have memory/CPU constraints. We don't want every operator to pay the SDK cost for every test run; we want them to choose to pay it during baseline measurement and deep-dive runs.

Footprint when ENABLED (measured on Ubuntu 22.04 + Node 20):

| Resource                                | Per browser-engine agent | Per synthetic-load agent |
|-----------------------------------------|--------------------------|--------------------------|
| RSS memory                              | +10–15 MB                | +5–10 MB                 |
| CPU (steady state)                      | <1%                      | <1%                      |
| Per-request latency added               | ~50–100 µs               | ~50–100 µs               |
| Per-request overhead vs 200 ms baseline | ~0.05%                   | ~0.05%                   |

Mitigations baked into the overlay (see the sketch after this list):

- Probabilistic sampling at 10% (OTEL_TRACES_SAMPLER_ARG=0.1). At peak fleet (300 browser engines × 1 cycle/5 s = 60 cycles/s), that is ~6 traces/s reaching Tempo
- @opentelemetry/instrumentation-fs disabled — fs spans are noisy and irrelevant here
- HTTP exporter (not gRPC) — keep-alive connections, no extra binary deps
- Bounded BatchSpanProcessor queue (OTEL_BSP_MAX_QUEUE_SIZE=2048) so memory stays predictable
- OTEL_SDK_DISABLED=true kill-switch always honoured
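
A minimal sketch of how those mitigations map onto the Node SDK, assuming the standard @opentelemetry packages; the real wiring lives in agent/src/telemetry.ts and may differ:

import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
} from "@opentelemetry/sdk-trace-base";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

export function startTelemetry(): NodeSDK | undefined {
  // Kill switch: bail out before any SDK object is constructed,
  // so a disabled agent pays no tracing overhead at all.
  if (process.env.OTEL_SDK_DISABLED === "true") return undefined;

  // 10% head sampling by default; child spans inherit the root decision.
  const ratio = Number(process.env.OTEL_TRACES_SAMPLER_ARG ?? "0.1");

  const sdk = new NodeSDK({
    // OTLP over HTTP (port 4318): keep-alive, no gRPC binary deps.
    // The SDK wraps this exporter in a BatchSpanProcessor whose queue
    // is bounded via OTEL_BSP_MAX_QUEUE_SIZE (the overlay sets 2048).
    traceExporter: new OTLPTraceExporter(),
    sampler: new ParentBasedSampler({
      root: new TraceIdRatioBasedSampler(ratio),
    }),
    instrumentations: [
      getNodeAutoInstrumentations({
        // fs instrumentation is noisy and irrelevant for this workload.
        "@opentelemetry/instrumentation-fs": { enabled: false },
      }),
    ],
  });

  sdk.start();
  console.log("[telemetry] OpenTelemetry SDK started");
  return sdk;
}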

How to enable

After bringing up the cluster in any deployment mode:

sudo ./scripts/k8s-install.sh --mode=<single|dual|tri|multi> ...

# Toggle tracing on with the helper script. The script patches the
# running web-agent + k6-agent Deployments with OTEL_* env vars and
# rotates the pods automatically.
./scripts/tracing-toggle.sh on

Override defaults via environment:

TRACING_SAMPLE_RATE=1.0 ./scripts/tracing-toggle.sh on    # 100% sampling
TEMPO_HOST=tempo.observability.svc ./scripts/tracing-toggle.sh on  # custom Tempo

Verify the SDK booted (look for the [telemetry] OpenTelemetry SDK started line):

kubectl logs -n web-agents deploy/web-agent | grep telemetry
kubectl logs -n web-agents deploy/k6-agent  | grep telemetry

Open Tempo UI to see traces:

kubectl port-forward -n web-agents svc/tempo 3200:3200
# then visit http://localhost:3200
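
With the port-forward in place, recent traces can also be pulled programmatically through Tempo's search API using a TraceQL query. A sketch; the service.name value is an assumption, so match whatever OTEL_SERVICE_NAME the overlay actually sets:

// Sketch: list recent web-agent traces via Tempo's search API.
// Assumes the port-forward above is running and that the overlay
// sets service.name=web-agent; adjust to your OTEL_SERVICE_NAME.
const q = encodeURIComponent('{ resource.service.name = "web-agent" }');
const res = await fetch(`http://localhost:3200/api/search?q=${q}`);
const { traces = [] } = await res.json();
for (const t of traces) {
  console.log(t.traceID, t.rootTraceName, `${t.durationMs} ms`);
}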

How to disable

./scripts/tracing-toggle.sh off

This sets OTEL_SDK_DISABLED=true on both Deployments and restarts the pods. The SDK init in agent/src/telemetry.ts returns early when it sees this flag, so no overhead is paid until you toggle back on.

To check current state:

./scripts/tracing-toggle.sh status

What the trace looks like

When enabled, each browser engine cycle becomes a parent span:

[span] web-agent: cycle (1.2s)
  ├─ [span] http: GET https://shop.persona.internal/  (340 ms)
  │   ├─ [span] dns: lookup shop.persona.internal  (2 ms)
  │   ├─ [span] tcp: connect 10.1.1.2:443  (8 ms)
  │   ├─ [span] tls: handshake  (45 ms)   ← agent → NGFW (TLS leg 1)
  │   └─ [span] http: response  (285 ms)  ← NGFW → persona (TLS leg 2 + content)
  ├─ [span] http: GET .../static/main.css  (120 ms)
  └─ ... per-resource child spans
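
The parent cycle span is opened explicitly by the agent; the per-resource child spans come from the HTTP auto-instrumentation. A simplified sketch of that pattern (names hypothetical; the actual span creation lives in the agent sources):

import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("web-agent");

// Hypothetical cycle wrapper: everything awaited inside the active
// span's callback, including instrumented HTTP requests, becomes a
// child span of "cycle".
async function runCycle(url: string): Promise<void> {
  await tracer.startActiveSpan("cycle", async (span) => {
    try {
      await fetch(url); // instrumented request → child span
    } finally {
      span.end();
    }
  });
}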

Trace IDs propagate via the standard W3C traceparent / tracestate headers. The NGFW must preserve these headers, and most modern firewalls do. Some QUIC inspection paths reset headers, though: for HTTP/3 traces, fall back to HTTP/1.1 or run with TLS decrypt ON so the NGFW reconstructs the request.
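
For reference, a propagated header carries version, trace ID, parent span ID, and flags (example values taken from the W3C Trace Context spec):

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01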

What is NOT traced (yet)

  • Caddy personas: Caddy 2.7+ has a built-in tracing directive, but the caddy-otel module must be compiled into the image via xcaddy. Scheduled for a follow-up PR. Until that ships, the trace ends at the agent's HTTP response: you see the request leave the agent and complete, but not the breakdown of what the persona did internally.
  • NGFW spans: NGFW vendors don't typically emit OTLP, so the NGFW shows up implicitly as the time spent between the agent's HTTP send and the persona's HTTP response. To attribute that time precisely, either feed the NGFW's mirror port into a span-aware collector or use the SNMP exporter's CPU and queue-depth metrics as a coarse proxy.

Tempo storage planning

With 10% sampling and a 1000-agent synthetic-load fleet running 100 RPS each:

- Spans/s ingested: 100,000 req/s × 0.1 ≈ 10,000 spans/s
- Average span size: ~500 B
- Ingest rate: 10,000 spans/s × 500 B ≈ 5 MB/s
- Storage for 24 h: 5 MB/s × 86,400 s ≈ 430 GB

Defaults in observability/tempo/tempo.yml:

- block_retention: 24h
- max_block_bytes: 100 MiB
- Storage backend: local (PVC)
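
Those defaults correspond roughly to the following stanza (field paths follow Tempo's documented config schema; verify against the shipped observability/tempo/tempo.yml):

compactor:
  compaction:
    block_retention: 24h          # raise for longer retention
    max_block_bytes: 104857600    # 100 MiB
storage:
  trace:
    backend: local                # PVC-backed local storage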

Increase block_retention to 7d for longer retention; at the same ~5 MB/s ingest that is roughly 3 TB, so provision the PVC accordingly.

Roadmap

| Phase                        | Status                 | What it covers                                              |
|------------------------------|------------------------|-------------------------------------------------------------|
| Tempo OTLP receiver enabled  | ✅ shipped (PR-4 #179) | gRPC :4317 + HTTP :4318                                     |
| Agent OpenTelemetry SDK code | ✅ shipped (this PR)   | browser-engine + synthetic-load wrappers                    |
| Opt-in tracing overlay       | ✅ shipped (this PR)   | overlays/tracing/                                           |
| Caddy persona OTel module    | ⏳ follow-up           | Image rebuild via xcaddy + Caddyfile tracing directive      |
| NGFW span correlation        | ⏳ research            | Vendor-specific; potentially via SNMP queue-depth as a proxy |
| Trace-driven dashboards      | ⏳ follow-up           | Service map auto-built from spans; Tempo metrics_generator already wired |

See also