# Performance & memory tuning
> **Scope status (post-Scope-Freeze 2026-05-10)** — See ARCHITECTURE.md for the canonical 37 MÓDULOs + 7 Test Kinds + DOM/CPOS/PIE-PA safety architecture. ADRs 0014, 0019-0025 cover post-Freeze additions.
This document captures the knobs the agent exposes for operators who need to dial RSS, throughput, or fidelity to match their host / cluster envelope. All knobs are environment variables — there is no recompile required and no functional regression if you leave them at the defaults.
✍️ Author: André Luiz Gallon <agallon@cisco.com>

Audience: SREs running the cluster locally (Docker Compose) or on Kubernetes / k3s.
## TL;DR — recommended defaults

| Profile | `CYCLE_CONCURRENCY` | `BROWSER_RECYCLE_AFTER` | `BROWSER_RECYCLE_AFTER_MS` | `MAX_RESOURCES_PER_CYCLE` | `MAX_RESOURCE_BYTES_INSPECTED` | `THROTTLE_RSS_SOFT_MB` | `THROTTLE_RSS_HARD_MB` |
|---|---|---|---|---|---|---|---|
| Default (laptop / 1 GB cgroup) | 3 | 50 | 900000 (15 min) | 300 | 2097152 (2 MiB) | 600 | 800 |
| Low-memory (≤ 512 MB cgroup) | 1 | 25 | 600000 (10 min) | 200 | 524288 (512 KiB) | 300 | 420 |
| High-fidelity (HAR-like) | 3 | 200 | 3600000 (60 min) | 1000 | 33554432 (32 MiB) | 1500 | 2000 |
The default profile is what `docker-compose.yml` uses out of the box,
and what the Helm chart applies for new clusters. The "low-memory"
profile is appropriate for Raspberry Pi-class hosts. The "high-fidelity"
profile reproduces the original (pre-tuning) behaviour and is meant for
short capture sessions, not long-running fleets.
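All of these knobs follow the usual Node pattern of "environment variable, else baked-in default", which is why switching profiles needs nothing more than exporting the variables before the agent starts. A minimal sketch of that pattern, using the default-profile values from the table above (`intFromEnv` is an illustrative helper, not the agent's actual code):

```ts
// Sketch: "env var or default" knob reading. intFromEnv is illustrative.
function intFromEnv(name: string, fallback: number): number {
  const parsed = Number(process.env[name]);
  return Number.isFinite(parsed) ? parsed : fallback;
}

// Default-profile values from the table above.
const cycleConcurrency = intFromEnv("CYCLE_CONCURRENCY", 3);
const recycleAfter = intFromEnv("BROWSER_RECYCLE_AFTER", 50);
const recycleAfterMs = intFromEnv("BROWSER_RECYCLE_AFTER_MS", 900_000);
```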
## Environment variables — full reference

### Cycle scheduling

| Variable | Default | Effect |
|---|---|---|
| `CYCLE_CONCURRENCY` | 3 | Max parallel browser cycles per agent. Higher = more throughput, but each cycle owns a Chromium tab, so RSS scales linearly. |
| `DEFAULT_CYCLE_INTERVAL_MS` | 30000 | Fallback cadence per target when no per-target value is set. |
| `NAVIGATION_TIMEOUT_MS` | 45000 | Hard timeout for `page.goto()` and friends. |
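Since this document references Playwright APIs elsewhere (`page.goto()`, `request.sizes()`), here is a sketch of how `NAVIGATION_TIMEOUT_MS` could bound navigation; the wiring is an assumption, not the agent's actual code:

```ts
import { chromium } from "playwright";

// Sketch: bounding navigation with NAVIGATION_TIMEOUT_MS (assumed wiring).
const navigationTimeoutMs = Number(process.env.NAVIGATION_TIMEOUT_MS ?? 45_000);

const browser = await chromium.launch();
const page = await browser.newPage();
// Playwright's goto() accepts a per-call timeout in milliseconds.
await page.goto("https://example.com", { timeout: navigationTimeoutMs });
await browser.close();
```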
### Browser pool recycling

| Variable | Default | Effect |
|---|---|---|
| `BROWSER_RECYCLE_AFTER` | 50 | Recycle (close + relaunch) the cached Chromium tree after this many cycles for the same key (e.g. `h2`, `h3:host`). Mitigates Chromium memory drift. `0` disables. |
| `BROWSER_RECYCLE_AFTER_MS` | 900000 (15 min) | Same as above, but by wall-clock age. `0` disables. |
Recycling is performed only when the pool is idle (`inUse === 0`),
so a cycle never observes a half-killed browser.
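A sketch of that idle-only rule, assuming the pool tracks acquisitions and launch time per entry (`PoolEntry` and `maybeRecycle` are illustrative names, not the real `BrowserPool` API):

```ts
// Sketch of the idle-only recycle rule. Names are illustrative.
interface PoolEntry {
  browser: { close(): Promise<void> };
  inUse: number;        // cycles currently holding this browser
  acquisitions: number; // cycles served since launch
  launchedAt: number;   // Date.now() at launch
}

const RECYCLE_AFTER = Number(process.env.BROWSER_RECYCLE_AFTER ?? 50);
const RECYCLE_AFTER_MS = Number(process.env.BROWSER_RECYCLE_AFTER_MS ?? 900_000);

async function maybeRecycle(entry: PoolEntry): Promise<boolean> {
  if (entry.inUse !== 0) return false; // never recycle under a live cycle
  const tooManyCycles = RECYCLE_AFTER > 0 && entry.acquisitions >= RECYCLE_AFTER;
  const tooOld = RECYCLE_AFTER_MS > 0 && Date.now() - entry.launchedAt >= RECYCLE_AFTER_MS;
  if (!tooManyCycles && !tooOld) return false;
  await entry.browser.close(); // caller relaunches lazily on next acquire
  return true;
}
```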
### Per-cycle byte / resource caps

| Variable | Default | Effect |
|---|---|---|
| `MAX_RESOURCES_PER_CYCLE` | 300 | After this many resources have been recorded for one cycle, additional responses are still observed for failure tracking but their metrics are dropped. The dashboard already truncates resource arrays at 500 per run for storage, so 300 sits well within the existing envelope. |
| `MAX_RESOURCE_BYTES_INSPECTED` | 2097152 (2 MiB) | Per-resource ceiling for the byte counter used in the metrics summary. The agent never pulls the response body into Node memory — it reads `content-length` (preferred) or `request.sizes().responseBodySize` and clamps to this cap. |
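A sketch of that read order using Playwright's response APIs; `inspectedBytes` is an illustrative helper, not the agent's real field name:

```ts
import type { Response } from "playwright";

// Sketch: count bytes without buffering. Prefer content-length, fall back
// to request.sizes().responseBodySize, then clamp to the cap.
const MAX_BYTES = Number(process.env.MAX_RESOURCE_BYTES_INSPECTED ?? 2_097_152);

async function inspectedBytes(response: Response): Promise<number> {
  const header = response.headers()["content-length"];
  let bytes = header !== undefined ? Number(header) : NaN;
  if (!Number.isFinite(bytes)) {
    try {
      bytes = (await response.request().sizes()).responseBodySize;
    } catch {
      return 0; // neither source available: degrade to 0 (see trade-off notes)
    }
  }
  return Math.min(Math.max(bytes, 0), MAX_BYTES);
}
```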
### RSS-aware auto-throttle

| Variable | Default | Effect |
|---|---|---|
| `THROTTLE_RSS_SOFT_MB` | 600 | When the container's RSS crosses this value, the effective concurrency is halved. |
| `THROTTLE_RSS_HARD_MB` | 800 | When the container's RSS crosses this value, no new cycles are started. In-flight cycles continue; as they release, the kernel can reclaim Chromium pages. Set `0` to disable both bounds. |
The agent reads container memory from the Linux cgroup filesystem
(`memory.current` on cgroup v2, `memory.usage_in_bytes` on cgroup v1),
which is the same number `docker stats` shows and the same number
the OOM killer uses. On hosts where neither cgroup is detected
(macOS native, BSDs) the agent falls back to Node's
`process.memoryUsage().rss` (degraded but safe: it under-reports
Chromium subprocess memory by ~10×, so the throttle effectively never
fires).
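A sketch of that read order; the paths are the standard mount points, and a real agent may need to resolve them differently (e.g. inside a nested cgroup namespace):

```ts
import { readFileSync } from "node:fs";

// Sketch: cgroup v2 first, then v1, then Node-only RSS as last resort.
function containerRssBytes(): { bytes: number; source: string } {
  const candidates = [
    ["/sys/fs/cgroup/memory.current", "cgroup-v2"],
    ["/sys/fs/cgroup/memory/memory.usage_in_bytes", "cgroup-v1"],
  ] as const;
  for (const [path, source] of candidates) {
    try {
      return { bytes: Number(readFileSync(path, "utf8").trim()), source };
    } catch {
      // File not present on this host; try the next candidate.
    }
  }
  // No cgroup detected (macOS native, BSDs): Node-only RSS, which
  // under-reports Chromium subprocess memory (see above).
  return { bytes: process.memoryUsage().rss, source: "node-rss" };
}
```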
> 💡 **Why cgroup and not `process.memoryUsage().rss`?** Headless Chromium runs in a tree of subprocesses (zygote, GPU, network service, several renderers). Their memory does not appear in Node's own `process.memoryUsage().rss`. The cgroup is the only place where the sum is accounted for. This was a real bug in the v1.1 agent — fixed in v1.2.
The throttle logs a `cycle scheduler throttled by memory pressure`
warning at most once every 30 s, so you can correlate dashboard
memory dips with this event. The log includes both `containerRssMb`
(the value compared against the threshold) and `nodeRssMb` (Node-only),
plus a `source` field saying whether the value came from the cgroup or
the Node fallback.
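Putting the two thresholds together, the scheduler's gate can be sketched as follows; `effectiveConcurrency` is an illustrative name, and the treatment of `0` is this sketch's reading of "set 0 to disable both bounds":

```ts
const SOFT_MB = Number(process.env.THROTTLE_RSS_SOFT_MB ?? 600);
const HARD_MB = Number(process.env.THROTTLE_RSS_HARD_MB ?? 800);
const BASE = Number(process.env.CYCLE_CONCURRENCY ?? 3);

function effectiveConcurrency(containerRssMb: number): number {
  // Assumption: either bound set to 0 turns the throttle off entirely.
  if (SOFT_MB === 0 || HARD_MB === 0) return BASE;
  if (containerRssMb >= HARD_MB) return 0; // hard: start no new cycles
  if (containerRssMb >= SOFT_MB) return Math.max(1, Math.floor(BASE / 2)); // soft: halve
  return BASE;
}
```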
### Logging

| Variable | Default | Effect |
|---|---|---|
| `LOG_LEVEL` | info | One of `debug`, `info`, `warn`, `error`. `debug` is verbose enough that we don't recommend it in production for >5 agents. |
## Heartbeat metrics — what the dashboard sees
Each heartbeat now carries the agent's own resource posture, which the
dashboard exposes via `/api/metrics` (Prometheus) and the agents page:

```json
{
  "status": "idle",
  "browserKey": "h2",
  "browserPid": 1042,
  "browserAcquisitions": 17,
  "browserAgeMs": 412310,
  "browserInUse": 0,
  "rssBytes": 88210000,
  "heapUsedBytes": 41200000,
  "heapTotalBytes": 64200000,
  "externalBytes": 4200000,
  "containerRssBytes": 287612928,
  "cgroupAvailable": true,
  "cgroupLimitBytes": 1073741824
}
```
- `containerRssBytes` is the best signal for "is this agent close to its cgroup limit?". It matches `docker stats` byte-for-byte and the OOM killer's accounting. This is the value the auto-throttle uses.
- `rssBytes` is the Node process alone — useful for spotting JS heap leaks, but not for capacity planning (it ignores Chromium).
- `cgroupAvailable=false` means we're falling back to Node RSS for the throttle (e.g. macOS host without Docker Desktop's Linux VM, or a non-cgroup namespace). Most production hosts will report `true`.
- `cgroupLimitBytes` is your effective container memory ceiling (`docker run --memory=…` / Kubernetes `limits.memory`). `null` when no limit is set.
- `browserInUse > 0` while `browserAgeMs` is large is a healthy state (the pool is still serving cycles); recycling will happen when `inUse` next reaches `0`.
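As a small worked check, here is the pressure ratio an operator could derive from the sample heartbeat above; the `Heartbeat` type and `memoryPressure` helper are illustrative, not part of the dashboard API:

```ts
interface Heartbeat {
  containerRssBytes: number;
  cgroupLimitBytes: number | null;
  cgroupAvailable: boolean;
}

// Fraction of the cgroup ceiling currently in use; null when the host
// gives us no trustworthy container-level number.
function memoryPressure(hb: Heartbeat): number | null {
  if (!hb.cgroupAvailable || hb.cgroupLimitBytes === null) return null;
  return hb.containerRssBytes / hb.cgroupLimitBytes;
}

// For the sample heartbeat above: 287612928 / 1073741824 ≈ 0.27,
// comfortably below the 600 MB soft threshold on a 1 GiB limit.
```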
## How to measure

### Inside Docker Compose

```sh
docker stats --no-stream --format \
  'table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.PIDs}}' \
  | sort -k3 -h
```
A healthy agent at the default profile sits around 220–340 MB RSS
and 40–80 PIDs under sustained load. The hard cgroup ceiling in
docker-compose.yml is 1g / 500 PIDs.
### Inside Kubernetes / k3s

```sh
kubectl top pods -n web-agent-cluster --sort-by=memory
```

The agent Deployment ships with `resources.limits.memory=1Gi` and
`resources.requests.memory=384Mi` by default.
### Long-running soak (one-liner)

```sh
docker compose up -d agent --scale agent=10
watch -n 5 'docker stats --no-stream --format "{{.Name}}\t{{.MemUsage}}" | grep agent | head -n 10'
```
Expected behaviour over 30 minutes:
- RSS bumps up after each browser launch (≈ +60 MB), then drops back to baseline after the recycle window.
- PID count drifts to ≈ 80 then collapses to ≈ 12 when the recycle fires.
If you observe monotonically growing RSS for more than 2 hours without
a plateau, please open an issue attaching
`docker compose logs agent | tail -n 200` and your env block.
## Trade-off notes

- **Streaming bytes vs. exact bytes.** When `content-length` is missing AND `request.sizes()` throws, the byte counter degrades to `0`. Metrics summaries (`transferredBytes`, `mbpsAvg`) under-count in this case, but the resource is still recorded with its other metadata (status, type, TLS version, …). This is acceptable because:
    - Most modern HTTP/1.1 and all HTTP/2/3 servers populate `content-length` or send a `transfer-encoding: chunked` whose transfer size the browser engine does report.
    - The dashboard's `mbpsAvg` is averaged across many cycles; an occasional `0` has no visible impact on the chart.
- **Resource cap at 300.** Pages with > 300 sub-resources (typical of ad-heavy news sites) will report `resourceCount=300`. The `failedResourceCount` is unaffected (we count failures from the `requestfailed` event, not from response handling).
- **Recycle window of 15 min.** Aggressive recycling can briefly double Chromium memory at the moment a new tree is launched while the old one is being torn down. The mutex in `BrowserPool` serialises this, so the spike is bounded to a single launch's worth of memory (≈ 80 MB); a sketch of that serialisation follows this list.
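A minimal promise-chain mutex shows why serialising relaunches bounds the spike: recycles queue behind one another, so at most one old-plus-new pair of Chromium trees exists at any moment. This is a sketch, not the real `BrowserPool` internals.

```ts
// Minimal promise-chain mutex (sketch, not the real BrowserPool internals).
class Mutex {
  private tail: Promise<void> = Promise.resolve();

  runExclusive<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(task);
    // Keep the chain alive even when a task rejects.
    this.tail = result.then(() => undefined, () => undefined);
    return result;
  }
}

const relaunchMutex = new Mutex();

// Every recycle funnels through the mutex, so the ≈ 80 MB overlap of an
// old tree shutting down and a new one launching happens at most once
// at any given moment.
await relaunchMutex.runExclusive(async () => {
  // await oldBrowser.close(); launch the replacement; ...
});
```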
## Changelog
- 2026-04-26 — initial version. Default profile reduces sustained RSS by ~45 % vs. the v1.0 release that buffered every response body.