# Performance & memory tuning
> **Scope status (post-Scope-Freeze 2026-05-10)** — See ARCHITECTURE.md for the canonical 37 MÓDULOs + 7 Test Kinds + DOM/CPOS/PIE-PA safety architecture. ADRs 0014, 0019-0025 cover post-Freeze additions.
This document captures the knobs the agent exposes for operators who need to dial RSS, throughput, or fidelity to match their host / cluster envelope. All knobs are environment variables — there is no recompile required and no functional regression if you leave them at the defaults.
✍️ Author: André Luiz Gallon <agallon@cisco.com>

Audience: SREs running the cluster locally (Docker Compose) or on Kubernetes / k3s.
## TL;DR — recommended defaults

| Profile | `CYCLE_CONCURRENCY` | `BROWSER_RECYCLE_AFTER` | `BROWSER_RECYCLE_AFTER_MS` | `MAX_RESOURCES_PER_CYCLE` | `MAX_RESOURCE_BYTES_INSPECTED` | `THROTTLE_RSS_SOFT_MB` | `THROTTLE_RSS_HARD_MB` |
|---|---|---|---|---|---|---|---|
| Default (laptop / 1 GB cgroup) | 3 | 50 | 900000 (15 min) | 300 | 2097152 (2 MiB) | 600 | 800 |
| Low-memory (≤ 512 MB cgroup) | 1 | 25 | 600000 (10 min) | 200 | 524288 (512 KiB) | 300 | 420 |
| High-fidelity (HAR-like) | 3 | 200 | 3600000 (60 min) | 1000 | 33554432 (32 MiB) | 1500 | 2000 |
The default profile is what `docker-compose.yml` uses out of the box,
and what the Helm chart applies for new clusters. The "low-memory"
profile is appropriate for Raspberry Pi-class hosts. The "high-fidelity"
profile reproduces the original (pre-tuning) behaviour and is meant for
short capture sessions, not long-running fleets.
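All of these knobs follow the usual Node pattern of "environment variable, else baked-in default", which is why switching profiles needs nothing more than exporting the variables before the agent starts. A minimal sketch of that pattern, using the default-profile values from the table above (`intFromEnv` is an illustrative helper, not the agent's actual code):

```ts
// Sketch: "env var or default" knob reading. intFromEnv is illustrative.
function intFromEnv(name: string, fallback: number): number {
  const parsed = Number(process.env[name]);
  return Number.isFinite(parsed) ? parsed : fallback;
}

// Default-profile values from the table above.
const cycleConcurrency = intFromEnv("CYCLE_CONCURRENCY", 3);
const recycleAfter = intFromEnv("BROWSER_RECYCLE_AFTER", 50);
const recycleAfterMs = intFromEnv("BROWSER_RECYCLE_AFTER_MS", 900_000);
```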
## Environment variables — full reference

### Cycle scheduling

| Variable | Default | Effect |
|---|---|---|
| `CYCLE_CONCURRENCY` | 3 | Max parallel browser cycles per agent. Higher = more throughput, but each cycle owns a Chromium tab, so RSS scales linearly. |
| `DEFAULT_CYCLE_INTERVAL_MS` | 30000 | Fallback cadence per target when no per-target value is set. |
| `NAVIGATION_TIMEOUT_MS` | 45000 | Hard timeout for `page.goto()` and friends. |
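Since this document references Playwright APIs elsewhere (`page.goto()`, `request.sizes()`), here is a sketch of how `NAVIGATION_TIMEOUT_MS` could bound navigation; the wiring is an assumption, not the agent's actual code:

```ts
import { chromium } from "playwright";

// Sketch: bounding navigation with NAVIGATION_TIMEOUT_MS (assumed wiring).
const navigationTimeoutMs = Number(process.env.NAVIGATION_TIMEOUT_MS ?? 45_000);

const browser = await chromium.launch();
const page = await browser.newPage();
// Playwright's goto() accepts a per-call timeout in milliseconds.
await page.goto("https://example.com", { timeout: navigationTimeoutMs });
await browser.close();
```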
### Browser pool recycling

| Variable | Default | Effect |
|---|---|---|
| `BROWSER_RECYCLE_AFTER` | 50 | Recycle (close + relaunch) the cached Chromium tree after this many cycles for the same key (e.g. `h2`, `h3:host`). Mitigates Chromium memory drift. `0` disables. |
| `BROWSER_RECYCLE_AFTER_MS` | 900000 (15 min) | Same as above, but by wall-clock age. `0` disables. |
Recycling is performed only when the pool is idle (`inUse === 0`),
so a cycle never observes a half-killed browser.
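A sketch of that idle-only rule, assuming the pool tracks acquisitions and launch time per entry (`PoolEntry` and `maybeRecycle` are illustrative names, not the real `BrowserPool` API):

```ts
// Sketch of the idle-only recycle rule. Names are illustrative.
interface PoolEntry {
  browser: { close(): Promise<void> };
  inUse: number;        // cycles currently holding this browser
  acquisitions: number; // cycles served since launch
  launchedAt: number;   // Date.now() at launch
}

const RECYCLE_AFTER = Number(process.env.BROWSER_RECYCLE_AFTER ?? 50);
const RECYCLE_AFTER_MS = Number(process.env.BROWSER_RECYCLE_AFTER_MS ?? 900_000);

async function maybeRecycle(entry: PoolEntry): Promise<boolean> {
  if (entry.inUse !== 0) return false; // never recycle under a live cycle
  const tooManyCycles = RECYCLE_AFTER > 0 && entry.acquisitions >= RECYCLE_AFTER;
  const tooOld = RECYCLE_AFTER_MS > 0 && Date.now() - entry.launchedAt >= RECYCLE_AFTER_MS;
  if (!tooManyCycles && !tooOld) return false;
  await entry.browser.close(); // caller relaunches lazily on next acquire
  return true;
}
```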
### Per-cycle byte / resource caps

| Variable | Default | Effect |
|---|---|---|
| `MAX_RESOURCES_PER_CYCLE` | 300 | After this many resources have been recorded for one cycle, additional responses are still observed for failure tracking but their metrics are dropped. The dashboard already truncates resource arrays at 500 per run for storage, so 300 sits well within the existing envelope. |
| `MAX_RESOURCE_BYTES_INSPECTED` | 2097152 (2 MiB) | Per-resource ceiling for the byte counter used in the metrics summary. The agent never pulls the response body into Node memory — it reads `content-length` (preferred) or `request.sizes().responseBodySize` and clamps to this cap. |
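A sketch of that read order using Playwright's response APIs; `inspectedBytes` is an illustrative helper, not the agent's real field name:

```ts
import type { Response } from "playwright";

// Sketch: count bytes without buffering. Prefer content-length, fall back
// to request.sizes().responseBodySize, then clamp to the cap.
const MAX_BYTES = Number(process.env.MAX_RESOURCE_BYTES_INSPECTED ?? 2_097_152);

async function inspectedBytes(response: Response): Promise<number> {
  const header = response.headers()["content-length"];
  let bytes = header !== undefined ? Number(header) : NaN;
  if (!Number.isFinite(bytes)) {
    try {
      bytes = (await response.request().sizes()).responseBodySize;
    } catch {
      return 0; // neither source available: degrade to 0 (see trade-off notes)
    }
  }
  return Math.min(Math.max(bytes, 0), MAX_BYTES);
}
```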
### RSS-aware auto-throttle

| Variable | Default | Effect |
|---|---|---|
| `THROTTLE_RSS_SOFT_MB` | 600 | When the container's RSS crosses this value, the effective concurrency is halved. |
| `THROTTLE_RSS_HARD_MB` | 800 | When the container's RSS crosses this value, no new cycles are started. In-flight cycles continue; as they release, the kernel can reclaim Chromium pages. Set `0` to disable both bounds. |
The agent reads container memory from the Linux cgroup filesystem
(`memory.current` on cgroup v2, `memory.usage_in_bytes` on cgroup v1),
which is the same number `docker stats` shows and the same number
the OOM killer uses. On hosts where neither cgroup is detected
(macOS native, BSDs) the agent falls back to Node's
`process.memoryUsage().rss` (degraded but safe: it under-reports
Chromium subprocess memory by ~10×, so the throttle effectively never
fires).
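A sketch of that read order; the paths are the standard mount points, and a real agent may need to resolve them differently (e.g. inside a nested cgroup namespace):

```ts
import { readFileSync } from "node:fs";

// Sketch: cgroup v2 first, then v1, then Node-only RSS as last resort.
function containerRssBytes(): { bytes: number; source: string } {
  const candidates = [
    ["/sys/fs/cgroup/memory.current", "cgroup-v2"],
    ["/sys/fs/cgroup/memory/memory.usage_in_bytes", "cgroup-v1"],
  ] as const;
  for (const [path, source] of candidates) {
    try {
      return { bytes: Number(readFileSync(path, "utf8").trim()), source };
    } catch {
      // File not present on this host; try the next candidate.
    }
  }
  // No cgroup detected (macOS native, BSDs): Node-only RSS, which
  // under-reports Chromium subprocess memory (see above).
  return { bytes: process.memoryUsage().rss, source: "node-rss" };
}
```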
> 💡 **Why cgroup and not `process.memoryUsage().rss`?** Headless Chromium runs in a tree of subprocesses (zygote, GPU, network service, several renderers). Their memory does not appear in Node's own `process.memoryUsage().rss`. The cgroup is the only place where the sum is accounted for. This was a real bug in the v1.1 agent — fixed in v1.2.
The throttle logs a `cycle scheduler throttled by memory pressure`
warning at most once every 30 s, so you can correlate dashboard
memory dips with this event. The log includes both `containerRssMb`
(the value compared against the threshold) and `nodeRssMb` (Node-only),
plus a `source` field saying whether the value came from the cgroup or
the Node fallback.
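Putting the two thresholds together, the scheduler's gate can be sketched as follows; `effectiveConcurrency` is an illustrative name, and the treatment of `0` is this sketch's reading of "set 0 to disable both bounds":

```ts
const SOFT_MB = Number(process.env.THROTTLE_RSS_SOFT_MB ?? 600);
const HARD_MB = Number(process.env.THROTTLE_RSS_HARD_MB ?? 800);
const BASE = Number(process.env.CYCLE_CONCURRENCY ?? 3);

function effectiveConcurrency(containerRssMb: number): number {
  // Assumption: either bound set to 0 turns the throttle off entirely.
  if (SOFT_MB === 0 || HARD_MB === 0) return BASE;
  if (containerRssMb >= HARD_MB) return 0; // hard: start no new cycles
  if (containerRssMb >= SOFT_MB) return Math.max(1, Math.floor(BASE / 2)); // soft: halve
  return BASE;
}
```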
### Logging

| Variable | Default | Effect |
|---|---|---|
| `LOG_LEVEL` | info | One of `debug`, `info`, `warn`, `error`. `debug` is verbose enough that we don't recommend it in production for >5 agents. |
## Heartbeat metrics — what the dashboard sees
Each heartbeat now carries the agent's own resource posture, which the
dashboard exposes via `/api/metrics` (Prometheus) and the agents page:

```json
{
  "status": "idle",
  "browserKey": "h2",
  "browserPid": 1042,
  "browserAcquisitions": 17,
  "browserAgeMs": 412310,
  "browserInUse": 0,
  "rssBytes": 88210000,
  "heapUsedBytes": 41200000,
  "heapTotalBytes": 64200000,
  "externalBytes": 4200000,
  "containerRssBytes": 287612928,
  "cgroupAvailable": true,
  "cgroupLimitBytes": 1073741824
}
```
- `containerRssBytes` is the best signal for "is this agent close to its cgroup limit?". It matches `docker stats` byte-for-byte and the OOM killer's accounting. This is the value the auto-throttle uses.
- `rssBytes` is the Node process alone — useful for spotting JS heap leaks, but not for capacity planning (it ignores Chromium).
- `cgroupAvailable=false` means we're falling back to Node RSS for the throttle (e.g. macOS host without Docker Desktop's Linux VM, or a non-cgroup namespace). Most production hosts will report `true`.
- `cgroupLimitBytes` is your effective container memory ceiling (`docker run --memory=…` / Kubernetes `limits.memory`). `null` when no limit is set.
- `browserInUse > 0` while `browserAgeMs` is large is a healthy state (the pool is still serving cycles); recycling will happen when `inUse` next reaches `0`.
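As a small worked check, here is the pressure ratio an operator could derive from the sample heartbeat above; the `Heartbeat` type and `memoryPressure` helper are illustrative, not part of the dashboard API:

```ts
interface Heartbeat {
  containerRssBytes: number;
  cgroupLimitBytes: number | null;
  cgroupAvailable: boolean;
}

// Fraction of the cgroup ceiling currently in use; null when the host
// gives us no trustworthy container-level number.
function memoryPressure(hb: Heartbeat): number | null {
  if (!hb.cgroupAvailable || hb.cgroupLimitBytes === null) return null;
  return hb.containerRssBytes / hb.cgroupLimitBytes;
}

// For the sample heartbeat above: 287612928 / 1073741824 ≈ 0.27,
// comfortably below the 600 MB soft threshold on a 1 GiB limit.
```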
## How to measure

### Inside Docker Compose

```sh
docker stats --no-stream --format \
  'table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.PIDs}}' \
  | sort -k3 -h
```
A healthy agent at the default profile sits around 220–340 MB RSS
and 40–80 PIDs under sustained load. The hard cgroup ceiling in
docker-compose.yml is 1g / 500 PIDs.
### Inside Kubernetes / k3s

```sh
kubectl top pods -n web-agent-cluster --sort-by=memory
```

The agent Deployment ships with `resources.limits.memory=1Gi` and
`resources.requests.memory=384Mi` by default.
### Long-running soak (one-liner)

```sh
docker compose up -d agent --scale agent=10
watch -n 5 'docker stats --no-stream --format "{{.Name}}\t{{.MemUsage}}" | grep agent | head -n 10'
```
Expected behaviour over 30 minutes:
- RSS bumps up after each browser launch (≈ +60 MB), then drops back to baseline after the recycle window.
- PID count drifts to ≈ 80 then collapses to ≈ 12 when the recycle fires.
If you observe monotonically growing RSS for more than 2 hours without
a plateau, please open an issue attaching
`docker compose logs agent | tail -n 200` and your env block.
## Trade-off notes

- **Streaming bytes vs. exact bytes.** When `content-length` is missing AND `request.sizes()` throws, the byte counter degrades to `0`. Metrics summaries (`transferredBytes`, `mbpsAvg`) under-count in this case, but the resource is still recorded with its other metadata (status, type, TLS version, …). This is acceptable because:
    - Most modern HTTP/1.1 and all HTTP/2/3 servers populate `content-length` or send a `transfer-encoding: chunked` whose transfer size the browser engine does report.
    - The dashboard's `mbpsAvg` is averaged across many cycles; an occasional `0` has no visible impact on the chart.
- **Resource cap at 300.** Pages with > 300 sub-resources (typical of ad-heavy news sites) will report `resourceCount=300`. The `failedResourceCount` is unaffected (we count failures from the `requestfailed` event, not from response handling).
- **Recycle window of 15 min.** Aggressive recycling can briefly double Chromium memory at the moment a new tree is launched while the old one is being torn down. The mutex in `BrowserPool` serialises this, so the spike is bounded to a single launch's worth of memory (≈ 80 MB); a sketch of that serialisation follows this list.
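A minimal promise-chain mutex shows why serialising relaunches bounds the spike: recycles queue behind one another, so at most one old-plus-new pair of Chromium trees exists at any moment. This is a sketch, not the real `BrowserPool` internals.

```ts
// Minimal promise-chain mutex (sketch, not the real BrowserPool internals).
class Mutex {
  private tail: Promise<void> = Promise.resolve();

  runExclusive<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(task);
    // Keep the chain alive even when a task rejects.
    this.tail = result.then(() => undefined, () => undefined);
    return result;
  }
}

const relaunchMutex = new Mutex();

// Every recycle funnels through the mutex, so the ≈ 80 MB overlap of an
// old tree shutting down and a new one launching happens at most once
// at any given moment.
await relaunchMutex.runExclusive(async () => {
  // await oldBrowser.close(); launch the replacement; ...
});
```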
## Changelog
- 2026-04-26 — initial version. Default profile reduces sustained RSS by ~45 % vs. the v1.0 release that buffered every response body.