
Performance & memory tuning

Scope status (post-Scope-Freeze 2026-05-10) — See ARCHITECTURE.md for the canonical 37 MÓDULOs + 7 Test Kinds + DOM/CPOS/PIE-PA safety architecture. ADRs 0014, 0019-0025 cover post-Freeze additions.

This document captures the knobs the agent exposes for operators who need to dial RSS, throughput, or fidelity to match their host / cluster envelope. All knobs are environment variables — no recompile is required, and leaving them at the defaults causes no functional regression.

✍️ Author: André Luiz Gallon <agallon@cisco.com>
Audience: SREs running the cluster locally (Docker Compose) or on Kubernetes / k3s.


| Profile | CYCLE_CONCURRENCY | BROWSER_RECYCLE_AFTER | BROWSER_RECYCLE_AFTER_MS | MAX_RESOURCES_PER_CYCLE | MAX_RESOURCE_BYTES_INSPECTED | THROTTLE_RSS_SOFT_MB | THROTTLE_RSS_HARD_MB |
|---|---|---|---|---|---|---|---|
| Default (laptop / 1 GB cgroup) | 3 | 50 | 900000 (15 min) | 300 | 2097152 (2 MiB) | 600 | 800 |
| Low-memory (≤ 512 MB cgroup) | 1 | 25 | 600000 (10 min) | 200 | 524288 (512 KiB) | 300 | 420 |
| High-fidelity (HAR-like) | 3 | 200 | 3600000 (60 min) | 1000 | 33554432 (32 MiB) | 1500 | 2000 |

The default profile is what docker-compose.yml uses out of the box, and what the Helm chart applies for new clusters. The "low-memory" profile is appropriate for Raspberry Pi-class hosts. The "high-fidelity" profile reproduces the original (pre-tuning) behaviour and is meant for short capture sessions, not long-running fleets.
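For orientation, here is a minimal sketch, in TypeScript, of how an agent process might read these knobs with the default-profile values as fallbacks. The variable names are the ones from the table above; the helper itself and its handling of malformed values are illustrative, not the agent's actual config module.

```ts
// Illustrative helper: read an integer knob from the environment,
// falling back to the default-profile value when unset or malformed.
function intFromEnv(name: string, fallback: number): number {
  const raw = process.env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  return Number.isFinite(parsed) && parsed >= 0 ? parsed : fallback;
}

// Default-profile values from the table above.
export const tuning = {
  cycleConcurrency: intFromEnv("CYCLE_CONCURRENCY", 3),
  browserRecycleAfter: intFromEnv("BROWSER_RECYCLE_AFTER", 50),
  browserRecycleAfterMs: intFromEnv("BROWSER_RECYCLE_AFTER_MS", 900_000),
  maxResourcesPerCycle: intFromEnv("MAX_RESOURCES_PER_CYCLE", 300),
  maxResourceBytesInspected: intFromEnv("MAX_RESOURCE_BYTES_INSPECTED", 2_097_152),
  throttleRssSoftMb: intFromEnv("THROTTLE_RSS_SOFT_MB", 600),
  throttleRssHardMb: intFromEnv("THROTTLE_RSS_HARD_MB", 800),
};
```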


Environment variables — full reference

Cycle scheduling

| Variable | Default | Effect |
|---|---|---|
| CYCLE_CONCURRENCY | 3 | Max parallel browser cycles per agent. Higher = more throughput, but each cycle owns a Chromium tab so RSS scales linearly. |
| DEFAULT_CYCLE_INTERVAL_MS | 30000 | Fallback cadence per target when no per-target value is set. |
| NAVIGATION_TIMEOUT_MS | 45000 | Hard timeout for page.goto() and friends. |
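To make the CYCLE_CONCURRENCY semantics concrete, a minimal counting-semaphore sketch is shown below. The class and method names are hypothetical; the real scheduler is more involved and also consults the RSS throttle described later in this document.

```ts
// Hypothetical gate: at most `limit` cycles run concurrently.
class CycleGate {
  private active = 0;
  private waiters: Array<() => void> = [];

  constructor(private readonly limit: number) {}

  private async acquire(): Promise<void> {
    // Re-check after every wake-up so the limit holds even under races.
    while (this.active >= this.limit) {
      await new Promise<void>((resolve) => this.waiters.push(resolve));
    }
    this.active++;
  }

  private release(): void {
    this.active--;
    this.waiters.shift()?.(); // wake the next queued cycle, if any
  }

  async run<T>(cycle: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await cycle();
    } finally {
      this.release();
    }
  }
}

// e.g. const gate = new CycleGate(Number(process.env.CYCLE_CONCURRENCY ?? "3"));
```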

Browser pool recycling

| Variable | Default | Effect |
|---|---|---|
| BROWSER_RECYCLE_AFTER | 50 | Recycle (close + relaunch) the cached Chromium tree after this many cycles for the same key (e.g. h2, h3:host). Mitigates Chromium memory drift. 0 disables. |
| BROWSER_RECYCLE_AFTER_MS | 900000 (15 min) | Same as above but by wall-clock age. 0 disables. |

Recycling is performed only when the pool is idle (inUse === 0), so a cycle never observes a half-killed browser.
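A sketch of that recycle decision, assuming a pool entry that tracks acquisitions, launch time, and an in-use counter. The field and function names here are illustrative rather than the actual BrowserPool API.

```ts
// Illustrative shape of a pooled browser entry; not the actual BrowserPool types.
interface PoolEntry {
  browser: { close(): Promise<void> }; // e.g. a Playwright Browser
  acquisitions: number;                // cycles served since launch
  launchedAt: number;                  // Date.now() at launch
  inUse: number;                       // cycles currently holding this browser
}

const RECYCLE_AFTER = Number(process.env.BROWSER_RECYCLE_AFTER ?? "50");
const RECYCLE_AFTER_MS = Number(process.env.BROWSER_RECYCLE_AFTER_MS ?? "900000");

function shouldRecycle(entry: PoolEntry): boolean {
  if (entry.inUse !== 0) return false; // only recycle while idle
  const byCount = RECYCLE_AFTER > 0 && entry.acquisitions >= RECYCLE_AFTER;
  const byAge = RECYCLE_AFTER_MS > 0 && Date.now() - entry.launchedAt >= RECYCLE_AFTER_MS;
  return byCount || byAge; // 0 disables the corresponding bound
}

// Called on release, i.e. right after inUse drops. The real pool serialises the
// close + relaunch so concurrent cycles never observe a dying browser.
async function maybeRecycle(entry: PoolEntry): Promise<void> {
  if (shouldRecycle(entry)) {
    await entry.browser.close(); // a fresh tree is launched on the next acquire
  }
}
```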

Per-cycle byte / resource caps

| Variable | Default | Effect |
|---|---|---|
| MAX_RESOURCES_PER_CYCLE | 300 | After this many resources have been recorded for one cycle, additional responses are observed for failure tracking but their metrics are dropped. The dashboard already truncates resource arrays at 500 / run for storage, so 300 sits well within the existing envelope. |
| MAX_RESOURCE_BYTES_INSPECTED | 2097152 (2 MiB) | Per-resource ceiling for the byte counter used in the metrics summary. The agent never pulls the response body into Node memory — it reads content-length (preferred) or request.sizes().responseBodySize and clamps to this cap. |
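The byte-counting rule can be sketched against Playwright's Response/Request API (which the request.sizes() call above points at) roughly as follows; the surrounding names and error handling are illustrative, not the agent's actual code.

```ts
import type { Response } from "playwright";

const MAX_RESOURCE_BYTES = Number(process.env.MAX_RESOURCE_BYTES_INSPECTED ?? "2097152");

// Bytes credited to one resource: content-length when present, otherwise the
// engine-reported body size, clamped to the cap. The body is never read into Node.
async function countedBytes(response: Response): Promise<number> {
  const contentLength = (await response.allHeaders())["content-length"];
  let bytes = contentLength !== undefined ? Number(contentLength) : NaN;

  if (!Number.isFinite(bytes) || bytes < 0) {
    try {
      bytes = (await response.request().sizes()).responseBodySize;
    } catch {
      bytes = 0; // degraded case described under "Trade-off notes"
    }
  }
  return Math.min(Math.max(bytes, 0), MAX_RESOURCE_BYTES);
}
```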

RSS-aware auto-throttle

| Variable | Default | Effect |
|---|---|---|
| THROTTLE_RSS_SOFT_MB | 600 | When the container's RSS crosses this value, the effective concurrency is halved. |
| THROTTLE_RSS_HARD_MB | 800 | When the container's RSS crosses this value, no new cycles are started. In-flight cycles continue, and their release lets the kernel reclaim Chromium pages. Set 0 to disable both bounds. |

The agent reads container memory from the Linux cgroup filesystem — memory.current (cgroup v2) or memory.usage_in_bytes (cgroup v1) — which is the same number docker stats shows and the same number the OOM killer uses. On hosts where neither cgroup is detected (macOS native, BSDs) the agent falls back to Node's process.memoryUsage().rss (degraded but safe — it under-reports Chromium subprocess memory by ~10× so the throttle effectively never fires).
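That lookup can be sketched as follows; the paths are the standard cgroup v2 and v1 locations named above, and the function name is illustrative.

```ts
import { readFileSync } from "node:fs";

type RssSource = "cgroup" | "node";

// Best-effort container RSS: cgroup v2, then cgroup v1, then Node-only RSS.
function containerRss(): { bytes: number; source: RssSource } {
  const candidates = [
    "/sys/fs/cgroup/memory.current",               // cgroup v2
    "/sys/fs/cgroup/memory/memory.usage_in_bytes", // cgroup v1
  ];
  for (const path of candidates) {
    try {
      const value = Number(readFileSync(path, "utf8").trim());
      if (Number.isFinite(value) && value > 0) return { bytes: value, source: "cgroup" };
    } catch {
      // file absent (macOS, BSDs, non-cgroup namespace): try the next candidate
    }
  }
  // Degraded fallback: Node's own RSS, which excludes Chromium subprocesses.
  return { bytes: process.memoryUsage().rss, source: "node" };
}
```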

💡 Why cgroup and not process.memoryUsage().rss? Headless Chromium runs in a tree of subprocesses (zygote, GPU, network service, several renderers). Their memory does not appear in Node's own process.memoryUsage().rss. The cgroup is the only place where the whole tree's memory is accounted for. This was a real bug in the v1.1 agent — fixed in v1.2.

The throttle logs a "cycle scheduler throttled by memory pressure" warning at most once every 30 s, so you can correlate dashboard memory dips with this event. The log includes both containerRssMb (what's compared against the threshold) and nodeRssMb (Node-only) plus a source field saying whether the value came from cgroup or the Node fallback.
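Putting the two thresholds and the rate-limited warning together, a simplified scheduler-side sketch might look like this; the message string matches the one quoted above, while the names and the exact halving rule are illustrative.

```ts
const SOFT_MB = Number(process.env.THROTTLE_RSS_SOFT_MB ?? "600");
const HARD_MB = Number(process.env.THROTTLE_RSS_HARD_MB ?? "800");
const LOG_EVERY_MS = 30_000;

let lastThrottleLogAt = 0;

// How many cycles may start right now, given the configured concurrency and the
// current container RSS in MB. A threshold of 0 disables the throttle entirely.
function effectiveConcurrency(configured: number, containerRssMb: number): number {
  if (SOFT_MB === 0 || HARD_MB === 0) return configured;

  let allowed = configured;
  if (containerRssMb >= HARD_MB) {
    allowed = 0;                                       // stop starting new cycles
  } else if (containerRssMb >= SOFT_MB) {
    allowed = Math.max(1, Math.floor(configured / 2)); // halve the concurrency
  }

  if (allowed < configured && Date.now() - lastThrottleLogAt >= LOG_EVERY_MS) {
    lastThrottleLogAt = Date.now();
    console.warn("cycle scheduler throttled by memory pressure", { containerRssMb });
  }
  return allowed;
}
```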

Logging

| Variable | Default | Effect |
|---|---|---|
| LOG_LEVEL | info | One of debug, info, warn, error. debug is verbose enough that we don't recommend it in production for >5 agents. |

Heartbeat metrics — what the dashboard sees

Each heartbeat now carries the agent's own resource posture, which the dashboard exposes via /api/metrics (Prometheus) and the agents page (a typed sketch of this payload follows the field notes below):

{
  "status": "idle",
  "browserKey": "h2",
  "browserPid": 1042,
  "browserAcquisitions": 17,
  "browserAgeMs": 412310,
  "browserInUse": 0,
  "rssBytes": 88210000,
  "heapUsedBytes": 41200000,
  "heapTotalBytes": 64200000,
  "externalBytes": 4200000,
  "containerRssBytes": 287612928,
  "cgroupAvailable": true,
  "cgroupLimitBytes": 1073741824
}
  • containerRssBytes is the best signal for "is this agent close to its cgroup limit?". It matches docker stats byte-for-byte and the OOM killer's accounting. This is the value the auto-throttle uses.
  • rssBytes is the Node process alone — useful for spotting JS heap leaks, but not for capacity planning (it ignores Chromium).
  • cgroupAvailable=false means we're falling back to Node RSS for the throttle (e.g. macOS host without Docker Desktop's Linux VM, or a non-cgroup namespace). Most production hosts will report true.
  • cgroupLimitBytes is your effective container memory ceiling (docker run --memory=… / Kubernetes limits.memory). null when no limit is set.
  • browserInUse > 0 while browserAgeMs is large is a healthy state (the pool is still serving cycles); recycling will happen when inUse next reaches 0.
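For heartbeat consumers, the payload maps onto a type along these lines. The field names mirror the JSON above; the interface name and the status union are assumptions.

```ts
// Assumed shape of the heartbeat's resource fields; names mirror the JSON above.
interface AgentResourcePosture {
  status: "idle" | "busy";          // agent lifecycle state at heartbeat time
  browserKey: string;               // pool key, e.g. "h2" or "h3:host"
  browserPid: number;
  browserAcquisitions: number;      // cycles served by the current Chromium tree
  browserAgeMs: number;
  browserInUse: number;             // cycles currently holding the browser
  rssBytes: number;                 // Node process only (ignores Chromium)
  heapUsedBytes: number;
  heapTotalBytes: number;
  externalBytes: number;
  containerRssBytes: number;        // cgroup value; what the auto-throttle compares
  cgroupAvailable: boolean;         // false when falling back to Node RSS
  cgroupLimitBytes: number | null;  // null when no container memory limit is set
}
```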

How to measure

Inside Docker Compose

docker stats --no-stream --format \
  'table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.PIDs}}' \
  | sort -k3 -h

A healthy agent at the default profile sits around 220–340 MB RSS and 40–80 PIDs under sustained load. The hard cgroup ceiling in docker-compose.yml is 1g / 500 PIDs.

Inside Kubernetes / k3s

kubectl top pods -n web-agent-cluster --sort-by=memory

The agent Deployment ships with resources.limits.memory=1Gi and resources.requests.memory=384Mi by default.

Long-running soak (one-liner)

docker compose up -d agent --scale agent=10
watch -n 5 'docker stats --no-stream --format "{{.Name}}\t{{.MemUsage}}" |
            grep agent | head -n 10'

Expected behaviour over 30 minutes:

  • RSS bumps up after each browser launch (≈ +60 MB), then drops back to baseline after the recycle window.
  • PID count drifts to ≈ 80 then collapses to ≈ 12 when the recycle fires.

If you observe monotonically growing RSS for >2 hours without a plateau, please open an issue with docker compose logs agent | tail -n 200 attached and your env block.


Trade-off notes

  • Streaming bytes vs. exact bytes. When content-length is missing and request.sizes() throws, the byte counter degrades to 0. Metrics summaries (transferredBytes, mbpsAvg) under-count in this case, but the resource is still recorded with its other metadata (status, type, TLS version, …). This is acceptable because:
      • Most modern HTTP/1.1 servers and all HTTP/2/3 servers either populate content-length or use transfer-encoding: chunked, whose transfer size the browser engine does report.
      • The dashboard's mbpsAvg is averaged across many cycles; an occasional 0 has no visible impact on the chart.
  • Resource cap at 300. Pages with more than 300 sub-resources (typical of ad-heavy news sites) will report resourceCount=300. The failedResourceCount is unaffected (failures are counted from the requestfailed event, not from response handling).
  • Recycle window of 15 min. Aggressive recycling can briefly double Chromium memory at the moment a new tree is launched while the old one is being torn down. The mutex in BrowserPool serialises this, so the spike is bounded to a single launch's worth of memory (≈ 80 MB).

Changelog

  • 2026-04-26 — initial version. Default profile reduces sustained RSS by ~45 % vs. the v1.0 release that buffered every response body.