Public Website Cloner¶
Component: cloner · Version: v3.6.0
Image: ghcr.io/nollagluiz/web-agent-cloner:v3.6.0
Namespace: web-agents · Replicas: 1
Scope status (post-Scope-Freeze 2026-05-10) — CLONER is now formally the CLONER.Art module with 9 functions (not just web cloning). See modules/cloner-art.md for the canonical 9-function breakdown and help-center/primers/obp-operator-bridge-proxy.md for the air-gap egress integration. This document describes the v3.6.0 web-cloning function (fn #1) in detail; ops detail for fns #2–#9 lives in CLONER_OPERATIONS.md.¶
Table of Contents¶
- Why It Exists
- What It Delivers
- Architecture Overview
- Network Design
- Technology Stack
- The Cloning Engine
- Dual-Network Routing
- Automation Pipeline
- Health Monitoring
- Database Schema
- Content Delivery to Cloned Personas
- Observability
- Configuration Reference
- Key Files
1. Why It Exists¶
The test-bed's primary goal is to measure NGFW TLS decryption performance under realistic traffic conditions. Synthetic Personas (the 20 simulated webservers) are carefully designed to stress-test specific aspects of TLS inspection — API patterns, streaming, large file downloads, e-commerce flows. However, they are purpose-built lab artifacts.
The Cloner exists to answer a different question:
How does the NGFW behave when inspecting real-world website content — with its actual HTML structure, third-party scripts, ad networks, tracking pixels, CDN references, obfuscated JavaScript, and fingerprinting techniques?
Real public websites exercise NGFW inspection engines differently than synthetic workloads. They include:
- Deep JavaScript obfuscation — challenges application identification
- Mixed content patterns — inline CSS, external fonts, cross-origin images
- Dynamic payloads — pages that differ per visit or user agent
- Anti-bot techniques — which stress-test TLS session tracking in the NGFW
- Modern web frameworks — React, Next.js, Vue single-page applications with complex chunked loading
The Cloner captures all of this from the live internet and makes it available as a static, reproducible mirror that Cloned Persona webservers serve to the test agents — through the NGFW.
2. What It Delivers¶
| Capability | Detail |
|---|---|
| Real-world site capture | Downloads full HTML, CSS, JS, images, fonts from any public URL |
| Reproducibility | Frozen static mirror — same content on every test run |
| NGFW realism | Agents access real site content through TLS inspection (not simulated payloads) |
| Zero NGFW bypass | Cloner downloads via its own ISP interface (VLAN 40); agents access cloned content via NGFW (VLANs 200–209) |
| Automated pipeline | Job queue, atomic claiming, status tracking, heartbeat — no manual steps after job creation |
| Anti-detection | Stealth browser mode bypasses bot protection on target sites |
| Live health telemetry | Internet connectivity metrics exposed to Prometheus in real time |
| Slot orchestration | Dashboard assigns cloned sites to up to 10 Cloned Persona slots with a single API call |
3. Architecture Overview¶
┌─────────────────────────────────────────────────────────────────────┐
│ Cloner Pod (namespace: web-agents) │
│ │
│ eth0 ─── K8s OOBI ────────────────────────────────────────────── │
│ │ │
│ ├── Dashboard (http://dashboard:3000) │
│ │ ├── POST /api/clone/agents/register │
│ │ ├── GET /api/clone/agents/{id}/job │
│ │ ├── POST /api/clone/agents/{id}/heartbeat │
│ │ └── PATCH /api/clone/jobs/{id} │
│ └── CoreDNS (10.96.0.10) │
│ │
│ net1 ─── VLAN 40 macvlan ──────────────────────────────────────── │
│ │ DHCP IP from upstream router │
│ └── Internet (TCP 80/443, UDP 443/QUIC, DNS, ICMP) │
│ │
│ /mnt/cloned/{site}/ ─── NFS PVC: cloned-sites (RWX) ─────────── │
└─────────────────────────────────────────────────────────────────────┘
▼ writes over OOBI (eth0)
NFS Server (web-agents/nfs-server) — backed by hostPath
on the node hosting the cloner (role=infra in multi-node,
the only node in single-node).
▼ exported via NFSv4 / OOBI ClusterIP
NFS PVCs (per-namespace, ROX) — 10 slots mount read-only
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Clone Persona Pods (clone-persona-1 … clone-persona-10) │
│ VLANs 200–209 / 10.2.{1..10}.0/27 │
│ Caddy: file_server /mnt/cloned/{SITE_NAME} │
└─────────────────────────────────────────────────────────────────────┘
▲ NGFW inspects TLS
┌─────────────────────────────────────────────────────────────────────┐
│ Test Agents (browser-engine VLAN 20, synthetic-load VLAN 30) │
│ https://10.2.{n}.{x}/ → routed via NGFW │
└─────────────────────────────────────────────────────────────────────┘
The Cloner and the test agents operate on completely separate network paths:
- Cloner downloads: `net1` (VLAN 40) → ISP → public internet. The NGFW is not in this path.
- Agents test: `net1` (VLAN 20/30) → NGFW → Cloned Persona Caddy. The NGFW is in this path.
This separation is fundamental. The NGFW inspects the content that was cloned, but plays no role in the cloning itself — avoiding circular TLS dependencies and contaminated latency measurements.
4. Network Design¶
Two Network Interfaces¶
| Interface | Network | Purpose |
|---|---|---|
| `eth0` | K8s OOBI (cluster default) | Control plane: Dashboard, CoreDNS, Prometheus scraping |
| `net1` | VLAN 40 macvlan, DHCP | Data plane: internet downloads |
The macvlan interface on VLAN 40 is provisioned by Multus CNI using the cloner-isp NetworkAttachmentDefinition (k8s/80-cloner-nad.yaml). It attaches to eth1.40 on the physical trunk, connecting the pod directly to the upstream ISP router segment via the Nexus 9000.
Forced DNS (dnsPolicy: None)¶
The pod uses a custom resolver list that overrides DHCP-provided DNS entirely:
- `8.8.8.8` — Google Public DNS (public name resolution)
- `208.67.222.222` — OpenDNS (redundant public resolver)
- `10.96.0.10` — k3s CoreDNS (internal K8s service names)
The two public resolvers are non-RFC1918 addresses, so queries to them automatically receive the iptables fwmark and exit via net1 (ISP). CoreDNS queries remain on eth0 (OOBI). The result is clean DNS segregation with no split-horizon configuration.
5. Technology Stack¶
| Layer | Technology | Role |
|---|---|---|
| Runtime | Node.js (ESM) | Main process, async event loop |
| Browser automation | browser engine | Headless Chromium for site navigation and resource capture |
| Stealth layer | playwright-extra + puppeteer-extra-plugin-stealth | Bypasses anti-bot detection on target sites |
| Network routing | Linux policy routing (`ip rule`, `ip route`) + iptables mangle | Dual-NIC traffic segregation |
| CNI | Multus (`k8s.v1.cni.cncf.io/networks` annotation) | Secondary network interface (VLAN 40) attachment |
| Storage | Kubernetes PersistentVolumeClaim `cloned-sites` (in-cluster NFS, RWX) | Persistent mirror storage at `/mnt/cloned/` |
| Orchestration | Dashboard REST API (PostgreSQL-backed) | Job queue, agent registry, slot assignment |
| Metrics | Prometheus text format, exposed on `:8081/metrics` | Internet health telemetry |
| Container | Alpine 3.21 (initContainer), custom Node.js image (main) | initContainer for routing setup, main image for cloning |
6. The Cloning Engine¶
Source: cloner/src/cloner.ts
The cloning engine uses a full Chromium browser (not a simple HTTP crawler) because modern websites require JavaScript execution to render their full asset tree. A curl-based approach would miss dynamically injected scripts, lazy-loaded images, and SPA-generated routes.
Browser Configuration¶
const browser = await chromium.launch({
headless: false, // Xvfb-based display — not truly headless, harder to detect
args: [
'--no-sandbox',
'--disable-setuid-sandbox',
'--disable-dev-shm-usage',
'--disable-blink-features=AutomationControlled',
'--window-size=1920,1080',
],
});
const context = await browser.newContext({
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ...',
viewport: { width: 1920, height: 1080 },
locale: 'en-US',
timezoneId: 'America/New_York',
});
The --disable-blink-features=AutomationControlled flag removes the Blink automation hint that many anti-bot systems check. Combined with puppeteer-extra-plugin-stealth, the browser presents as a normal user session.
navigator.webdriver Bypass¶
await page.addInitScript(() => {
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
});
This script runs before any page JavaScript, ensuring navigator.webdriver is undefined rather than true — the standard fingerprint check used by Cloudflare, Akamai, and similar bot protection systems.
Full Resource Interception¶
Every network response is intercepted and captured in memory before fulfillment:
await context.route('**/*', async (route) => {
const response = await route.fetch();
const body = await response.body();
resources.set(route.request().url(), {
data: body,
contentType: response.headers()['content-type'],
});
await route.fulfill({ response });
});
This captures all resources — HTML, CSS, JavaScript, images, fonts, JSON API responses, SVGs — regardless of whether they are loaded synchronously or asynchronously by JavaScript.
Lazy-Load Trigger¶
After the initial networkidle wait, the engine scrolls to the bottom of the page to trigger viewport-based lazy loading:
await page.goto(url, { waitUntil: 'networkidle', timeout: 60_000 });
await page.evaluate(() =>
window.scrollTo({ top: document.body.scrollHeight, behavior: 'smooth' }),
);
await page.waitForTimeout(2_000);
Asset Writing and URL Rewriting¶
Captured resources are written to disk under CLONE_STORAGE_PATH/{personaName}/ following a path structure derived from the resource URLs:
- Same-origin resources: path mirrors the URL path (`/about/` → `about/index.html`)
- Cross-origin resources: written to `_ext/{hostname}/{sanitized_path}`
- Extensionless paths: `.html` appended automatically
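A minimal sketch of this mapping (the helper name toLocalPath and the exact sanitization rule are illustrative, not the actual code in cloner.ts):
import * as path from 'node:path';
// Illustrative sketch of the path-mapping rules listed above.
function toLocalPath(resourceUrl: string, siteOrigin: string): string {
  const u = new URL(resourceUrl);
  // Drop the query string, decode the path, and replace characters unsafe on disk.
  const sanitized = decodeURIComponent(u.pathname).replace(/[^a-zA-Z0-9/._-]/g, '_');
  let rel: string;
  if (u.origin === siteOrigin) {
    // Same-origin: mirror the URL path.
    rel = sanitized.replace(/^\//, '');
  } else {
    // Cross-origin: park under _ext/{hostname}/...
    rel = path.posix.join('_ext', u.hostname, sanitized.replace(/^\//, ''));
  }
  // Directory-style paths get index.html; extensionless paths get .html appended.
  if (rel === '' || rel.endsWith('/')) {
    rel = path.posix.join(rel, 'index.html');
  } else if (!path.posix.extname(rel)) {
    rel += '.html';
  }
  return rel;
}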
HTML files are post-processed to rewrite absolute same-origin URLs to relative paths, making the mirror self-contained:
function rewriteHtml(html: string, baseUrl: string): string {
  const origin = new URL(baseUrl).origin;
  // Escape regex metacharacters so the origin can be embedded in a pattern
  const escaped = origin.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  return html
    .replace(new RegExp(`(href|src|action)="${escaped}`, 'g'), '$1="')
    .replace(new RegExp(`(href|src|action)='${escaped}`, 'g'), "$1='");
}
7. Dual-Network Routing¶
Source: routing-init initContainer, k8s/81-cloner-deployment.yaml
The routing setup is performed by an Alpine Linux initContainer with NET_ADMIN and NET_RAW capabilities. It runs before the main cloner container starts and configures the kernel so that:
- All non-RFC1918 traffic exits via `net1` (ISP)
- All RFC1918 traffic stays on `eth0` (K8s OOBI)
Implementation¶
# 1. Wait for DHCP on net1 (up to 60 s), then read the assigned address
for i in $(seq 1 60); do
  ISP_ADDR=$(ip -4 addr show net1 | awk '/inet / {print $2; exit}')
  [ -n "$ISP_ADDR" ] && break
  sleep 1
done
# Gateway taken from the DHCP-installed default route on net1
ISP_GW=$(ip -4 route show dev net1 default | awk '{print $3; exit}')
# 2. Routing table 100: default exit via ISP gateway
ip route add table 100 default via "$ISP_GW" dev net1
# 3. Policy rule: fwmark 100 → table 100
ip rule add fwmark 100 table 100 priority 10
# 4. iptables mangle: mark all non-RFC1918 OUTPUT packets
for RFC1918 in 10.0.0.0/8 172.16.0.0/12 192.168.0.0/16 127.0.0.0/8; do
  iptables -t mangle -A OUTPUT -d "$RFC1918" -j RETURN
done
iptables -t mangle -A OUTPUT -j MARK --set-mark 100
Traffic covered by the ISP routing mark¶
| Protocol | Port | Purpose |
|---|---|---|
| TCP | 80 | HTTP downloads |
| TCP | 443 | HTTPS/HTTP2 downloads |
| UDP | 443 | HTTP/3 QUIC downloads |
| UDP | 53 | DNS queries to 8.8.8.8 / 208.67.222.222 |
| ICMP | — | Health monitor pings |
Gateway Discovery for Health Monitor¶
The initContainer writes the discovered ISP gateway to a shared emptyDir volume:
echo "$ISP_GW" > /var/isp-config/gateway.txt
echo "$ISP_IP" > /var/isp-config/isp-ip.txt
The main container reads gateway.txt at startup to begin pinging the ISP gateway.
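A minimal sketch of that startup read (the helper name readIspGateway is illustrative):
import { readFile } from 'node:fs/promises';
// Read the ISP gateway discovered by the routing-init initContainer.
// Returns undefined when the file is absent (e.g. Docker Compose mode),
// in which case the health monitor falls back to route-table discovery.
async function readIspGateway(): Promise<string | undefined> {
  try {
    const text = await readFile('/var/isp-config/gateway.txt', 'utf8');
    const gw = text.trim();
    return gw.length > 0 ? gw : undefined;
  } catch {
    return undefined;
  }
}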
8. Automation Pipeline¶
Source: cloner/src/index.ts, cloner/src/controller-client.ts
The cloner operates as a worker agent in a producer-consumer queue backed by PostgreSQL.
Startup Sequence¶
1. Start HTTP server on :8081 (healthz + metrics + file serving)
2. Start internet health monitor (ping loop every 10 s)
3. Register with Dashboard (POST /api/clone/agents/register, with retries)
4. Start heartbeat loop (POST /api/clone/agents/{id}/heartbeat every 15 s)
5. Start poll loop (GET /api/clone/agents/{id}/job every 10 s)
Registration uses exponential backoff (5 s × attempt, capped at 60 s), handling the case where the Dashboard is not yet ready when the cloner pod starts.
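A sketch of that registration loop, assuming Node 18+'s global fetch (the function name and payload fields are illustrative, not necessarily those in controller-client.ts):
// Illustrative registration loop: delay grows 5 s × attempt, capped at 60 s.
async function registerWithRetry(baseUrl: string, token: string, agent: {
  id: string; hostname: string; podIp: string; version: string;
}): Promise<void> {
  for (let attempt = 1; ; attempt++) {
    try {
      const res = await fetch(`${baseUrl}/api/clone/agents/register`, {
        method: 'POST',
        headers: {
          'content-type': 'application/json',
          authorization: `Bearer ${token}`,
        },
        body: JSON.stringify(agent),
      });
      if (res.ok) return;
      throw new Error(`register failed: HTTP ${res.status}`);
    } catch (err) {
      const delayMs = Math.min(5_000 * attempt, 60_000);
      console.warn(`registration attempt ${attempt} failed (${err}); retrying in ${delayMs} ms`);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}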
Job Claiming (SKIP LOCKED)¶
Jobs are claimed atomically via a PostgreSQL SELECT ... FOR UPDATE SKIP LOCKED pattern. This ensures that if multiple Cloner replicas are running simultaneously, each job is processed by exactly one agent:
UPDATE clone_jobs
SET status = 'running', agent_id = $1, started_at = now()
WHERE id = (
SELECT id FROM clone_jobs
WHERE status = 'pending'
ORDER BY created_at ASC
LIMIT 1
FOR UPDATE SKIP LOCKED
)
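On the Dashboard side, the claim could be executed roughly like the following sketch using the pg driver, with RETURNING * added so the handler gets the claimed row back — an assumption; the Dashboard's actual query code may differ:
import { Pool } from 'pg';
const pool = new Pool(); // connection settings from the usual PG* env vars
// Atomically claim the oldest pending job for the given agent.
// Resolves to undefined when no pending job exists.
async function claimJob(agentId: string) {
  const { rows } = await pool.query(
    `UPDATE clone_jobs
        SET status = 'running', agent_id = $1, started_at = now()
      WHERE id = (
            SELECT id FROM clone_jobs
             WHERE status = 'pending'
             ORDER BY created_at ASC
             LIMIT 1
             FOR UPDATE SKIP LOCKED
      )
      RETURNING *`,
    [agentId],
  );
  return rows[0];
}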
Job Lifecycle¶
Dashboard operator creates job
│
▼
clone_jobs.status = 'pending'
│
▼ (cloner polls, claims with SKIP LOCKED)
clone_jobs.status = 'running'
clone_agents.status = 'running'
│
├── Success ──▶ status = 'completed', asset_count, total_bytes recorded
└── Failure ──▶ status = 'failed', error_message recorded
clone_agents.status = 'idle'
Sequential Execution¶
The poll loop is guarded by a busy flag — only one job runs at a time per cloner instance. This prevents concurrent Chromium sessions from competing for the same ISP bandwidth and ensures deterministic behavior:
let busy = false;
const pollTimer = setInterval(async () => {
  if (busy) return;
  busy = true;
  try {
    // ... poll for a job and clone the site ...
  } finally {
    busy = false; // release the guard even if the clone throws
  }
}, config.pollIntervalMs);
9. Health Monitoring¶
Source: cloner/src/health-monitor.ts
The health monitor runs a parallel ping loop every 10 seconds, tracking connectivity on three targets:
| Target | Discovery | Purpose |
|---|---|---|
| `8.8.8.8` | Fixed | Google Public DNS — confirms ISP internet access |
| `1.1.1.1` | Fixed | Cloudflare Public DNS — redundant internet check |
| ISP gateway | From `/var/isp-config/gateway.txt` | Confirms L3 reachability to upstream router |
Gateway Discovery Fallback Chain¶
If the /var/isp-config/gateway.txt file is not present (e.g., in Docker Compose mode without the initContainer), the monitor falls back through:
1. `ip -4 route show dev net1 default` — ISP-specific default route
2. `ip -4 route show dev eth1 default` — alternate ISP interface name
3. `ip -4 route show default` — container's general default gateway
Ping Implementation¶
Uses ICMP via the system ping command (ping -c 1 -W 3 <target>) and parses the RTT from the output. The check is non-blocking and runs all three targets concurrently.
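A minimal sketch of one probe cycle (the promisified exec and RTT regex are illustrative, not necessarily how health-monitor.ts does it):
import { exec } from 'node:child_process';
import { promisify } from 'node:util';
const execAsync = promisify(exec);
interface PingResult {
  up: boolean;
  rttMs?: number;
}
// Single non-blocking probe: one echo request, 3 s timeout, RTT parsed
// from the "time=12.3 ms" fragment of ping's output.
async function pingOnce(target: string): Promise<PingResult> {
  try {
    const { stdout } = await execAsync(`ping -c 1 -W 3 ${target}`);
    const match = stdout.match(/time=([\d.]+) ms/);
    return { up: true, rttMs: match ? parseFloat(match[1]) : undefined };
  } catch {
    return { up: false }; // non-zero exit code means the target did not answer
  }
}
// All three targets are probed concurrently in each 10 s cycle.
async function checkAll(targets: string[]): Promise<Map<string, PingResult>> {
  const results = await Promise.all(targets.map((t) => pingOnce(t)));
  return new Map(targets.map((t, i) => [t, results[i]]));
}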
10. Database Schema¶
Migrations: dashboard/src/db/migrations/0017_clone_stack.sql, 0018_clone_persona_slots.sql
clone_agents¶
Tracks registered cloner pods. Each pod registers on startup and updates last_seen_at via heartbeat.
| Column | Type | Description |
|---|---|---|
| `id` | text PK | Unique agent identifier (`cloner-{hostname}-{random}`) |
| `hostname` | text | Pod hostname |
| `pod_ip` | text | eth0 IP of the pod |
| `version` | text | Cloner image version |
| `status` | enum | `unknown` / `idle` / `running` / `error` / `offline` |
| `last_seen_at` | timestamptz | Last heartbeat timestamp |
clone_jobs¶
The job queue. Each row is one website clone request.
| Column | Type | Description |
|---|---|---|
| `id` | uuid PK | Job identifier |
| `url` | text | Target public URL to clone |
| `persona_name` | text FK → personas | Destination persona slot name |
| `status` | enum | `pending` → `running` → `completed` / `failed` / `cancelled` |
| `agent_id` | text FK → clone_agents | Which cloner is/was executing |
| `asset_count` | integer | Number of resources captured |
| `total_bytes` | bigint | Total bytes written to disk |
| `error_message` | text | Failure description if `status = failed` |
| `started_at` | timestamptz | When the cloner began executing |
| `completed_at` | timestamptz | When the job finished |
clone_persona_slots¶
Tracks the 10 Cloned Persona webserver slots. Pre-populated with 10 rows on migration.
| Column | Type | Description |
|---|---|---|
| `slot_id` | integer PK (1–10) | Slot number |
| `site_name` | text | Site currently assigned (NULL = inactive) |
| `replicas` | integer | Desired Caddy replicas (0 = stopped) |
| `status` | enum | `inactive` / `starting` / `active` / `error` |
| `vlan_id` | integer (generated) | Always `200 + slot_id - 1` |
| `subnet` | text (generated) | Always `10.2.{slot_id}.0/27` |
| `namespace` | text (generated) | Always `clone-persona-{slot_id}` |
11. Content Delivery to Cloned Personas¶
Once cloning completes, the content lives in the cloned-sites PVC under /mnt/cloned/{site_name}/. The Cloned Persona webservers (10 slots, VLANs 200–209) mount this PVC read-only and serve the content directly.
Slot Activation Flow¶
Operator: PATCH /api/clone/persona-slots/3
{ "siteName": "example-shop", "replicas": 1 }
│
▼
Dashboard: 1. Update DB: clone_persona_slots row 3
site_name = 'example-shop'
replicas = 1
status = 'starting'
2. K8s API: patch ConfigMap clone-persona-3-config
SITE_NAME = "example-shop"
3. K8s API: scale Deployment caddy (clone-persona-3) → 1
│
▼
Stakater Reloader: detects ConfigMap change → triggers rolling restart
│
▼
Caddy pod: SITE_NAME=example-shop
root * /mnt/cloned/example-shop
file_server { precompressed gzip }
│
▼
Test agents access: https://10.2.3.{x}/ → NGFW → Caddy → /mnt/cloned/example-shop/
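Issued programmatically, the PATCH at the top of that flow might look like this sketch (authentication is omitted here because the operator API's auth scheme is not described above):
// Assign the 'example-shop' mirror to slot 3 and start one Caddy replica
// (illustrative client call against the Dashboard slot API).
async function activateSlot(dashboardUrl: string): Promise<void> {
  const res = await fetch(`${dashboardUrl}/api/clone/persona-slots/3`, {
    method: 'PATCH',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({ siteName: 'example-shop', replicas: 1 }),
  });
  if (!res.ok) throw new Error(`slot activation failed: HTTP ${res.status}`);
}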
No Reverse Proxy¶
Cloned Persona pods do not proxy to any external service. Content is served directly from the PVC via Caddy's file_server directive. This means:
- Zero OOBI traffic from Cloned Persona pods during test execution
- No intermediate hop that could mask or alter the content the NGFW inspects
- No dependency on the Cloner pod being healthy during test runs — the PVC is independent
12. Observability¶
Prometheus Metrics (:8081/metrics)¶
| Metric | Type | Description |
|---|---|---|
| `cloner_internet_up{target}` | gauge | 1 if last ICMP ping to 8.8.8.8 or 1.1.1.1 succeeded |
| `cloner_internet_any_up` | gauge | 1 if at least one internet target is reachable |
| `cloner_gateway_up{gateway}` | gauge | 1 if DHCP gateway responds to ICMP |
| `cloner_ping_rtt_ms{target}` | gauge | Round-trip time of last successful ping (ms) |
| `cloner_ping_checks_total{target,result}` | counter | Cumulative ping checks by result (`ok` / `fail`) |
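A sketch of how the /metrics handler could render the gauge series in Prometheus text exposition format (the renderMetrics helper and its state shape are illustrative, not the actual server.ts code):
// Render the internet-health gauges in Prometheus text exposition format.
function renderMetrics(state: {
  targets: Record<string, { up: boolean; rttMs?: number }>;
  gateway?: { address: string; up: boolean };
}): string {
  const lines: string[] = [];
  lines.push('# TYPE cloner_internet_up gauge');
  for (const [target, s] of Object.entries(state.targets)) {
    lines.push(`cloner_internet_up{target="${target}"} ${s.up ? 1 : 0}`);
  }
  const anyUp = Object.values(state.targets).some((s) => s.up);
  lines.push('# TYPE cloner_internet_any_up gauge');
  lines.push(`cloner_internet_any_up ${anyUp ? 1 : 0}`);
  if (state.gateway) {
    lines.push('# TYPE cloner_gateway_up gauge');
    lines.push(`cloner_gateway_up{gateway="${state.gateway.address}"} ${state.gateway.up ? 1 : 0}`);
  }
  lines.push('# TYPE cloner_ping_rtt_ms gauge');
  for (const [target, s] of Object.entries(state.targets)) {
    if (s.rttMs !== undefined) {
      lines.push(`cloner_ping_rtt_ms{target="${target}"} ${s.rttMs}`);
    }
  }
  return lines.join('\n') + '\n';
}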
ServiceMonitor¶
A Prometheus Operator ServiceMonitor (k8s/71-service-monitors.yaml) scrapes :8081/metrics every 15 seconds. No relabeling is needed — metrics are already cloner-specific.
Grafana Dashboard¶
The "Cloner — Internet Health & Jobs" row in observability/grafana/dashboards/web-agent-cluster.json contains:
| Panel | Query |
|---|---|
| Acesso Internet (ISP) | cloner_internet_any_up |
| Ping 8.8.8.8 | cloner_internet_up{target="8.8.8.8"} |
| Ping 1.1.1.1 | cloner_internet_up{target="1.1.1.1"} |
| Gateway ISP | cloner_gateway_up |
| Ping RTT (ms) | cloner_ping_rtt_ms time series |
| Ping RTT Gateway ISP | cloner_ping_rtt_ms{target=~"gateway.*"} time series |
13. Configuration Reference¶
ConfigMap cloner-config¶
| Variable | Default | Description |
|---|---|---|
| `CONTROLLER_URL` | `http://dashboard:3000` | Dashboard base URL |
| `CLONE_STORAGE_PATH` | `/mnt/cloned` | Root directory for cloned site mirrors |
| `SERVE_PORT` | `8081` | Port for HTTP server (healthz, metrics) |
| `POLL_INTERVAL_MS` | `10000` | How often to poll Dashboard for new jobs (ms) |
| `HEARTBEAT_INTERVAL_MS` | `15000` | How often to send heartbeat to Dashboard (ms) |
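A sketch of how these variables could be loaded into a typed config object with the defaults above (the shape of the real cloner/src/config.ts may differ):
// Illustrative environment loading using the defaults from the table above.
export interface ClonerConfig {
  controllerUrl: string;
  cloneStoragePath: string;
  servePort: number;
  pollIntervalMs: number;
  heartbeatIntervalMs: number;
}
function intEnv(name: string, fallback: number): number {
  const raw = process.env[name];
  const value = raw ? Number.parseInt(raw, 10) : NaN;
  return Number.isFinite(value) ? value : fallback;
}
export const config: ClonerConfig = {
  controllerUrl: process.env.CONTROLLER_URL ?? 'http://dashboard:3000',
  cloneStoragePath: process.env.CLONE_STORAGE_PATH ?? '/mnt/cloned',
  servePort: intEnv('SERVE_PORT', 8081),
  pollIntervalMs: intEnv('POLL_INTERVAL_MS', 10_000),
  heartbeatIntervalMs: intEnv('HEARTBEAT_INTERVAL_MS', 15_000),
};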
Secret cloner-secrets¶
| Variable | Description |
|---|---|
| `CONTROLLER_TOKEN` | Bearer token matching `AGENT_API_TOKEN` in Dashboard secrets |
Resource Limits¶
| Resource | Request | Limit |
|---|---|---|
| CPU | 500m | 2000m |
| Memory | 1 Gi | 4 Gi |
| Ephemeral storage | 512 Mi | 2 Gi |
The 4 Gi memory limit accommodates Chromium's memory usage during page loading. Sites with heavy JavaScript frameworks can use 1–2 Gi during initial render.
Storage — in-cluster NFS (cloned-sites)¶
The cloned-sites volume is shared between the Cloner (writer) and the 10 Clone Persona slot pods (readers). Because PVCs are namespace-scoped and the slots live in 10 separate namespaces, the share is provisioned as 11 static PV/PVC pairs all pointing at the same NFS export served by an in-cluster NFS server.
| Component | Manifest | Purpose |
|---|---|---|
| NFS server Deployment + Service | `k8s/dut/35-nfs-server.yaml` | Single-replica NFS server. nodeAffinity prefers `role=infra`; falls through on single-node. Backed by hostPath `/var/lib/agent-cluster/cloned-sites`. Recreate strategy keeps a single writer. A `prepare-share` initContainer chowns the export to UID 10004 once before the daemon boots. |
| Static PVs + PVCs | `k8s/dut/36-cloned-sites-pvs.yaml` | 1 RWX writer pair (`web-agents/cloned-sites`) + 10 ROX reader pairs (`clone-persona-N/cloned-sites`). All point to the same NFS export root. `storageClassName: ""` prevents CSI provisioners from racing to claim them. |
Topology (multi-node):
- Cloner pod runs on `role=infra` (UCS-4) — same node as the NFS server.
- Clone-persona slot pods run on `role=ngfw-dut` (UCS-1).
- All NFS traffic flows over the OOBI control plane (eth0 → ClusterIP Service `nfs-server.web-agents.svc.cluster.local:2049`). The data plane (`net1` macvlan, VLAN 40 ISP and VLANs 200–209 slot subnets) carries no storage I/O — it stays exclusive for agent ↔ NGFW ↔ persona traffic so timing measurements remain uncontaminated.
Topology (single-node):
- Everything lands on the only node. The NFS server's hostPath stays local; clients still mount via NFS, but the round-trip is loopback — overhead is negligible.
| Attribute | Value |
|---|---|
| Backend | NFSv4 (in-cluster Service) |
| Capacity | 50 Gi (matches PV claim) |
| Access mode (writer) | ReadWriteMany |
| Access mode (readers, 10 slots) | ReadOnlyMany |
| Mount (Cloner) | /mnt/cloned (read-write) |
| Mount (Cloned Personas) | /mnt/cloned (read-only) |
| Network plane | OOBI only (eth0) — not the macvlan data plane |
| Multi-node co-location req | Cloner + NFS server on same node (role=infra) |
14. Key Files¶
| File | Description |
|---|---|
| `cloner/src/index.ts` | Main loop: startup, registration, heartbeat, poll |
| `cloner/src/cloner.ts` | Browser engine-based cloning engine with stealth and URL rewriting |
| `cloner/src/health-monitor.ts` | ICMP ping loop and Prometheus metric emission |
| `cloner/src/controller-client.ts` | Dashboard API client (register, heartbeat, poll, complete) |
| `cloner/src/config.ts` | Environment variable configuration loading |
| `cloner/src/server.ts` | HTTP server: /healthz, /metrics |
| `k8s/80-cloner-nad.yaml` | Multus NAD for VLAN 40 ISP macvlan interface |
| `k8s/81-cloner-deployment.yaml` | Deployment, initContainer, routing script, Service |
| `k8s/82-cloner-network-policy.yaml` | NetworkPolicy: egress TCP 80/443, UDP 443/53, ICMP |
| `k8s/dut/35-nfs-server.yaml` | In-cluster NFS server backing the shared cloned-sites volume |
| `k8s/dut/36-cloned-sites-pvs.yaml` | 11 static PV/PVC pairs (1 writer + 10 slot readers) |
| `k8s/dut/30-cni-dhcp-daemon.yaml` | CNI DHCP daemon DaemonSet — required for the cloner's VLAN 40 IPAM `dhcp` |
| `k8s/clone-personas/` | 10 Cloned Persona slots that serve the captured content |
| `dashboard/src/db/migrations/0017_clone_stack.sql` | `clone_agents` + `clone_jobs` tables |
| `dashboard/src/db/migrations/0018_clone_persona_slots.sql` | `clone_persona_slots` table |
| `dashboard/src/app/api/clone/` | Dashboard REST API for cloner orchestration |