
Public Website Cloner

Component: cloner · Version: v3.6.0
Image: ghcr.io/nollagluiz/web-agent-cloner:v3.6.0
Namespace: web-agents · Replicas: 1

Scope status (post-Scope-Freeze 2026-05-10) — CLONER is now formally the CLONER.Art module with 9 functions (not just web cloning). See modules/cloner-art.md for the canonical 9-function breakdown and help-center/primers/obp-operator-bridge-proxy.md for the air-gap egress integration. This document describes the v3.6.0 web-cloning function (fn #1) in detail; ops detail for fns #2–#9 lives in CLONER_OPERATIONS.md.


Table of Contents

  1. Why It Exists
  2. What It Delivers
  3. Architecture Overview
  4. Network Design
  5. Technology Stack
  6. The Cloning Engine
  7. Dual-Network Routing
  8. Automation Pipeline
  9. Health Monitoring
  10. Database Schema
  11. Content Delivery to Cloned Personas
  12. Observability
  13. Configuration Reference
  14. Key Files

1. Why It Exists

The test-bed's primary goal is to measure NGFW TLS decryption performance under realistic traffic conditions. Synthetic Personas (the 20 simulated webservers) are carefully designed to stress-test specific aspects of TLS inspection — API patterns, streaming, large file downloads, e-commerce flows. However, they are purpose-built lab artifacts.

The Cloner exists to answer a different question:

How does the NGFW behave when inspecting real-world website content — with its actual HTML structure, third-party scripts, ad networks, tracking pixels, CDN references, obfuscated JavaScript, and fingerprinting techniques?

Real public websites exercise NGFW inspection engines differently than synthetic workloads. They include:

  • Deep JavaScript obfuscation — challenges application identification
  • Mixed content patterns — inline CSS, external fonts, cross-origin images
  • Dynamic payloads — pages that differ per visit or user agent
  • Anti-bot techniques — which stress-test TLS session tracking in the NGFW
  • Modern web frameworks — React, Next.js, Vue single-page applications with complex chunked loading

The Cloner captures all of this from the live internet and makes it available as a static, reproducible mirror that Cloned Persona webservers serve to the test agents — through the NGFW.


2. What It Delivers

Capability Detail
Real-world site capture Downloads full HTML, CSS, JS, images, fonts from any public URL
Reproducibility Frozen static mirror — same content on every test run
NGFW realism Agents access real site content through TLS inspection (not simulated payloads)
Zero NGFW bypass Cloner downloads via its own ISP interface (VLAN 40); agents access cloned content via NGFW (VLANs 200–209)
Automated pipeline Job queue, atomic claiming, status tracking, heartbeat — no manual steps after job creation
Anti-detection Stealth browser mode bypasses bot protection on target sites
Live health telemetry Internet connectivity metrics exposed to Prometheus in real time
Slot orchestration Dashboard assigns cloned sites to up to 10 Cloned Persona slots with a single API call

3. Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│  Cloner Pod (namespace: web-agents)                                 │
│                                                                     │
│  eth0 ─── K8s OOBI ──────────────────────────────────────────────  │
│              │                                                      │
│              ├── Dashboard  (http://dashboard:3000)                 │
│              │    ├── POST /api/clone/agents/register               │
│              │    ├── GET  /api/clone/agents/{id}/job               │
│              │    ├── POST /api/clone/agents/{id}/heartbeat         │
│              │    └── PATCH /api/clone/jobs/{id}                    │
│              └── CoreDNS  (10.96.0.10)                             │
│                                                                     │
│  net1 ─── VLAN 40 macvlan ────────────────────────────────────────  │
│              │  DHCP IP from upstream router                        │
│              └── Internet (TCP 80/443, UDP 443/QUIC, DNS, ICMP)    │
│                                                                     │
│  /mnt/cloned/{site}/   ─── NFS PVC: cloned-sites (RWX) ───────────  │
└─────────────────────────────────────────────────────────────────────┘
         ▼ writes over OOBI (eth0)
NFS Server (web-agents/nfs-server) — backed by hostPath
on the node hosting the cloner (role=infra in multi-node,
the only node in single-node).
         ▼ exported via NFSv4 / OOBI ClusterIP
NFS PVCs (per-namespace, ROX) — 10 slots mount read-only
         ▼
┌─────────────────────────────────────────────────────────────────────┐
│  Clone Persona Pods (clone-persona-1 … clone-persona-10)           │
│  VLANs 200–209 / 10.2.{1..10}.0/27                                │
│  Caddy: file_server /mnt/cloned/{SITE_NAME}                        │
└─────────────────────────────────────────────────────────────────────┘
         ▲ NGFW inspects TLS
┌─────────────────────────────────────────────────────────────────────┐
│  Test Agents (browser-engine VLAN 20, synthetic-load VLAN 30)      │
│  https://10.2.{n}.{x}/ → routed via NGFW                          │
└─────────────────────────────────────────────────────────────────────┘

The Cloner and the test agents operate on completely separate network paths:

  • Cloner downloads: net1 (VLAN 40) → ISP → public internet. The NGFW is not in this path.
  • Agents test: net1 (VLAN 20/30) → NGFW → Cloned Persona Caddy. The NGFW is in this path.

This separation is fundamental. The NGFW inspects the content that was cloned, but plays no role in the cloning itself — avoiding circular TLS dependencies and contaminated latency measurements.


4. Network Design

Two Network Interfaces

Interface Network Purpose
eth0 K8s OOBI (cluster default) Control plane: Dashboard, CoreDNS, Prometheus scraping
net1 VLAN 40 macvlan, DHCP Data plane: internet downloads

The macvlan interface on VLAN 40 is provisioned by Multus CNI using the cloner-isp NetworkAttachmentDefinition (k8s/80-cloner-nad.yaml). It attaches to eth1.40 on the physical trunk, connecting the pod directly to the upstream ISP router segment via the Nexus 9000.

Forced DNS (dnsPolicy: None)

The pod uses a custom resolver list that overrides DHCP-provided DNS entirely:

8.8.8.8          Google Public DNS  — public name resolution
208.67.222.222   OpenDNS            — redundant public resolver
10.96.0.10       k3s CoreDNS        — internal K8s service names

DNS queries to the two public resolvers target non-RFC1918 addresses, so they automatically receive the iptables fwmark and exit via net1 (ISP). CoreDNS queries remain on eth0 (OOBI). The result is clean DNS segregation with no split-horizon configuration.


5. Technology Stack

Layer Technology Role
Runtime Node.js (ESM) Main process, async event loop
Browser automation Playwright Headless Chromium for site navigation and resource capture
Stealth layer playwright-extra + puppeteer-extra-plugin-stealth Bypasses anti-bot detection on target sites
Network routing Linux policy routing (ip rule, ip route) + iptables mangle Dual-NIC traffic segregation
Multus CNI k8s.v1.cni.cncf.io/networks Secondary network interface (VLAN 40) attachment
Storage Kubernetes PersistentVolumeClaim cloned-sites (NFS-backed RWX, 50 Gi) Persistent mirror storage at /mnt/cloned/
Orchestration Dashboard REST API (PostgreSQL-backed) Job queue, agent registry, slot assignment
Metrics Prometheus text format, exposed on :8081/metrics Internet health telemetry
Container Alpine 3.21 (initContainer), custom Node.js image (main) initContainer for routing setup, main image for cloning

6. The Cloning Engine

Source: cloner/src/cloner.ts

The cloning engine uses a full Chromium browser (not a simple HTTP crawler) because modern websites require JavaScript execution to render their full asset tree. A curl-based approach would miss dynamically injected scripts, lazy-loaded images, and SPA-generated routes.

Browser Configuration

const browser = await chromium.launch({
  headless: false,   // Xvfb-based display — not truly headless, harder to detect
  args: [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-dev-shm-usage',
    '--disable-blink-features=AutomationControlled',
    '--window-size=1920,1080',
  ],
});

const context = await browser.newContext({
  userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ...',
  viewport: { width: 1920, height: 1080 },
  locale: 'en-US',
  timezoneId: 'America/New_York',
});

The --disable-blink-features=AutomationControlled flag removes the Blink automation hint that many anti-bot systems check. Combined with puppeteer-extra-plugin-stealth, the browser presents as a normal user session.

await page.addInitScript(() => {
  Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
});

This script runs before any page JavaScript, ensuring navigator.webdriver is undefined rather than true — the standard fingerprint check used by Cloudflare, Akamai, and similar bot protection systems.

Full Resource Interception

Every network response is intercepted and captured in memory before fulfillment:

await context.route('**/*', async (route) => {
  const response = await route.fetch();
  const body = await response.body();
  resources.set(route.request().url(), {
    data: body,
    contentType: response.headers()['content-type'],
  });
  await route.fulfill({ response });
});

This captures all resources — HTML, CSS, JavaScript, images, fonts, JSON API responses, SVGs — regardless of whether they are loaded synchronously or asynchronously by JavaScript.

Lazy-Load Trigger

After the initial networkidle wait, the engine scrolls to the bottom of the page to trigger viewport-based lazy loading:

await page.goto(url, { waitUntil: 'networkidle', timeout: 60_000 });
await page.evaluate(() =>
  window.scrollTo({ top: document.body.scrollHeight, behavior: 'smooth' }),
);
await page.waitForTimeout(2_000);

Asset Writing and URL Rewriting

Captured resources are written to disk under CLONE_STORAGE_PATH/{personaName}/ following a path structure derived from the resource URLs (see the sketch after this list):

  • Same-origin resources: path mirrors the URL path (/about/ → about/index.html)
  • Cross-origin resources: written to _ext/{hostname}/{sanitized_path}
  • Extensionless paths: .html appended automatically
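
A minimal TypeScript sketch of that mapping — assetPathFor is a hypothetical helper name, and the real logic in cloner/src/cloner.ts may handle more edge cases:

import path from 'node:path';

// Hypothetical helper: map a captured resource URL to a relative on-disk path
// under CLONE_STORAGE_PATH/{personaName}/ using the rules above.
function assetPathFor(resourceUrl: string, baseOrigin: string): string {
  const u = new URL(resourceUrl);
  let p = u.pathname;
  if (p.endsWith('/')) p += 'index.html';          // directory-style path → index.html
  else if (!path.posix.extname(p)) p += '.html';   // extensionless path → .html appended
  if (u.origin === baseOrigin) {
    return p.replace(/^\//, '');                   // same-origin: mirror the URL path
  }
  const sanitized = p.replace(/[^\w.\/-]/g, '_').replace(/^\//, '');
  return path.posix.join('_ext', u.hostname, sanitized);   // cross-origin: _ext/{hostname}/...
}

// assetPathFor('https://example.com/about/', 'https://example.com')     → 'about/index.html'
// assetPathFor('https://cdn.example.net/app.js', 'https://example.com') → '_ext/cdn.example.net/app.js'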

HTML files are post-processed to rewrite absolute same-origin URLs to relative paths, making the mirror self-contained:

function rewriteHtml(html: string, baseUrl: string): string {
  const origin = new URL(baseUrl).origin;
  // Escape regex metacharacters in the origin before building the patterns
  const escaped = origin.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  return html
    .replace(new RegExp(`(href|src|action)="${escaped}`, 'g'), '$1="')
    .replace(new RegExp(`(href|src|action)='${escaped}`, 'g'), "$1='");
}

7. Dual-Network Routing

Source: routing-init initContainer, k8s/81-cloner-deployment.yaml

The routing setup is performed by an Alpine Linux initContainer with NET_ADMIN and NET_RAW capabilities. It runs before the main cloner container starts and configures the kernel so that:

  • All non-RFC1918 traffic exits via net1 (ISP)
  • All RFC1918 traffic stays on eth0 (K8s OOBI)

Implementation

# 1. Wait for DHCP on net1 (up to 60 s), then read the assigned address
ISP_ADDR=$(ip -4 addr show net1 | awk '/inet / {print $2; exit}')
# (gateway derived here from net1's default route; illustrative — the deployed script may read it from the DHCP lease)
ISP_GW=$(ip -4 route show dev net1 default | awk '{print $3; exit}')

# 2. Routing table 100: default exit via ISP gateway
ip route add table 100 default via "$ISP_GW" dev net1

# 3. Policy rule: fwmark 100 → table 100
ip rule add fwmark 100 table 100 priority 10

# 4. iptables mangle: mark all non-RFC1918 OUTPUT packets
for RFC1918 in 10.0.0.0/8 172.16.0.0/12 192.168.0.0/16 127.0.0.0/8; do
  iptables -t mangle -A OUTPUT -d "$RFC1918" -j RETURN
done
iptables -t mangle -A OUTPUT -j MARK --set-mark 100

Traffic covered by the ISP routing mark

Protocol Port Purpose
TCP 80 HTTP downloads
TCP 443 HTTPS/HTTP2 downloads
UDP 443 HTTP/3 QUIC downloads
UDP 53 DNS queries to 8.8.8.8 / 208.67.222.222
ICMP Health monitor pings

Gateway Discovery for Health Monitor

The initContainer writes the discovered ISP gateway to a shared emptyDir volume:

echo "$ISP_GW" > /var/isp-config/gateway.txt
echo "$ISP_IP" > /var/isp-config/isp-ip.txt

The main container reads gateway.txt at startup to begin pinging the ISP gateway.


8. Automation Pipeline

Source: cloner/src/index.ts, cloner/src/controller-client.ts

The cloner operates as a worker agent in a producer-consumer queue backed by PostgreSQL.

Startup Sequence

1. Start HTTP server on :8081  (healthz + metrics + file serving)
2. Start internet health monitor  (ping loop every 10 s)
3. Register with Dashboard  (POST /api/clone/agents/register, with retries)
4. Start heartbeat loop  (POST /api/clone/agents/{id}/heartbeat every 15 s)
5. Start poll loop  (GET /api/clone/agents/{id}/job every 10 s)

Registration retries with a growing backoff (5 s × attempt, capped at 60 s), handling the case where the Dashboard is not yet ready when the cloner pod starts.
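
A condensed sketch of that retry loop (the real implementation lives in cloner/src/controller-client.ts; the response shape and payload fields here are assumptions based on the API and schema described elsewhere in this document):

// Illustrative registration retry: backoff of 5 s × attempt, capped at 60 s.
async function registerWithRetry(baseUrl: string, token: string): Promise<string> {
  for (let attempt = 1; ; attempt++) {
    try {
      const res = await fetch(`${baseUrl}/api/clone/agents/register`, {
        method: 'POST',
        headers: { 'content-type': 'application/json', authorization: `Bearer ${token}` },
        body: JSON.stringify({ hostname: process.env.HOSTNAME, version: '3.6.0' }),
      });
      if (res.ok) {
        const { id } = (await res.json()) as { id: string };   // agent id used by later calls
        return id;
      }
    } catch {
      // Dashboard not reachable yet — fall through to the backoff below
    }
    await new Promise((r) => setTimeout(r, Math.min(5_000 * attempt, 60_000)));
  }
}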

Job Claiming (SKIP LOCKED)

Jobs are claimed atomically via a PostgreSQL SELECT ... FOR UPDATE SKIP LOCKED pattern. This ensures that if multiple Cloner replicas are running simultaneously, each job is processed by exactly one agent:

UPDATE clone_jobs
SET status = 'running', agent_id = $1, started_at = now()
WHERE id = (
  SELECT id FROM clone_jobs
  WHERE status = 'pending'
  ORDER BY created_at ASC
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
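
On the Dashboard side, the job-poll endpoint can execute that claim in a single statement; a hedged sketch using node-postgres (column names from the clone_jobs table in §10, everything else illustrative):

import { Pool } from 'pg';

const pool = new Pool();   // connection settings from the usual PG* environment variables

// Atomically claim the oldest pending job for the polling agent, or return null if none.
async function claimNextJob(agentId: string) {
  const { rows } = await pool.query(
    `UPDATE clone_jobs
     SET status = 'running', agent_id = $1, started_at = now()
     WHERE id = (
       SELECT id FROM clone_jobs
       WHERE status = 'pending'
       ORDER BY created_at ASC
       LIMIT 1
       FOR UPDATE SKIP LOCKED
     )
     RETURNING id, url, persona_name`,
    [agentId],
  );
  return rows[0] ?? null;
}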

Job Lifecycle

Dashboard operator creates job
         │
         ▼
clone_jobs.status = 'pending'
         │
         ▼  (cloner polls, claims with SKIP LOCKED)
clone_jobs.status = 'running'
clone_agents.status = 'running'
         │
         ├── Success ──▶ status = 'completed', asset_count, total_bytes recorded
         └── Failure ──▶ status = 'failed', error_message recorded
                          clone_agents.status = 'idle'
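
On completion, the cloner reports the outcome by PATCHing the job. A sketch of that call — field names mirror the clone_jobs columns, but the exact request body is defined by the Dashboard API and is an assumption here:

// Illustrative completion report for PATCH /api/clone/jobs/{id}.
async function reportResult(
  baseUrl: string,
  token: string,
  jobId: string,
  result: { ok: boolean; assetCount?: number; totalBytes?: number; error?: string },
): Promise<void> {
  await fetch(`${baseUrl}/api/clone/jobs/${jobId}`, {
    method: 'PATCH',
    headers: { 'content-type': 'application/json', authorization: `Bearer ${token}` },
    body: JSON.stringify(
      result.ok
        ? { status: 'completed', asset_count: result.assetCount, total_bytes: result.totalBytes }
        : { status: 'failed', error_message: result.error },
    ),
  });
}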

Sequential Execution

The poll loop is guarded by a busy flag — only one job runs at a time per cloner instance. This prevents concurrent Chromium sessions from competing for the same ISP bandwidth and ensures deterministic behavior:

let busy = false;
const pollTimer = setInterval(async () => {
  if (busy) return;
  busy = true;
  try {
    // ... poll for a job and clone the site ...
  } finally {
    busy = false;   // released even if the clone throws
  }
}, config.pollIntervalMs);

9. Health Monitoring

Source: cloner/src/health-monitor.ts

The health monitor runs a parallel ping loop every 10 seconds, tracking connectivity on three targets:

Target Discovery Purpose
8.8.8.8 Fixed Google Public DNS — confirms ISP internet access
1.1.1.1 Fixed Cloudflare Public DNS — redundant internet check
ISP gateway From /var/isp-config/gateway.txt Confirms L3 reachability to upstream router

Gateway Discovery Fallback Chain

If the /var/isp-config/gateway.txt file is not present (e.g., in Docker Compose mode without the initContainer), the monitor falls back through the chain below (sketched in code after the list):

  1. ip -4 route show dev net1 default — ISP-specific default route
  2. ip -4 route show dev eth1 default — alternate ISP interface name
  3. ip -4 route show default — container's general default gateway
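
A compact sketch of that chain, assuming the monitor shells out to ip as the commands above suggest (function name is illustrative):

import { execFile } from 'node:child_process';
import { promisify } from 'node:util';
import { readFile } from 'node:fs/promises';

const run = promisify(execFile);

// Illustrative gateway discovery: the shared file first, then the three `ip route` fallbacks.
async function discoverGateway(): Promise<string | null> {
  try {
    return (await readFile('/var/isp-config/gateway.txt', 'utf8')).trim();
  } catch { /* file absent — fall through to route inspection */ }
  const candidates = [
    ['-4', 'route', 'show', 'dev', 'net1', 'default'],
    ['-4', 'route', 'show', 'dev', 'eth1', 'default'],
    ['-4', 'route', 'show', 'default'],
  ];
  for (const args of candidates) {
    try {
      const { stdout } = await run('ip', args);
      const m = stdout.match(/via (\S+)/);   // "default via <gw> ..."
      if (m) return m[1];
    } catch { /* command failed — try the next one */ }
  }
  return null;
}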

Ping Implementation

Uses ICMP via the system ping command (ping -c 1 -W 3 <target>) and parses the RTT from the output. The check is non-blocking and runs all three targets concurrently.
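
Roughly, and assuming iputils/BusyBox-style output with a time=… field (names are illustrative, not the exact health-monitor.ts code):

import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const run = promisify(execFile);

// One probe: RTT in milliseconds on success, null on timeout or failure.
async function pingOnce(target: string): Promise<number | null> {
  try {
    const { stdout } = await run('ping', ['-c', '1', '-W', '3', target]);
    const m = stdout.match(/time[=<]([\d.]+)\s*ms/);
    return m ? parseFloat(m[1]) : null;
  } catch {
    return null;   // non-zero exit code = target unreachable
  }
}

// All three targets probed concurrently, as described above.
const gateway = await discoverGateway();   // from the fallback sketch in the previous subsection
const targets = ['8.8.8.8', '1.1.1.1', ...(gateway ? [gateway] : [])];
const rtts = await Promise.all(targets.map(pingOnce));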


10. Database Schema

Migrations: dashboard/src/db/migrations/0017_clone_stack.sql, 0018_clone_persona_slots.sql

clone_agents

Tracks registered cloner pods. Each pod registers on startup and updates last_seen_at via heartbeat.

Column Type Description
id text PK Unique agent identifier (cloner-{hostname}-{random})
hostname text Pod hostname
pod_ip text eth0 IP of the pod
version text Cloner image version
status enum unknown / idle / running / error / offline
last_seen_at timestamptz Last heartbeat timestamp

clone_jobs

The job queue. Each row is one website clone request.

Column Type Description
id uuid PK Job identifier
url text Target public URL to clone
persona_name text FK → personas Destination persona slot name
status enum pending / running / completed / failed / cancelled
agent_id text FK → clone_agents Which cloner is/was executing
asset_count integer Number of resources captured
total_bytes bigint Total bytes written to disk
error_message text Failure description if status=failed
started_at timestamptz When the cloner began executing
completed_at timestamptz When the job finished

clone_persona_slots

Tracks the 10 Cloned Persona webserver slots. Pre-populated with 10 rows on migration.

Column Type Description
slot_id integer PK (1–10) Slot number
site_name text Site currently assigned (NULL = inactive)
replicas integer Desired Caddy replicas (0 = stopped)
status enum inactive / starting / active / error
vlan_id integer (generated) Always 200 + slot_id - 1
subnet text (generated) Always 10.2.{slot_id}.0/27
namespace text (generated) Always clone-persona-{slot_id}

11. Content Delivery to Cloned Personas

Once cloning completes, the content lives in the cloned-sites PVC under /mnt/cloned/{site_name}/. The Cloned Persona webservers (10 slots, VLANs 200–209) mount this PVC read-only and serve the content directly.

Slot Activation Flow

Operator: PATCH /api/clone/persona-slots/3
          { "siteName": "example-shop", "replicas": 1 }
                    │
                    ▼
      Dashboard:  1. Update DB: clone_persona_slots row 3
                                site_name = 'example-shop'
                                replicas = 1
                                status = 'starting'
                  2. K8s API: patch ConfigMap clone-persona-3-config
                              SITE_NAME = "example-shop"
                  3. K8s API: scale Deployment caddy (clone-persona-3) → 1
                    │
                    ▼
  Stakater Reloader: detects ConfigMap change → triggers rolling restart
                    │
                    ▼
      Caddy pod:   SITE_NAME=example-shop
                   root * /mnt/cloned/example-shop
                   file_server { precompressed gzip }
                    │
                    ▼
Test agents access:  https://10.2.3.{x}/   →   NGFW   →   Caddy   →   /mnt/cloned/example-shop/
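
An operator or script can drive this with a single request; a minimal TypeScript example against the endpoint shown above (authentication omitted — the Dashboard may require a token):

// Assign the 'example-shop' mirror to slot 3 and start one Caddy replica.
const res = await fetch('http://dashboard:3000/api/clone/persona-slots/3', {
  method: 'PATCH',
  headers: { 'content-type': 'application/json' },
  body: JSON.stringify({ siteName: 'example-shop', replicas: 1 }),
});
if (!res.ok) throw new Error(`slot activation failed: ${res.status}`);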

No Reverse Proxy

Cloned Persona pods do not proxy to any external service. Content is served directly from the PVC via Caddy's file_server directive. This means:

  • Zero OOBI traffic from Cloned Persona pods during test execution
  • No intermediate hop that could mask or alter the content the NGFW inspects
  • No dependency on the Cloner pod being healthy during test runs — the PVC is independent

12. Observability

Prometheus Metrics (:8081/metrics)

Metric Type Description
cloner_internet_up{target} gauge 1 if last ICMP ping to 8.8.8.8 or 1.1.1.1 succeeded
cloner_internet_any_up gauge 1 if at least one internet target is reachable
cloner_gateway_up{gateway} gauge 1 if DHCP gateway responds to ICMP
cloner_ping_rtt_ms{target} gauge Round-trip time of last successful ping (ms)
cloner_ping_checks_total{target,result} counter Cumulative ping checks by result (ok / fail)
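
The endpoint serves plain Prometheus text exposition format; a sketch of how the gauges above might be rendered without a client library (the real health-monitor.ts also maintains the cloner_ping_checks_total counters):

// Illustrative rendering of the health gauges in Prometheus text format.
interface PingState { target: string; up: boolean; rttMs: number | null }

function renderMetrics(internet: PingState[], gateway: PingState | null): string {
  const lines: string[] = [];
  for (const s of internet) {
    lines.push(`cloner_internet_up{target="${s.target}"} ${s.up ? 1 : 0}`);
    if (s.rttMs !== null) lines.push(`cloner_ping_rtt_ms{target="${s.target}"} ${s.rttMs}`);
  }
  lines.push(`cloner_internet_any_up ${internet.some((s) => s.up) ? 1 : 0}`);
  if (gateway) {
    lines.push(`cloner_gateway_up{gateway="${gateway.target}"} ${gateway.up ? 1 : 0}`);
    if (gateway.rttMs !== null) lines.push(`cloner_ping_rtt_ms{target="${gateway.target}"} ${gateway.rttMs}`);
  }
  return lines.join('\n') + '\n';
}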

ServiceMonitor

A Prometheus Operator ServiceMonitor (k8s/71-service-monitors.yaml) scrapes :8081/metrics every 15 seconds. No relabeling is needed — metrics are already cloner-specific.

Grafana Dashboard

The "Cloner — Internet Health & Jobs" row in observability/grafana/dashboards/web-agent-cluster.json contains:

Panel Query
Acesso Internet (ISP) cloner_internet_any_up
Ping 8.8.8.8 cloner_internet_up{target="8.8.8.8"}
Ping 1.1.1.1 cloner_internet_up{target="1.1.1.1"}
Gateway ISP cloner_gateway_up
Ping RTT (ms) cloner_ping_rtt_ms time series
Ping RTT Gateway ISP cloner_ping_rtt_ms{target=~"gateway.*"} time series

13. Configuration Reference

ConfigMap cloner-config

Variable Default Description
CONTROLLER_URL http://dashboard:3000 Dashboard base URL
CLONE_STORAGE_PATH /mnt/cloned Root directory for cloned site mirrors
SERVE_PORT 8081 Port for HTTP server (healthz, metrics)
POLL_INTERVAL_MS 10000 How often to poll Dashboard for new jobs (ms)
HEARTBEAT_INTERVAL_MS 15000 How often to send heartbeat to Dashboard (ms)

Secret cloner-secrets

Variable Description
CONTROLLER_TOKEN Bearer token matching AGENT_API_TOKEN in Dashboard secrets
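
cloner/src/config.ts consolidates the ConfigMap and Secret values above at startup; a minimal sketch of that pattern (property names here are illustrative):

// Illustrative config loader mirroring the variables documented above.
export const config = {
  controllerUrl: process.env.CONTROLLER_URL ?? 'http://dashboard:3000',
  controllerToken: process.env.CONTROLLER_TOKEN ?? '',              // from Secret cloner-secrets
  cloneStoragePath: process.env.CLONE_STORAGE_PATH ?? '/mnt/cloned',
  servePort: Number(process.env.SERVE_PORT ?? 8081),
  pollIntervalMs: Number(process.env.POLL_INTERVAL_MS ?? 10_000),
  heartbeatIntervalMs: Number(process.env.HEARTBEAT_INTERVAL_MS ?? 15_000),
};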

Resource Limits

Resource Request Limit
CPU 500m 2000m
Memory 1 Gi 4 Gi
Ephemeral storage 512 Mi 2 Gi

The 4 Gi memory limit accommodates Chromium's memory usage during page loading. Sites with heavy JavaScript frameworks can use 1–2 Gi during initial render.

Storage — in-cluster NFS (cloned-sites)

The cloned-sites volume is shared between the Cloner (writer) and the 10 Clone Persona slot pods (readers). Because PVCs are namespace-scoped and the slots live in 10 separate namespaces, the share is provisioned as 11 static PV/PVC pairs all pointing at the same NFS export served by an in-cluster NFS server.

Component Manifest Purpose
NFS server Deployment + Service k8s/dut/35-nfs-server.yaml Single-replica NFS server. nodeAffinity prefers role=infra; falls through on single-node. Backed by hostPath /var/lib/agent-cluster/cloned-sites. Recreate strategy keeps a single writer. A prepare-share initContainer chowns the export to UID 10004 once before the daemon boots.
Static PVs + PVCs k8s/dut/36-cloned-sites-pvs.yaml 1 RWX writer pair (web-agents/cloned-sites) + 10 ROX reader pairs (clone-persona-N/cloned-sites). All point to the same NFS export root. storageClassName: "" prevents CSI provisioners from racing to claim them.

Topology (multi-node):

  • Cloner pod runs on role=infra (UCS-4) — same node as the NFS server.
  • Clone-persona slot pods run on role=ngfw-dut (UCS-1).
  • All NFS traffic flows over the OOBI control plane (eth0 → ClusterIP Service nfs-server.web-agents.svc.cluster.local:2049). The data plane (net1 macvlan, VLAN 40 ISP and VLANs 200–209 slot subnets) carries no storage I/O — it stays exclusive for agent ↔ NGFW ↔ persona traffic so timing measurements remain uncontaminated.

Topology (single-node):

  • Everything lands on the only node. The NFS server's hostPath stays local; clients still mount via NFS, but the round-trip is loopback — overhead is negligible.

Attribute Value
Backend NFSv4 (in-cluster Service)
Capacity 50 Gi (matches PV claim)
Access mode (writer) ReadWriteMany
Access mode (readers, 10 slots) ReadOnlyMany
Mount (Cloner) /mnt/cloned (read-write)
Mount (Cloned Personas) /mnt/cloned (read-only)
Network plane OOBI only (eth0) — not the macvlan data plane
Multi-node co-location req Cloner + NFS server on same node (role=infra)

14. Key Files

File Description
cloner/src/index.ts Main loop: startup, registration, heartbeat, poll
cloner/src/cloner.ts Playwright-based cloning engine with stealth and URL rewriting
cloner/src/health-monitor.ts ICMP ping loop and Prometheus metric emission
cloner/src/controller-client.ts Dashboard API client (register, heartbeat, poll, complete)
cloner/src/config.ts Environment variable configuration loading
cloner/src/server.ts HTTP server: /healthz, /metrics
k8s/80-cloner-nad.yaml Multus NAD for VLAN 40 ISP macvlan interface
k8s/81-cloner-deployment.yaml Deployment, initContainer, routing script, Service
k8s/82-cloner-network-policy.yaml NetworkPolicy: egress TCP 80/443, UDP 443/53, ICMP
k8s/dut/35-nfs-server.yaml In-cluster NFS server backing the shared cloned-sites volume
k8s/dut/36-cloned-sites-pvs.yaml 11 static PV/PVC pairs (1 writer + 10 slot readers)
k8s/dut/30-cni-dhcp-daemon.yaml CNI DHCP daemon DaemonSet — required for the cloner's VLAN 40 IPAM dhcp
k8s/clone-personas/ 10 Cloned Persona slots that serve the captured content
dashboard/src/db/migrations/0017_clone_stack.sql clone_agents + clone_jobs tables
dashboard/src/db/migrations/0018_clone_persona_slots.sql clone_persona_slots table
dashboard/src/app/api/clone/ Dashboard REST API for cloner orchestration