Syslog operations — deep operations guide

Scope status (post-Scope-Freeze 2026-05-10) — see ARCHITECTURE.md for the canonical 37 modules + 7 Test Kinds + DOM/CPOS/PIE-PA safety architecture. ADRs 0014, 0019-0025 cover post-Freeze additions.

Audience: Operators running TLSStress.Art who need to investigate run failures, correlate metric anomalies with device events, or extend the syslog pipeline.

This document complements SYSLOG_CORRELATION.md (introduction + setup). Read that first. This file is the deep operations reference — query cookbook, vendor-specific advanced patterns, performance budgets, security model, troubleshooting recipes.

Mandatory prerequisite — syslog over OOBI only

Syslog traffic from lab elements (Nexus 9000, NGFW DUT, UCS hosts) MUST travel over the OOBI (out-of-band) management network — never over the data-plane VLANs.

Why:

- The data plane carries the test traffic (Synthetic personas 10.1.0.0/16, Cloned personas 10.2.0.0/16, browser-engine agents 172.16.0.0/16, synthetic-load agents 172.17.0.0/16). Syslog packets on those VLANs would be measured BY the test bed as if they were test traffic — invalidating per-cycle metrics.
- The OOBI subnet (default 192.168.90.0/24, VLAN 99) is reserved for management traffic by design — Prometheus scrapes, SNMP, kubectl, and syslog all live here.
- The lab elements all have their management IPs on OOBI; their data-plane interfaces are downstream (toward the NGFW from the persona side, toward the agents from the test-bed side).

Enforcement:

- Two NetworkPolicy resources (`syslog-oobi-only` + `syslog-deny-data-plane`) enforce this at the cluster level — see platform/observability/syslog-network-policy.yaml
- The first allows syslog ingress only from the OOBI CIDR (192.168.90.0/24)
- The second explicitly denies syslog ingress from the data-plane CIDRs
- Belt-and-suspenders by design — if one policy is accidentally weakened, the other still stops contamination

Operator obligations:

1. When configuring a device's syslog destination, ALWAYS use the destination's OOBI IP address (not its data-plane IP if it has one)
2. When configuring source-interface on Cisco devices, ALWAYS specify mgmt0 / the management interface (a hedged NX-OS sketch follows this list)
3. Periodically run a verification query (in the cookbook below) to confirm no off-network syslog has slipped through
4. If the lab uses a non-default OOBI CIDR, override via a kustomize overlay — don't edit the base manifest
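
For obligations 1 and 2 on a Nexus, a minimal NX-OS sketch looks like the following — the receiver IP is a placeholder on the default OOBI subnet, the exact syntax varies by NX-OS release, and SYSLOG_CORRELATION.md remains the canonical reference:

! Point syslog at the receiver's OOBI address over the management VRF
logging server 192.168.90.10 5 use-vrf management
! Force the management interface as the syslog source
logging source-interface mgmt0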

Pipeline anatomy

                   OOBI network only (192.168.90.0/24)
   ┌────────────────────────────────────────────────────────────┐
   │                                                            │
┌──┴──────────┐  UDP/514   ┌─────────────────────┐             │
│ Nexus 9000  ├───────────▶│                     │             │
│ mgmt0       │            │  Promtail           │             │
└─────────────┘            │  syslog receiver    │             │
                           │                     │  Loki API   │
┌─────────────┐  UDP/514   │  - RFC 3164 + 5424  ├──────────▶┌─┴─────┐
│ NGFW DUT    ├───────────▶│  - port 1514 (pod)  │           │ Loki  │
│ mgmt iface  │  TCP/514   │  - port 30514 (NP)  │           └───┬───┘
└─────────────┘            │  - relabel pipeline │               │
                           │                     │           ┌───▼───┐
┌─────────────┐  UDP/514   │  Drops everything   │           │Grafana│
│ UCS host    ├───────────▶│  not from OOBI      │           └───────┘
│ rsyslog     │            │  (NetworkPolicy)    │
└─────────────┘            └─────────────────────┘

Loki query cookbook

Find every event from one device, last 30 minutes

{app="tlsstress", device_hostname="nexus-1.lab.example.com"}

Find warning+ events across all NGFW devices in the last hour

{app="tlsstress", device_type="ngfw", severity=~"warning|err|crit|alert|emerg"}
{app="tlsstress", device_type="ngfw"} |~ "(?i)(decrypt|ssl|tls)"
{app="tlsstress", device_type="ngfw"} | json | subtype="decryption"

Find Cisco Nexus port flaps

{app="tlsstress", device_type="nexus"}
  |~ "(?i)(port.*(flap|down|up))|interface.*state.*change"

Count events per minute by device, sorted descending

sort_desc(
  sum by (device_hostname) (
    count_over_time({app="tlsstress"}[1m])
  )
)

Find UCS kernel events (OOM, NIC reset, segfault)

{app="tlsstress", device_type="ucs", device_app="kernel"}
  |~ "(out of memory|oom-killer|segfault|nic.*reset|carrier (lost|down))"

Correlate with metric — show NGFW logs only when p99 latency > 500 ms

This is a two-store query — the metric side filters when to look, the log side returns the events:

{app="tlsstress", device_type="ngfw", severity=~"warning|err|crit"}
Open this query in Grafana Explore split-pane next to:
histogram_quantile(0.99, sum by (le) (rate(web_agent_request_duration_seconds_bucket[1m])))
Use the time-range selector to zoom into the latency spike. The log query will narrow to the same window.

Off-OOBI traffic detection (operator vigilance)

If everything is configured correctly, this query returns 0 rows. Non-zero rows indicate a device is sending syslog from a non-OOBI interface — typically because source-interface was not set:

{app="tlsstress"} | json | __syslog_message_remote_addr !~ "192\\.168\\.90\\."

If you see hits, find the offending device and check:

- Cisco: show logging | include source → confirm source-interface mgmt0 is set
- Palo Alto: GUI → Device → Setup → Services → "Service Route Configuration" — Syslog should use the management interface
- Fortinet: config log syslogd setting → confirm set source-ip <oobi-ip> is present

Per-vendor deep-dive — what to expect in the logs

Cisco Nexus 9000 (NX-OS)

Common message types to recognize:

| Pattern | Severity | Meaning |
|---------|----------|---------|
| %ETHPORT-5-IF_DOWN_* | notice | A port went down. Look for IF_DOWN_LINK_FAILURE (cable / NIC) vs IF_DOWN_ADMIN_DOWN (operator-initiated) |
| %ETH_PORT_CHANNEL-5-PORT_* | notice | Port-channel member up/down. Mid-run bring-up = potential traffic redistribution event |
| %MAC-3-MAC_FLAP_DETECTED | error | MAC moved between ports — typically indicates a loop or a host with multiple uplinks. Test bed degraded until resolved |
| %MTS_NOTIFICATION_AGENT-3-* | error | Internal NX-OS message-bus events; rare but indicate platform stress |
| %QUEUING-2-DROP | critical | QoS queue dropping traffic. Run is invalid for any throughput claim |
| %STM-2-LIMIT_REACHED | critical | A platform limit (route, MAC, ARP) reached. Reset cluster if persistent |
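
A quick query that surfaces the error-and-above patterns from this table in one pass (labels follow the scheme used in the cookbook above):

{app="tlsstress", device_type="nexus", severity=~"err|crit|alert|emerg"}
  |~ "MAC_FLAP_DETECTED|QUEUING-2-DROP|STM-2-LIMIT_REACHED"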

Recommended NX-OS logging level: 5 (notification) for general categories, 4 (warning) for mts and monitor. See SYSLOG_CORRELATION.md for the exact config snippet.

Cisco FTD / ASA / Firepower

| Pattern | Severity | Meaning |
|---------|----------|---------|
| %ASA-3-302014 | error | Connection teardown — for a TLS test bed, look for the Reason: field. tcp reset by app-id often means decrypt-engine intervention |
| %ASA-6-725001 | info | TLS handshake started |
| %ASA-6-725002 | info | TLS handshake completed successfully |
| %ASA-6-725003 | info | TLS session resumed (session ticket / ID hit) — interesting for plans that demand fresh handshakes |
| %ASA-3-725007 | error | TLS handshake failed. Cross-check with Reason: field |
| %ASA-6-302013 | info | Built TCP connection — useful for connection-rate analysis |
| %ASA-4-411001 | warning | Interface state change |
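
One way to turn the handshake message IDs above into a trend is a failure-to-success ratio per minute — a sketch; the labels are assumed from the relabel pipeline described earlier:

sum(count_over_time({app="tlsstress", device_type="ngfw"} |= "725007" [1m]))
  /
sum(count_over_time({app="tlsstress", device_type="ngfw"} |= "725002" [1m]))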

Recommended logging classes: crypto 5, ssl 5, connection 4. Logging the connection class at severity 6 generates too many events for the lab to ingest comfortably.

Palo Alto (PAN-OS)

PAN-OS logs are structured (CSV-ish) by default; the Loki receiver parses RFC 5424 wrapping but the body remains comma-separated. Use | pattern or | logfmt in Loki to extract fields.

| Field signal | Meaning |
|--------------|---------|
| subtype="decryption" + category="ssl-protocol-error" | Decrypt failed — root cause in info field (typically cipher mismatch, unsupported version, cert chain) |
| subtype="decryption" + category="successful" | Successful decrypt — useful for confirming policy actually fires |
| subtype="ssl-error" | Specifically TLS-side error from the firewall's perspective |
| subtype="threat" + category="vulnerability" | IPS hit during the test — may legitimately interrupt connections |

For the TLS test bed, the most actionable subtypes are decryption (always) and traffic for action=block events.
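
As a concrete use of the pattern parser, the sketch below keeps the cheap line filter first, then extracts the type/subtype columns. It assumes the default PAN-OS field order (type in column 4, subtype in column 5) and the generic device_type="ngfw" label — verify both against your deployment:

{app="tlsstress", device_type="ngfw"} |~ "ssl-protocol-error"
  | pattern `<_>,<_>,<_>,<type>,<subtype>,<rest>`
  | subtype = "decryption"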

Fortinet (FortiOS)

FortiOS uses key-value pairs. Loki's logfmt parser extracts these natively.

| Pattern | Severity | Meaning |
|---------|----------|---------|
| logid=0103040045 | warning | SSL handshake error — reason= field gives detail |
| logid=0103043200 | info | SSL session established |
| logid=0103040040 | error | TLS-related session ended abnormally |
| logid=0419016384 | warning | IPS engine flagged traffic |

Recommended FortiOS filter: set severity notification and set ssl enable (the latter specifically forwards SSL-policy events).
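
To pull just the handshake-error detail out of the key-value body (the logid and reason= field come from the table above; the labels are assumed):

{app="tlsstress", device_type="ngfw"} |= "0103040045"
  | logfmt
  | line_format "reason={{.reason}}"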

Ubuntu UCS host (rsyslog)

The most useful host-level signals during a test:

| device_app | What to watch for |
|------------|-------------------|
| kernel | OOM-killer, segfault, NIC errors, throttling — look for severity=err+ |
| chronyd | Clock-sync events — correlate with TIME_SYNC.md thresholds |
| kubelet | Pod evictions, image pull failures during run |
| containerd | Container restarts, OCI runtime errors |
| iptables | Packet drops at the host firewall (uncommon but a fast diagnosis path) |
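
A companion to the kernel query in the cookbook, covering the kubelet/containerd rows above:

{app="tlsstress", device_type="ucs", device_app=~"kubelet|containerd"}
  |~ "(?i)(evict|oom|back-off|failed to pull|restart)"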

Performance + retention

Event-rate budget

The Promtail receiver is configured with no rate-limit by default. Empirical capacity on a stock K3s node:

| Configuration | Sustainable events/sec |
|---------------|------------------------|
| Single-node, default resource limits (200m CPU / 256 Mi mem) | ~5,000 |
| Single-node, bumped to 1 CPU / 1 Gi | ~25,000 |
| Multi-node with Promtail per node | ~5,000 per node |

For a single 30-min run on a fully loaded test bed, expect ~50,000–200,000 total events. Within budget for the default config.

If sustained events/sec exceeds budget, Promtail backpressures and the device's syslog client either buffers (Cisco IOS), drops (UDP), or holds connections (TCP). UDP drops are silent — TCP overflow logs an error on the device.
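
To check the live receive rate against this budget, a quick PromQL sketch — the metric name is assumed from Promtail's standard syslog-target metrics; confirm it against your Promtail version:

sum(rate(promtail_syslog_target_entries_total[5m]))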

Loki retention

Default retention in this stack: 15 days. Configure via the Loki Helm values when you set up the observability stack.

For runs whose results need permanent retention beyond 15 days, two options:

1. Export logs at run completion via the Loki API → archive to your own object storage (a hedged logcli sketch follows this list)
2. Wait for Test Run Report Phase 4 (planned) — selected log excerpts will be embedded in the signed PDF for permanent forensic record
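
A minimal export sketch using logcli against the in-cluster Loki API — the service address, time window, and limit are placeholders to adapt per run:

logcli query '{app="tlsstress"}' \
  --addr=http://loki.observability.svc:3100 \
  --from="2026-05-10T14:00:00Z" --to="2026-05-10T14:30:00Z" \
  --limit=200000 --output=jsonl > run-2026-05-10.jsonl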

Storage sizing

Average log line size ~250 bytes including labels. 50,000 events/run × 250 bytes ≈ 12 MB/run. With 15-day retention and ~5 runs/day, expect ~900 MB / 15 days. Loki's chunked storage compresses this further (typically 4-6× compression).

Security model

What the syslog stack DOES

  • Receives syslog over UDP/TCP 514 from the OOBI network only
  • Forwards to Loki with structured labels for query
  • Indexes labels (small, fast); does NOT index full message bodies (large, slow)
  • Carries the same license/audience labels as Prometheus, so license-aware downstream tooling sees the same provenance

What the syslog stack does NOT do

  • Does NOT authenticate the sender (UDP and unauthenticated TCP). Anyone on the OOBI network can send events. Mitigation: NetworkPolicy restricts to OOBI; OOBI itself should be access-controlled at the switch level
  • Does NOT encrypt syslog in transit. RFC 5425 (syslog-over-TLS) is supported by Promtail but not enabled by default — enabling it requires provisioning shared certificates. If your compliance demands encryption, file an issue and we'll prioritize the RFC 5425 config
  • Does NOT redact sensitive content. Many devices log session details, ARP tables, etc. that are not strictly secret but may exceed your operator audience policy. Audit before exporting Loki data outside the lab
  • Does NOT integrate with a SIEM. This is a TEST FORENSICS layer, not a security operations one

Threat model

| Threat | Mitigation |
|--------|------------|
| Attacker on OOBI sends fake syslog | OOBI access is the attack surface — outside scope of this product |
| Compromised device sends malicious syslog | NetworkPolicy + label-based dashboards limit blast radius to the log store; no exec from log content |
| Sensitive data accidentally logged | Operator-side problem — review device logging classes in SYSLOG_CORRELATION.md before enabling |
| Off-OOBI source bypasses NetworkPolicy | NetworkPolicy is enforced by the CNI; if your CNI does not implement NetworkPolicy (some bare-metal CNIs), test with kubectl describe networkpolicy and confirm the policy reports as enforced |

Audit trail

Every syslog event in Loki carries:

- The originating device hostname (RFC 5424 field, validated against the device's NTP-sync state at receive time — but only loosely)
- The receive time (Promtail-side, set from the cluster clock)
- The labels, including app=tlsstress, license, audience

If the source clock is wrong by more than a few seconds (per TIME_SYNC.md), the event timestamps are misleading. Cross-reference the time-sync alerts before treating syslog timestamps as forensic.

Operational runbooks

"I see a p99 spike at 14:32"

1. Open: TLSStress.Art — Syslog Correlation (Lab Elements) dashboard
2. Set time range to 14:30–14:35
3. Look at the side-by-side panel — note any log volume spike
4. Click into the live syslog stream panel
5. Look for warning+ events from any device in that window
6. Drill into the per-device deep-dive panel for the noisy device
7. If NGFW shows decryption errors → check TLS Decrypt Probe state
   (TLSStress.Art — TLS Decrypt Mode dashboard)
8. If Nexus shows port flaps → check Topology Correlation dashboard
9. If UCS shows kernel/OOM → check Test-Bed Infrastructure Health

"I want to confirm no syslog is leaking off-OOBI"

1. Run the off-OOBI detection query in Grafana Explore (cookbook above)
2. If any rows return:
   a. Identify the device by __syslog_message_hostname
   b. Run "show logging | include source" (Cisco) or equivalent
   c. Reconfigure source-interface to mgmt0
3. Re-run the query — should return 0 rows

"Promtail is dropping events"

Symptoms: spikes in promtail_syslog_target_* metrics OR Loki query returns fewer events than expected.

1. kubectl logs -n observability deploy/promtail-syslog --tail=100
   Look for: "syslog target full" / "channel buffer full"
2. If buffer full:
   - Increase Promtail resource limits (default 200m CPU / 256Mi mem)
   - Or shard: run Promtail per-node with stable hash on device_hostname
3. If a specific device floods:
   - Check the device's logging level (lower from 7=debug to 5=notice)
   - Add a `drop` action in the relabel pipeline for that hostname (a hedged Promtail sketch follows)
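
Promtail supports the drop either as a relabel rule or as a pipeline match stage. A minimal pipeline-stage sketch — the hostname is an example, and it assumes device_hostname is already a label at this point in the config:

pipeline_stages:
  # Drop every entry from the flooding host before it reaches Loki
  - match:
      selector: '{device_hostname="nexus-1.lab.example.com"}'
      action: drop
      drop_counter_reason: flooding_device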

"Loki query is slow"

1. Query MUST start with a label matcher (e.g. {app="tlsstress"})
2. Avoid free-text grep on multi-day ranges — use device_type/severity to narrow
3. Use logfmt/json parsers AFTER label filtering, not before
4. Time range > 7 days requires the chunk-cache to be hot;
   first-query-of-the-day is slow, subsequent queries are fast

Future enhancements

Tracked for future iteration:

  • RFC 5425 syslog-over-TLS (encryption) — operator-controlled cert generation + rotation of the cert with cert-manager
  • Authenticated syslog via shared key (Cisco IOS supports this via logging server <ip> mac) — pairs with the relay model
  • Selective redaction at ingest (e.g. mask credit-card-shape strings before they hit Loki)
  • LLDP/CDP neighbor discovery → auto-generate device_role labels (planned, separate module)
  • Test Run Report Phase 4 — embed selected log excerpts in the signed PDF