Syslog operations — deep operations guide

Scope status (post-Scope-Freeze 2026-05-10) — see ARCHITECTURE.md for the canonical 37 modules + 7 Test Kinds + DOM/CPOS/PIE-PA safety architecture. ADRs 0014, 0019-0025 cover post-Freeze additions.

Audience: Operators running TLSStress.Art who need to investigate run failures, correlate metric anomalies with device events, or extend the syslog pipeline.

This document complements SYSLOG_CORRELATION.md (introduction + setup). Read that first. This file is the deep operations reference — query cookbook, vendor-specific advanced patterns, performance budgets, security model, troubleshooting recipes.

Mandatory prerequisite — syslog over OOBI only

Syslog traffic from lab elements (Nexus 9000, NGFW DUT, UCS hosts) MUST travel over the OOBI (out-of-band) management network — never over the data-plane VLANs.

Why:

- The data plane carries the test traffic (Synthetic personas 10.1.0.0/16, Cloned personas 10.2.0.0/16, browser-engine agents 172.16.0.0/16, synthetic-load agents 172.17.0.0/16). Syslog packets on those VLANs would be measured BY the test bed as if they were test traffic — invalidating per-cycle metrics.
- The OOBI subnet (default 192.168.90.0/24, VLAN 99) is reserved for management traffic by design — Prometheus scrapes, SNMP, kubectl, and syslog all live here.
- The lab elements all have their management IPs on OOBI; their data-plane interfaces are downstream (toward the NGFW from the persona side, toward the agents from the test-bed side).

Enforcement:

- Two NetworkPolicy resources (`syslog-oobi-only` + `syslog-deny-data-plane`) enforce this at the cluster level — see platform/observability/syslog-network-policy.yaml
- The first allows syslog ingress only from the OOBI CIDR (192.168.90.0/24)
- The second explicitly denies syslog ingress from the data-plane CIDRs
- Belt-and-suspenders by design — if one policy is accidentally weakened, the other still stops contamination

Operator obligations:

1. When configuring a device's syslog destination, ALWAYS use the destination's OOBI IP address (not its data-plane IP if it has one)
2. When configuring source-interface on Cisco devices, ALWAYS specify mgmt0 / the management interface (a hedged NX-OS sketch follows this list)
3. Periodically run a verification query (in the cookbook below) to confirm no off-network syslog has slipped through
4. If the lab uses a non-default OOBI CIDR, override via a kustomize overlay — don't edit the base manifest
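
For obligations 1 and 2 on a Nexus, a minimal NX-OS sketch looks like the following — the receiver IP is a placeholder on the default OOBI subnet, the exact syntax varies by NX-OS release, and SYSLOG_CORRELATION.md remains the canonical reference:

! Point syslog at the receiver's OOBI address over the management VRF
logging server 192.168.90.10 5 use-vrf management
! Force the management interface as the syslog source
logging source-interface mgmt0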

Pipeline anatomy

                   OOBI network only (192.168.90.0/24)
   ┌────────────────────────────────────────────────────────────┐
   │                                                            │
┌──┴──────────┐  UDP/514   ┌─────────────────────┐             │
│ Nexus 9000  ├───────────▶│                     │             │
│ mgmt0       │            │  Promtail           │             │
└─────────────┘            │  syslog receiver    │             │
                           │                     │  Loki API   │
┌─────────────┐  UDP/514   │  - RFC 3164 + 5424  ├──────────▶┌─┴─────┐
│ NGFW DUT    ├───────────▶│  - port 1514 (pod)  │           │ Loki  │
│ mgmt iface  │  TCP/514   │  - port 30514 (NP)  │           └───┬───┘
└─────────────┘            │  - relabel pipeline │               │
                           │                     │           ┌───▼───┐
┌─────────────┐  UDP/514   │  Drops everything   │           │Grafana│
│ UCS host    ├───────────▶│  not from OOBI      │           └───────┘
│ rsyslog     │            │  (NetworkPolicy)    │
└─────────────┘            └─────────────────────┘

Loki query cookbook

Find every event from one device, last 30 minutes

{app="tlsstress", device_hostname="nexus-1.lab.example.com"}

Find warning+ events across all NGFW devices in the last hour

{app="tlsstress", device_type="ngfw", severity=~"warning|err|crit|alert|emerg"}
{app="tlsstress", device_type="ngfw"} |~ "(?i)(decrypt|ssl|tls)"
{app="tlsstress", device_type="ngfw"} | json | subtype="decryption"

Find Cisco Nexus port flaps

{app="tlsstress", device_type="nexus"}
  |~ "(?i)(port.*(flap|down|up))|interface.*state.*change"

Count events per minute by device, sorted descending

sort_desc(
  sum by (device_hostname) (
    count_over_time({app="tlsstress"}[1m])
  )
)

Find UCS kernel events (OOM, NIC reset, segfault)

{app="tlsstress", device_type="ucs", device_app="kernel"}
  |~ "(out of memory|oom-killer|segfault|nic.*reset|carrier (lost|down))"

Correlate with metric — show NGFW logs only when p99 latency > 500 ms

This is a two-store query — the metric side filters when to look, the log side returns the events:

{app="tlsstress", device_type="ngfw", severity=~"warning|err|crit"}
Open this query in Grafana Explore split-pane next to:
histogram_quantile(0.99, sum by (le) (rate(web_agent_request_duration_seconds_bucket[1m])))
Use the time-range selector to zoom into the latency spike. The log query will narrow to the same window.

Off-OOBI traffic detection (operator vigilance)

If everything is configured correctly, this query returns 0 rows. Non-zero rows indicate a device is sending syslog from a non-OOBI interface — typically because source-interface was not set:

{app="tlsstress"} | json | __syslog_message_remote_addr !~ "192\\.168\\.90\\."

If you see hits, find the offending device and check:

- Cisco: show logging | include source → confirm source-interface mgmt0 is set
- Palo Alto: GUI → Device → Setup → Services → "Service Route Configuration" — Syslog should use the management interface
- Fortinet: config log syslogd setting → confirm set source-ip <oobi-ip> is present

Per-vendor deep-dive — what to expect in the logs

Cisco Nexus 9000 (NX-OS)

Common message types to recognize:

| Pattern | Severity | Meaning |
|---------|----------|---------|
| %ETHPORT-5-IF_DOWN_* | notice | A port went down. Look for IF_DOWN_LINK_FAILURE (cable / NIC) vs IF_DOWN_ADMIN_DOWN (operator-initiated) |
| %ETH_PORT_CHANNEL-5-PORT_* | notice | Port-channel member up/down. Mid-run bring-up = potential traffic redistribution event |
| %MAC-3-MAC_FLAP_DETECTED | error | MAC moved between ports — typically indicates a loop or a host with multiple uplinks. Test bed degraded until resolved |
| %MTS_NOTIFICATION_AGENT-3-* | error | Internal NX-OS message-bus events; rare but indicate platform stress |
| %QUEUING-2-DROP | critical | QoS queue dropping traffic. Run is invalid for any throughput claim |
| %STM-2-LIMIT_REACHED | critical | A platform limit (route, MAC, ARP) reached. Reset cluster if persistent |
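
A quick query that surfaces the error-and-above patterns from this table in one pass (labels follow the scheme used in the cookbook above):

{app="tlsstress", device_type="nexus", severity=~"err|crit|alert|emerg"}
  |~ "MAC_FLAP_DETECTED|QUEUING-2-DROP|STM-2-LIMIT_REACHED"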

Recommended NX-OS logging level: 5 (notification) for general categories, 4 (warning) for mts and monitor. See SYSLOG_CORRELATION.md for the exact config snippet.

Cisco FTD / ASA / Firepower

| Pattern | Severity | Meaning |
|---------|----------|---------|
| %ASA-3-302014 | error | Connection teardown — for a TLS test bed, look for the Reason: field. tcp reset by app-id often means decrypt-engine intervention |
| %ASA-6-725001 | info | TLS handshake started |
| %ASA-6-725002 | info | TLS handshake completed successfully |
| %ASA-6-725003 | info | TLS session resumed (session ticket / ID hit) — interesting for plans that demand fresh handshakes |
| %ASA-3-725007 | error | TLS handshake failed. Cross-check with Reason: field |
| %ASA-6-302013 | info | Built TCP connection — useful for connection-rate analysis |
| %ASA-4-411001 | warning | Interface state change |
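
One way to turn the handshake message IDs above into a trend is a failure-to-success ratio per minute — a sketch; the labels are assumed from the relabel pipeline described earlier:

sum(count_over_time({app="tlsstress", device_type="ngfw"} |= "725007" [1m]))
  /
sum(count_over_time({app="tlsstress", device_type="ngfw"} |= "725002" [1m]))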

Recommended logging classes: crypto 5, ssl 5, connection 4. Logging the connection class at severity 6 generates too many events for the lab to ingest comfortably.

Palo Alto (PAN-OS)

PAN-OS logs are structured (CSV-ish) by default; the Loki receiver parses RFC 5424 wrapping but the body remains comma-separated. Use | pattern or | logfmt in Loki to extract fields.

| Field signal | Meaning |
|--------------|---------|
| subtype="decryption" + category="ssl-protocol-error" | Decrypt failed — root cause in info field (typically cipher mismatch, unsupported version, cert chain) |
| subtype="decryption" + category="successful" | Successful decrypt — useful for confirming policy actually fires |
| subtype="ssl-error" | Specifically TLS-side error from the firewall's perspective |
| subtype="threat" + category="vulnerability" | IPS hit during the test — may legitimately interrupt connections |

For the TLS test bed, the most actionable subtypes are decryption (always) and traffic for action=block events.
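
As a concrete use of the pattern parser, the sketch below keeps the cheap line filter first, then extracts the type/subtype columns. It assumes the default PAN-OS field order (type in column 4, subtype in column 5) and the generic device_type="ngfw" label — verify both against your deployment:

{app="tlsstress", device_type="ngfw"} |~ "ssl-protocol-error"
  | pattern `<_>,<_>,<_>,<type>,<subtype>,<rest>`
  | subtype = "decryption"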

Fortinet (FortiOS)

FortiOS uses key-value pairs. Loki's logfmt parser extracts these natively.

| Pattern | Severity | Meaning |
|---------|----------|---------|
| logid=0103040045 | warning | SSL handshake error — reason= field gives detail |
| logid=0103043200 | info | SSL session established |
| logid=0103040040 | error | TLS-related session ended abnormally |
| logid=0419016384 | warning | IPS engine flagged traffic |

Recommended FortiOS filter: set severity notification and set ssl enable (the latter specifically forwards SSL-policy events).
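
To pull just the handshake-error detail out of the key-value body (the logid and reason= field come from the table above; the labels are assumed):

{app="tlsstress", device_type="ngfw"} |= "0103040045"
  | logfmt
  | line_format "reason={{.reason}}"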

Ubuntu UCS host (rsyslog)

The most useful host-level signals during a test:

| device_app | What to watch for |
|------------|-------------------|
| kernel | OOM-killer, segfault, NIC errors, throttling — look for severity=err+ |
| chronyd | Clock-sync events — correlate with TIME_SYNC.md thresholds |
| kubelet | Pod evictions, image pull failures during run |
| containerd | Container restarts, OCI runtime errors |
| iptables | Packet drops at the host firewall (uncommon but a fast diagnosis path) |
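
A companion to the kernel query in the cookbook, covering the kubelet/containerd rows above:

{app="tlsstress", device_type="ucs", device_app=~"kubelet|containerd"}
  |~ "(?i)(evict|oom|back-off|failed to pull|restart)"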

Performance + retention

Event-rate budget

The Promtail receiver is configured with no rate-limit by default. Empirical capacity on a stock K3s node:

| Configuration | Sustainable events/sec |
|---------------|------------------------|
| Single-node, default resource limits (200m CPU / 256 Mi mem) | ~5,000 |
| Single-node, bumped to 1 CPU / 1 Gi | ~25,000 |
| Multi-node with Promtail per node | ~5,000 per node |

For a single 30-min run on a fully loaded test bed, expect ~50,000–200,000 total events. Within budget for the default config.

If sustained events/sec exceeds budget, Promtail backpressures and the device's syslog client either buffers (Cisco IOS), drops (UDP), or holds connections (TCP). UDP drops are silent — TCP overflow logs an error on the device.
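
To check the live receive rate against this budget, a quick PromQL sketch — the metric name is assumed from Promtail's standard syslog-target metrics; confirm it against your Promtail version:

sum(rate(promtail_syslog_target_entries_total[5m]))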

Loki retention

Default retention in this stack: 15 days. Configure via the Loki Helm values when you set up the observability stack.

For runs whose results need permanent retention beyond 15 days, two options:

1. Export logs at run completion via the Loki API → archive to your own object storage (a hedged logcli sketch follows this list)
2. Wait for Test Run Report Phase 4 (planned) — selected log excerpts will be embedded in the signed PDF for permanent forensic record
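
A minimal export sketch using logcli against the in-cluster Loki API — the service address, time window, and limit are placeholders to adapt per run:

logcli query '{app="tlsstress"}' \
  --addr=http://loki.observability.svc:3100 \
  --from="2026-05-10T14:00:00Z" --to="2026-05-10T14:30:00Z" \
  --limit=200000 --output=jsonl > run-2026-05-10.jsonl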

Storage sizing

Average log line size ~250 bytes including labels. 50,000 events/run × 250 bytes ≈ 12 MB/run. With 15-day retention and ~5 runs/day, expect ~900 MB / 15 days. Loki's chunked storage compresses this further (typically 4-6× compression).

Security model

What the syslog stack DOES

  • Receives syslog over UDP/TCP 514 from the OOBI network only
  • Forwards to Loki with structured labels for query
  • Indexes labels (small, fast); does NOT index full message bodies (large, slow)
  • Carries the same license/audience labels as Prometheus, so license-aware downstream tooling sees the same provenance

What the syslog stack does NOT do

  • Does NOT authenticate the sender (UDP and unauthenticated TCP). Anyone on the OOBI network can send events. Mitigation: NetworkPolicy restricts to OOBI; OOBI itself should be access-controlled at the switch level
  • Does NOT encrypt syslog in transit. RFC 5425 (syslog-over-TLS) is supported by Promtail but not enabled by default — enabling it requires provisioning shared certificates. If your compliance demands encryption, file an issue and we'll prioritize the RFC 5425 config
  • Does NOT redact sensitive content. Many devices log session details, ARP tables, etc. that are not strictly secret but may exceed your operator audience policy. Audit before exporting Loki data outside the lab
  • Does NOT integrate with a SIEM. This is a TEST FORENSICS layer, not a security operations one

Threat model

| Threat | Mitigation |
|--------|------------|
| Attacker on OOBI sends fake syslog | OOBI access is the attack surface — outside scope of this product |
| Compromised device sends malicious syslog | NetworkPolicy + label-based dashboards limit blast radius to the log store; no exec from log content |
| Sensitive data accidentally logged | Operator-side problem — review device logging classes in SYSLOG_CORRELATION.md before enabling |
| Off-OOBI source bypasses NetworkPolicy | NetworkPolicy is enforced by the CNI; if your CNI does not implement NetworkPolicy (some bare-metal CNIs), test with kubectl describe networkpolicy and confirm the policy reports as enforced |

Audit trail

Every syslog event in Loki carries:

- The originating device hostname (RFC 5424 field, validated against the device's NTP-sync state at receive time — but only loosely)
- The receive time (Promtail-side, set from the cluster clock)
- The labels, including app=tlsstress, license, audience

If the source clock is wrong by more than a few seconds (per TIME_SYNC.md), the event timestamps are misleading. Cross-reference the time-sync alerts before treating syslog timestamps as forensic.

Operational runbooks

"I see a p99 spike at 14:32"

1. Open: TLSStress.Art — Syslog Correlation (Lab Elements) dashboard
2. Set time range to 14:30–14:35
3. Look at the side-by-side panel — note any log volume spike
4. Click into the live syslog stream panel
5. Look for warning+ events from any device in that window
6. Drill into the per-device deep-dive panel for the noisy device
7. If NGFW shows decryption errors → check TLS Decrypt Probe state
   (TLSStress.Art — TLS Decrypt Mode dashboard)
8. If Nexus shows port flaps → check Topology Correlation dashboard
9. If UCS shows kernel/OOM → check Test-Bed Infrastructure Health

"I want to confirm no syslog is leaking off-OOBI"

1. Run the off-OOBI detection query in Grafana Explore (cookbook above)
2. If any rows return:
   a. Identify the device by __syslog_message_hostname
   b. Run "show logging | include source" (Cisco) or equivalent
   c. Reconfigure source-interface to mgmt0
3. Re-run the query — should return 0 rows

"Promtail is dropping events"

Symptoms: spikes in promtail_syslog_target_* metrics OR Loki query returns fewer events than expected.

1. kubectl logs -n observability deploy/promtail-syslog --tail=100
   Look for: "syslog target full" / "channel buffer full"
2. If buffer full:
   - Increase Promtail resource limits (default 200m CPU / 256Mi mem)
   - Or shard: run Promtail per-node with stable hash on device_hostname
3. If a specific device floods:
   - Check the device's logging level (lower from 7=debug to 5=notice)
   - Add a `drop` action in the relabel pipeline for that hostname (a hedged Promtail sketch follows)
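
Promtail supports the drop either as a relabel rule or as a pipeline match stage. A minimal pipeline-stage sketch — the hostname is an example, and it assumes device_hostname is already a label at this point in the config:

pipeline_stages:
  # Drop every entry from the flooding host before it reaches Loki
  - match:
      selector: '{device_hostname="nexus-1.lab.example.com"}'
      action: drop
      drop_counter_reason: flooding_device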

"Loki query is slow"

1. Query MUST start with a label matcher (e.g. {app="tlsstress"})
2. Avoid free-text grep on multi-day ranges — use device_type/severity to narrow
3. Use logfmt/json parsers AFTER label filtering, not before
4. Time range > 7 days requires the chunk-cache to be hot;
   first-query-of-the-day is slow, subsequent queries are fast

Future enhancements

Tracked for future iteration:

  • RFC 5425 syslog-over-TLS (encryption) — operator-controlled cert generation + rotation of the cert with cert-manager
  • Authenticated syslog via shared key (Cisco IOS supports this via logging server <ip> mac) — pairs with the relay model
  • Selective redaction at ingest (e.g. mask credit-card-shape strings before they hit Loki)
  • LLDP/CDP neighbor discovery → auto-generate device_role labels (planned, separate module)
  • Test Run Report Phase 4 — embed selected log excerpts in the signed PDF