# Time synchronization — TLSStress.Art
Scope status (post-Scope-Freeze 2026-05-10) — see ARCHITECTURE.md for the canonical 37 MODULEs + 7 Test Kinds + DOM/CPOS/PIE-PA safety architecture. ADRs 0014 and 0019-0025 cover post-Freeze additions.

Accurate time on every host of the test bed is not optional: without it, the measurements this product is built to produce are forensically worthless and will mislead anyone reading them.
## Why NTP is critical for THIS product
| Component | What goes wrong if clock drift > 5 s |
|---|---|
| TLS handshakes | Cert NotBefore/NotAfter checks reject valid certs → handshakes fail intermittently. The TLS Decrypt Probe sees the failure but cannot distinguish it from "NGFW broken" |
| Prometheus metrics | Out-of-order timestamps poison panels, SLO calculations break, "future timestamp" warnings clog logs |
| Cross-engagement comparison | Two engineers in different geographies running the same plan produce numbers timestamped at the wrong moment, breaking the "comparable across engagements" contract |
| Audit log forensics | License acceptance rows, fingerprint tamper-check events, access-broker grants — all need accurate created_at for legal value |
| Test plan phase timing | A 30-min plan needs 30 actual minutes; drift cuts phases short or extends them past the timeline |
| Cosign signatures (Phase 4) | Signature notBefore/notAfter windows depend on accurate clock |
| NGFW syslog correlation | Engineers cross-checking test-bed observations with FTD/Palo logs see timestamps that don't align — debugging takes 10× longer |
| Distributed tracing (Tempo/OTel) | Span timing requires synchronized clocks; off-by-1s gives spans negative duration |
| License accept timestamps | First-login acceptance recorded with wrong time looks like backdating |
## Acceptable thresholds
| Drift | State | Action |
|---|---|---|
| < 100 ms | Ideal | Cross-host correlation perfect |
| 100 ms–1 s | OK | Most metrics + TLS work normally |
| 1 s–5 s | Degraded | TLS may fail intermittently; metrics correlation suffers; alert NodeClockSkewDegraded fires |
| > 5 s | Invalid | Runs are forensically worthless. Alert NodeClockSkewInvalid fires; abort active runs; do not start new ones |
| Daemon not running | Critical | Alert NodeNotSyncedToNTP fires; clock will drift unbounded |
These thresholds are encoded in `k8s/dut/17-time-sync-rules.yaml`.
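As an illustration of the shape of those rules, a skew alert can be expressed against node_exporter's `node_timex_offset_seconds` metric. This is a sketch only, not the contents of `17-time-sync-rules.yaml`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: time-sync-validity-sketch
spec:
  groups:
    - name: time-sync
      rules:
        - alert: NodeClockSkewInvalid
          # node_timex_offset_seconds is the kernel's estimated offset from
          # its NTP reference, exported by node_exporter's timex collector
          expr: abs(node_timex_offset_seconds) > 5
          for: 2m
          labels:
            severity: critical
```

The `for: 2m` debounce avoids firing on a single transient sample; the real rule file may tune this differently.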
## Connected installations — recommended sources
In rough order of preference:
- Cisco-internal — `ntp1.cisco.com`, `ntp2.cisco.com` (if accessible from your network)
- Distinct stratum-1 anchors — at least three of:
    - `time.cloudflare.com` (anycast, very low jitter)
    - `time.google.com` / `time1.google.com` … `time4.google.com`
    - `time.nist.gov`
    - `2.pool.ntp.org` (community pool, decent fallback)
- Local Cisco router or Nexus — many sites already run an NTP master on their gear

Configure with chrony (preferred over systemd-timesyncd for production — better visibility, faster resync):
Configure with chrony (preferred over systemd-timesyncd for production — better visibility, faster resync):
```
# /etc/chrony/chrony.conf
pool time.cloudflare.com iburst minpoll 5 maxpoll 7
pool time.google.com iburst minpoll 5 maxpoll 7
server 2.pool.ntp.org iburst minpoll 6 maxpoll 8
# Step the clock if the offset exceeds 1 s, but only during the first 3 updates
makestep 1 3
# Periodically copy system time to the hardware RTC
rtcsync
# Serve our own clock at stratum 10 if all upstreams become unreachable
local stratum 10
```
```shell
sudo systemctl restart chronyd
sudo chronyc tracking
sudo chronyc sources -v
```
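The drift check can also be reproduced by hand by parsing `chronyc tracking`. A minimal sketch, run here against a captured sample of the output (the sample values and the parsing approach are illustrative, not what `check-time-sync.sh` literally does):

```shell
#!/usr/bin/env bash
# Captured sample; on a live host replace with: tracking="$(chronyc tracking)"
tracking='Reference ID    : A29FC87B (time.cloudflare.com)
Stratum         : 4
System time     : 0.000412000 seconds fast of NTP time'

# Extract the offset in seconds (first word after the "System time :" label)
offset=$(awk -F': *' '/^System time/ {split($2, a, " "); print a[1]}' <<<"$tracking")

# Classify against the thresholds table (awk handles the float comparison)
state=$(awk -v o="$offset" 'BEGIN {
  o = (o < 0) ? -o : o
  if      (o < 0.1) print "ideal"
  else if (o < 1)   print "ok"
  else if (o < 5)   print "degraded"
  else              print "invalid"
}')
echo "offset=${offset}s state=${state}"
```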
## Air-gapped installations — internal NTP source
In a fully isolated data center the laptop staging the bundle has internet but the UCS does not. Public NTP servers are unreachable from the UCS. The recommended chain:
```
External GPS / atomic clock
        │
        ▼
Site stratum-1 NTP server (Cisco router, Meinberg, GPS-disciplined NTP appliance)
        │
        ▼
Lab stratum-2 (Nexus 9000 acting as NTP server, OR a dedicated Linux NTP host)
        │
        ▼
UCS host(s) (chrony pointed at lab stratum-2)
```
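If the stratum-2 tier is the dedicated Linux host rather than a Nexus, its chrony configuration is roughly the following sketch (the 192.168.10.x upstream addresses mirror the Nexus example in this document; the 192.168.20.0/24 UCS subnet is a placeholder for your lab network):

```
# /etc/chrony/chrony.conf on the lab stratum-2 Linux host
server 192.168.10.1 iburst prefer   # site stratum-1
server 192.168.10.2 iburst          # backup
allow 192.168.20.0/24               # serve time to the UCS subnet
local stratum 8                     # keep serving (degraded) if upstream dies
makestep 1 3
rtcsync
```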
### Cisco Nexus 9000 as NTP server (most common path)
```
! On the Nexus:
configure terminal
ntp server 192.168.10.1 prefer     ! site stratum-1
ntp server 192.168.10.2            ! backup
ntp source-interface mgmt0
ntp authenticate                   ! optional; recommended for compliance
end
copy running-config startup-config
show ntp peer-status
show ntp peers
```
### UCS chrony pointed at the Nexus
```
# /etc/chrony/chrony.conf
server <nexus-mgmt-ip> iburst prefer minpoll 4 maxpoll 6
makestep 1 3
rtcsync
```
```shell
sudo systemctl restart chronyd
sudo chronyc waitsync 30 0.05   # wait up to 30 tries for offset < 50 ms
```
If your site does not have a stratum-1 anchor, the next-best option is a GPS-disciplined NTP appliance (~US$1,500-3,000) installed in the lab with its antenna placed for a clear sky view. This is the standard approach in classified facilities.
## What you absolutely cannot do in air-gap
- ❌ Point chrony at `pool.ntp.org` — unreachable
- ❌ Use the systemd-timesyncd default config — it also reaches public servers
- ❌ Run without NTP — RTC drift on commodity x86 is around 1 s per day; on virtualized KVM hosts it can reach 1 minute per day under load. After a week, all measurements are invalid.
## Multi-node deployments
In dual / tri / multi-node setups, every UCS host must point at the same stratum-2 source so they converge to the same time. Hosts synced to different upstreams can drift apart by tens of milliseconds or more, even when each is individually healthy.
The alert MultiHostClockDivergence fires when any two hosts in the cluster differ by > 1 s, regardless of whether each is individually synced. The fix is always: point all hosts at the same upstream.
## Verification (manual)
The bundled script `scripts/check-time-sync.sh` does this for you:
```shell
./scripts/check-time-sync.sh
# exit 0 = OK
# exit 1 = degraded (drift 1-5 s, or strict mode + drift > 100 ms)
# exit 2 = INVALID (drift > 5 s, or daemon unsynchronized)
# exit 3 = no time daemon installed
```
JSON output for the dashboard:
```shell
./scripts/check-time-sync.sh --json
# {"status":"ok","drift_ms":4,"source":"NEXUS-MGMT","reason":""}
```
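Consumers that cannot assume `jq` is installed can extract the flat fields with `sed`. A sketch against the sample line above (it relies on this exact single-line shape; anything richer should use a real JSON parser):

```shell
#!/usr/bin/env bash
# Flat single-line JSON as emitted by check-time-sync.sh --json (sample)
json='{"status":"ok","drift_ms":4,"source":"NEXUS-MGMT","reason":""}'

# Pull one string field and one numeric field; fine for this fixed shape only
status=$(sed -n 's/.*"status":"\([^"]*\)".*/\1/p' <<<"$json")
drift_ms=$(sed -n 's/.*"drift_ms":\([0-9]*\).*/\1/p' <<<"$json")
echo "status=$status drift_ms=$drift_ms"
```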
Strict mode (alerts even on small drift, useful before starting a critical run):
./scripts/check-time-sync.sh --strict
Run it on every host before a multi-node run:
```shell
for host in ucs-1 ucs-2 ucs-3; do
  # Stream the script to the remote shell so it need not be installed there;
  # "--" ensures --json is passed to the script, not to bash itself
  ssh "$host" 'bash -s -- --json' < scripts/check-time-sync.sh
done
```
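The collected per-host values map onto the MultiHostClockDivergence check as max(drift) minus min(drift). A sketch with hard-coded sample values standing in for the loop's output:

```shell
#!/usr/bin/env bash
# Sample drift_ms values as collected from three hosts (signed: fast/slow of NTP)
drifts_ms="4 -12 38"

# Max pairwise divergence == max(drift) - min(drift)
read -r max min <<<"$(tr ' ' '\n' <<<"$drifts_ms" | sort -n \
  | awk 'NR==1{min=$1} {max=$1} END{print max, min}')"
div=$((max - min))
echo "divergence_ms=$div"
[ "$div" -gt 1000 ] && echo "ALERT: MultiHostClockDivergence would fire" \
                    || echo "hosts within 1 s of each other"
```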
## In-cluster alerting
The PrometheusRule `time-sync-validity` (in `k8s/dut/17-time-sync-rules.yaml`) emits four alerts:
- `NodeClockSkewDegraded` — drift > 1 s, severity warning
- `NodeClockSkewInvalid` — drift > 5 s, severity critical, ABORT runs
- `NodeNotSyncedToNTP` — daemon does not report sync status, severity critical
- `MultiHostClockDivergence` — hosts disagree by > 1 s, severity warning
The Test Plan engine (Phase 3+, future enhancement) will refuse to start a run if any of these alerts is firing.
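Until the engine enforces this, the same gate can be scripted against Prometheus's standard `/api/v1/alerts` endpoint. A sketch: the inline sample response stands in for a live `curl -s "$PROM_URL/api/v1/alerts"`, and the grep is a rough filter rather than a strict JSON parse:

```shell
#!/usr/bin/env bash
# Sample /api/v1/alerts response (live bed: curl -s "$PROM_URL/api/v1/alerts")
alerts='{"status":"success","data":{"alerts":[{"labels":{"alertname":"NodeClockSkewDegraded"},"state":"firing"}]}}'

# Any of the four time-sync alerts blocks a run
blocking="NodeClockSkewDegraded|NodeClockSkewInvalid|NodeNotSyncedToNTP|MultiHostClockDivergence"
if grep -Eq "\"alertname\":\"($blocking)\"" <<<"$alerts"; then
  echo "time-sync alert firing; refusing to start run"
  exit_code=1   # a real gate would `exit 1` here
else
  echo "clock alerts clear; run may start"
  exit_code=0
fi
```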
## Symptoms checklist
You probably have a time-sync problem if:
- ✗ The TLS Decrypt Probe alternates between `active` and `unknown` randomly
- ✗ Grafana panels say "no data" even though Prometheus targets are UP
- ✗ Some agents register, some don't — register-retry logs show "expired" / "before NotBefore"
- ✗ `kubectl get events` shows certificate validation errors out of nowhere
- ✗ The Dashboard's Runs page shows entries with `started_at` in the future
- ✗ Tracing spans in Tempo have negative durations
- ✗ Multiple operators of the same UCS report inconsistent results
When any of these surface: stop, run `./scripts/check-time-sync.sh`, then investigate. Do not chase the symptom while the clock is wrong — you'll waste hours.
## Recovering from severe drift
If drift > 60 s on an already-running host, `makestep 1 3` no longer applies (it only steps the clock during the first 3 updates after chronyd starts), so chrony slews slowly instead, which can take hours. Force an immediate step:
```shell
sudo chronyc -a 'burst 4/4'
sudo chronyc -a makestep
sudo chronyc tracking
```
If drift > 1 hour (e.g. fresh hardware with a discharged RTC battery and no NTP), `timedatectl set-time` will refuse to act while an NTP service is active, so stop chrony first, set the time manually, then let chrony settle:

```shell
sudo systemctl stop chronyd
sudo timedatectl set-time '2026-05-06 14:35:00 UTC'
sudo systemctl start chronyd
```
## Related
- `MONITORING_TEST_VALIDITY.md` — broader test-bed validity alerting
- `AIRGAP_INSTALL.md` — internal NTP setup is mandatory there
- `RUNBOOK_FIRST_INSTALL.md` — clock check belongs in step 1.6 prerequisites
- `TLS_DECRYPT_MODE_VERIFICATION.en.md` — the probe assumes the clock is correct