Time synchronization — TLSStress.Art

Scope status (post-Scope-Freeze 2026-05-10) — See ARCHITECTURE.md for the canonical 37 MÓDULOs + 7 Test Kinds + DOM/CPOS/PIE-PA safety architecture. ADRs 0014, 0019-0025 cover post-Freeze additions.

Accurate time on every host of the test bed is not optional. Without it, the measurements this product is built to produce are forensically worthless and will mislead anyone reading them.

Why NTP is critical for THIS product

What goes wrong, per component, when clock drift exceeds 5 s:

  • TLS handshakes: cert NotBefore/NotAfter checks reject valid certs, so handshakes fail intermittently. The TLS Decrypt Probe sees the failure but cannot distinguish it from "NGFW broken".
  • Prometheus metrics: out-of-order timestamps poison panels, SLO calculations break, and "future timestamp" warnings clog the logs.
  • Cross-engagement comparison: two engineers in different geographies running the same plan produce numbers timestamped at the wrong moment, breaking the "comparable across engagements" contract.
  • Audit log forensics: license-acceptance rows, fingerprint tamper-check events, and access-broker grants all need an accurate created_at to have legal value.
  • Test plan phase timing: a 30-min plan needs 30 actual minutes; drift cuts phases short or stretches them past the timeline.
  • Cosign signatures (Phase 4): signature notBefore/notAfter windows depend on an accurate clock.
  • NGFW syslog correlation: engineers cross-checking test-bed observations against FTD/Palo logs see timestamps that don't align, and debugging takes 10× longer.
  • Distributed tracing (Tempo/OTel): span timing requires synchronized clocks; an off-by-1 s skew yields spans with negative duration.
  • License accept timestamps: first-login acceptance recorded with the wrong time looks like backdating.

Acceptable thresholds

Drift                State       Action
< 100 ms             Ideal       Cross-host correlation is essentially perfect
100 ms – 1 s         OK          Most metrics and TLS work normally
1 s – 5 s            Degraded    TLS may fail intermittently; metrics correlation suffers; NodeClockSkewDegraded fires
> 5 s                Invalid     Runs are forensically worthless; NodeClockSkewInvalid fires; abort active runs and do not start new ones
Daemon not running   Critical    NodeNotSyncedToNTP fires; the clock will drift unbounded

These thresholds are encoded in k8s/dut/17-time-sync-rules.yaml.
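
If you want to sanity-check a single host against these thresholds without the bundled script, the sketch below reads the last measured offset from chronyc and maps it to the same states. It assumes chrony is the time daemon and approximates, rather than reproduces, the logic in check-time-sync.sh.

# Read the most recent measured offset (in seconds) from chrony
offset=$(chronyc tracking 2>/dev/null | awk '/^Last offset/ {print $4}')
if [ -z "$offset" ]; then
  echo "no time daemon reachable"; exit 3
fi
# Convert to absolute milliseconds
drift_ms=$(awk -v o="$offset" 'BEGIN { printf "%.0f", (o < 0 ? -o : o) * 1000 }')

if   [ "$drift_ms" -lt 100 ];  then echo "ideal    (${drift_ms} ms)"
elif [ "$drift_ms" -lt 1000 ]; then echo "ok       (${drift_ms} ms)"
elif [ "$drift_ms" -lt 5000 ]; then echo "degraded (${drift_ms} ms)"; exit 1
else                                echo "invalid  (${drift_ms} ms)"; exit 2
fi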

Upstream NTP sources

In rough order of preference (a quick per-server reachability check follows the list):

  1. Cisco-internal: ntp1.cisco.com, ntp2.cisco.com (if accessible from your network)
  2. Distinct stratum-1 anchors — at least three of:
       • time.cloudflare.com (anycast, very low jitter)
       • time.google.com (or time1.google.com through time4.google.com)
       • time.nist.gov
       • 2.pool.ntp.org (community pool, decent fallback)
  3. Local Cisco router or Nexus — many sites already run an NTP master on their gear
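
Before committing any of these to chrony.conf, you can take a one-shot offset measurement against a candidate server. The dry-run query below only prints the measured offset and does not touch the clock; run it before the chronyd service is enabled, or stop the service first, to avoid the two instances conflicting.

sudo chronyd -Q 'server time.cloudflare.com iburst'    # one-shot measurement, clock is not changed
sudo chronyd -Q 'server time.google.com iburst'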

Configure with chrony (preferred over systemd-timesyncd for production — better visibility, faster resync):

# /etc/chrony/chrony.conf
pool time.cloudflare.com iburst minpoll 5 maxpoll 7
pool time.google.com iburst minpoll 5 maxpoll 7
server 2.pool.ntp.org iburst minpoll 6 maxpoll 8

# Step the clock if the offset exceeds 1 s, but only during the first 3 updates
makestep 1 3
# Keep the hardware RTC disciplined alongside the system clock
rtcsync
# Serve time at stratum 10 if every upstream becomes unreachable
local stratum 10

Then restart the daemon and confirm it is tracking:

sudo systemctl restart chronyd
sudo chronyc tracking
sudo chronyc sources -v
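
If the host shipped with systemd-timesyncd enabled, disable it so two daemons do not fight over the clock (assumes a systemd-based distribution):

# chronyd and systemd-timesyncd must not both discipline the clock
sudo systemctl disable --now systemd-timesyncd
systemctl is-active chronyd    # should print "active"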

Air-gapped installations — internal NTP source

In a fully isolated data center the laptop staging the bundle has internet but the UCS does not. Public NTP servers are unreachable from the UCS. The recommended chain:

External GPS / atomic clock
        │
        ▼
Site stratum-1 NTP server  (Cisco router, Meinberg, GPS-disciplined NTP appliance)
        │
        ▼
Lab stratum-2  (Nexus 9000 acting as NTP server, OR a dedicated Linux NTP host)
        │
        ▼
UCS host(s)   (chrony pointed at lab stratum-2)

Cisco Nexus 9000 as NTP server (most common path)

! On the Nexus:
configure terminal
ntp server 192.168.10.1 prefer    ! site stratum-1
ntp server 192.168.10.2          ! backup
ntp source-interface mgmt0
ntp authenticate                  ! optional; recommended for compliance
end
copy running-config startup-config
show ntp peer-status
show ntp peers

UCS chrony pointed at the Nexus

# /etc/chrony/chrony.conf
server <nexus-mgmt-ip> iburst prefer minpoll 4 maxpoll 6
makestep 1 3
rtcsync

Apply and wait for the first sync:

sudo systemctl restart chronyd
sudo chronyc waitsync 30 0.05    # up to 30 checks (~10 s apart) for a remaining correction under 50 ms
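
To confirm the host actually selected the Nexus rather than free-running on its local clock, check chronyc sources: the currently selected source is marked with ^*. The <nexus-mgmt-ip> placeholder below is the same one used in the config above.

# The selected source line in chronyc sources starts with "^*"
if chronyc sources | grep '^\^\*' | grep -q '<nexus-mgmt-ip>'; then
  echo "synced to the Nexus"
else
  echo "WARNING: not currently synced to the Nexus" >&2
fi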

If your site does not have a stratum-1 anchor, the next-best option is a GPS-disciplined NTP appliance (~US$1500-3000) installed in the lab with its antenna placed where it has a clear view of the sky. This is the standard for classified facilities.

What you absolutely cannot do in air-gap

  • ❌ Point chrony at pool.ntp.org — unreachable
  • ❌ Use systemd-timesyncd default config — also reaches public servers
  • ❌ Run without NTP — RTC drift on commodity x86 is around 1 s per day; on commodity virtualized KVM hosts it can reach 1 minute per day under load. After a week, all measurements are invalid.

Multi-node deployments

In dual-, tri-, or multi-node setups, every UCS host must point at the same stratum-2 source so they all converge to the same time. Hosts pointed at different upstreams can diverge by tens of milliseconds or more even when each is individually synced.

The alert MultiHostClockDivergence fires when any two hosts in the cluster differ by > 1 s, regardless of whether each is individually synced. The fix is always: point all hosts at the same upstream.
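
A rough way to eyeball cross-host divergence from the staging laptop is to sample each host's clock over SSH and compare it to local time. Hostnames below are illustrative, and SSH round-trip time adds noise on the order of tens of milliseconds, so treat small differences as equal.

# Print each host's approximate offset from the local clock, in milliseconds
for host in ucs-1 ucs-2 ucs-3; do
  remote=$(ssh "$host" 'date +%s.%N')
  local_now=$(date +%s.%N)
  awk -v h="$host" -v r="$remote" -v l="$local_now" \
      'BEGIN { printf "%-8s %+.0f ms\n", h, (r - l) * 1000 }'
done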

Verification (manual)

The bundled script scripts/check-time-sync.sh automates the drift and daemon checks for you:

./scripts/check-time-sync.sh
# exit 0 = OK
# exit 1 = degraded (drift 1-5s, or strict mode + drift > 100ms)
# exit 2 = INVALID (drift > 5s, or daemon unsynchronized)
# exit 3 = no time daemon installed
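
If you wrap the script in your own pre-run tooling, one way to turn those exit codes into a go/no-go decision looks like this (a sketch; the codes are the ones listed above):

./scripts/check-time-sync.sh
case $? in
  0) echo "time sync OK" ;;
  1) echo "WARNING: clock degraded; cross-host correlation will suffer" >&2 ;;
  *) echo "FATAL: clock invalid or no daemon running; fix time sync before the run" >&2; exit 1 ;;
esac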

JSON output for the dashboard:

./scripts/check-time-sync.sh --json
# {"status":"ok","drift_ms":4,"source":"NEXUS-MGMT","reason":""}
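
If another tool consumes the JSON, a jq one-liner (assuming jq is installed) pulls the interesting fields out:

./scripts/check-time-sync.sh --json | jq -r '"\(.status) \(.drift_ms) ms via \(.source)"'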

Strict mode (alerts even on small drift, useful before starting a critical run):

./scripts/check-time-sync.sh --strict

Run it on every host before a multi-node run:

for host in ucs-1 ucs-2 ucs-3; do
  ssh "$host" 'bash -s -- --json' < scripts/check-time-sync.sh
done
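
To make that loop abort as soon as any host reports anything other than OK, a small variation (intended for a wrapper script; hostnames are illustrative):

failed=0
for host in ucs-1 ucs-2 ucs-3; do
  ssh "$host" 'bash -s' < scripts/check-time-sync.sh \
    || { echo "$host: time-sync check failed" >&2; failed=1; }
done
[ "$failed" -eq 0 ] || { echo "aborting: fix time sync on the failing host(s) first" >&2; exit 1; }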

In-cluster alerting

The PrometheusRule time-sync-validity (in k8s/dut/17-time-sync-rules.yaml) emits four alerts:

  • NodeClockSkewDegraded — drift > 1 s, severity warning
  • NodeClockSkewInvalid — drift > 5 s, severity critical, ABORT runs
  • NodeNotSyncedToNTP — daemon does not report sync status, severity critical
  • MultiHostClockDivergence — hosts disagree by > 1 s, severity warning

The Test Plan engine (Phase 3+, future enhancement) will refuse to start a run if any of these alerts is firing.
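
Until that gate exists, an operator can replicate it by hand against the Prometheus alerts API. The address below is a placeholder for your in-cluster Prometheus endpoint, and the check needs curl and jq on the host running it.

# Count firing time-sync alerts; anything non-zero means: do not start a run
curl -s http://prometheus.example:9090/api/v1/alerts \
  | jq '[.data.alerts[]
         | select(.state == "firing")
         | select(.labels.alertname | test("NodeClockSkew|NodeNotSyncedToNTP|MultiHostClockDivergence"))
        ] | length'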

Symptoms checklist

You probably have a time-sync problem if:

  • ✗ The TLS Decrypt Probe alternates between active and unknown randomly
  • ✗ Grafana panels say "no data" even though Prometheus targets are UP
  • ✗ Some agents register, some don't — register-retry logs show "expired" / "before NotBefore"
  • ✗ kubectl get events shows certificate validation errors out of nowhere
  • ✗ The Dashboard's Runs page shows entries with started_at in the future
  • ✗ Tracing spans in Tempo have negative durations
  • ✗ Multiple operators of the same UCS report inconsistent results

When any of these surface: stop, run ./scripts/check-time-sync.sh, then investigate. Do not chase the symptom while the clock is wrong — you'll waste hours.

Recovering from severe drift

If drift > 60 s, the makestep 1 3 directive above won't rescue a long-running daemon (it only steps during the first 3 updates after startup), and slewing that far would take hours. Force an immediate step:

sudo chronyc burst 4/4     # take four quick measurements from each source
sudo chronyc makestep      # step the clock by the measured offset now
sudo chronyc tracking

If drift > 1 hour (e.g. fresh hardware with a discharged RTC battery and no NTP yet), set the time manually first, then let chrony settle:

sudo timedatectl set-time '2026-05-06 14:35:00 UTC'
sudo systemctl restart chronyd
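
After the restart it is worth forcing the remaining correction and re-running the bundled check before resuming any runs:

sudo chronyc makestep            # apply whatever correction chrony has measured so far
sudo chronyc waitsync 30 0.05    # wait until the remaining correction is under 50 ms
./scripts/check-time-sync.sh     # must exit 0 before runs restart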