# Time synchronization — TLSStress.Art
Scope status (post-Scope-Freeze 2026-05-10) — see ARCHITECTURE.md for the canonical 37 MODULEs + 7 Test Kinds + DOM/CPOS/PIE-PA safety architecture. ADRs 0014 and 0019-0025 cover post-Freeze additions.

Accurate time on every host of the test bed is not optional: without it, the measurements this product is built to produce are forensically worthless and will mislead anyone reading them.
## Why NTP is critical for THIS product
| Component | What goes wrong if clock drift > 5 s |
|---|---|
| TLS handshakes | Cert NotBefore/NotAfter checks reject valid certs → handshakes fail intermittently. The TLS Decrypt Probe sees the failure but cannot distinguish it from "NGFW broken" |
| Prometheus metrics | Out-of-order timestamps poison panels, SLO calculations break, "future timestamp" warnings clog logs |
| Cross-engagement comparison | Two engineers in different geographies running the same plan produce numbers timestamped at the wrong moment, breaking the "comparable across engagements" contract |
| Audit log forensics | License acceptance rows, fingerprint tamper-check events, access-broker grants — all need accurate created_at for legal value |
| Test plan phase timing | A 30-min plan needs 30 actual minutes; drift cuts phases short or extends them past the timeline |
| Cosign signatures (Phase 4) | Signature notBefore/notAfter windows depend on accurate clock |
| NGFW syslog correlation | Engineers cross-checking test-bed observations with FTD/Palo logs see timestamps that don't align — debugging takes 10× longer |
| Distributed tracing (Tempo/OTel) | Span timing requires synchronized clocks; off-by-1s gives spans negative duration |
| License accept timestamps | First-login acceptance recorded with wrong time looks like backdating |
## Acceptable thresholds
| Drift | State | Action |
|---|---|---|
| < 100 ms | Ideal | Cross-host correlation perfect |
| 100 ms–1 s | OK | Most metrics + TLS work normally |
| 1 s–5 s | Degraded | TLS may fail intermittently; metrics correlation suffers; alert NodeClockSkewDegraded fires |
| > 5 s | Invalid | Runs are forensically worthless. Alert NodeClockSkewInvalid fires; abort active runs; do not start new ones |
| Daemon not running | Critical | Alert NodeNotSyncedToNTP fires; clock will drift unbounded |
These thresholds are encoded in `k8s/dut/17-time-sync-rules.yaml`.
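As an illustration of the shape of those rules, a skew alert can be expressed against node_exporter's `node_timex_offset_seconds` metric. This is a sketch only, not the contents of `17-time-sync-rules.yaml`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: time-sync-validity-sketch
spec:
  groups:
    - name: time-sync
      rules:
        - alert: NodeClockSkewInvalid
          # node_timex_offset_seconds is the kernel's estimated offset from
          # its NTP reference, exported by node_exporter's timex collector
          expr: abs(node_timex_offset_seconds) > 5
          for: 2m
          labels:
            severity: critical
```

The `for: 2m` debounce avoids firing on a single transient sample; the real rule file may tune this differently.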
## Connected installations — recommended sources
In rough order of preference:
- Cisco-internal — `ntp1.cisco.com`, `ntp2.cisco.com` (if accessible from your network)
- Distinct stratum-1 anchors — at least three of:
    - `time.cloudflare.com` (anycast, very low jitter)
    - `time.google.com` / `time1.google.com` … `time4.google.com`
    - `time.nist.gov`
    - `2.pool.ntp.org` (community pool, decent fallback)
- Local Cisco router or Nexus — many sites already run an NTP master on their gear

Configure with chrony (preferred over systemd-timesyncd for production — better visibility, faster resync):
Configure with chrony (preferred over systemd-timesyncd for production — better visibility, faster resync):
```
# /etc/chrony/chrony.conf
pool time.cloudflare.com iburst minpoll 5 maxpoll 7
pool time.google.com iburst minpoll 5 maxpoll 7
server 2.pool.ntp.org iburst minpoll 6 maxpoll 8
# Step the clock if the offset exceeds 1 s, but only during the first 3 updates
makestep 1 3
# Periodically copy system time to the hardware RTC
rtcsync
# Serve our own clock at stratum 10 if all upstreams become unreachable
local stratum 10
```
```shell
sudo systemctl restart chronyd
sudo chronyc tracking
sudo chronyc sources -v
```
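The drift check can also be reproduced by hand by parsing `chronyc tracking`. A minimal sketch, run here against a captured sample of the output (the sample values and the parsing approach are illustrative, not what `check-time-sync.sh` literally does):

```shell
#!/usr/bin/env bash
# Captured sample; on a live host replace with: tracking="$(chronyc tracking)"
tracking='Reference ID    : A29FC87B (time.cloudflare.com)
Stratum         : 4
System time     : 0.000412000 seconds fast of NTP time'

# Extract the offset in seconds (first word after the "System time :" label)
offset=$(awk -F': *' '/^System time/ {split($2, a, " "); print a[1]}' <<<"$tracking")

# Classify against the thresholds table (awk handles the float comparison)
state=$(awk -v o="$offset" 'BEGIN {
  o = (o < 0) ? -o : o
  if      (o < 0.1) print "ideal"
  else if (o < 1)   print "ok"
  else if (o < 5)   print "degraded"
  else              print "invalid"
}')
echo "offset=${offset}s state=${state}"
```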
## Air-gapped installations — internal NTP source
In a fully isolated data center the laptop staging the bundle has internet but the UCS does not. Public NTP servers are unreachable from the UCS. The recommended chain:
```
External GPS / atomic clock
        │
        ▼
Site stratum-1 NTP server (Cisco router, Meinberg, GPS-disciplined NTP appliance)
        │
        ▼
Lab stratum-2 (Nexus 9000 acting as NTP server, OR a dedicated Linux NTP host)
        │
        ▼
UCS host(s) (chrony pointed at lab stratum-2)
```
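If the stratum-2 tier is the dedicated Linux host rather than a Nexus, its chrony configuration is roughly the following sketch (the 192.168.10.x upstream addresses mirror the Nexus example in this document; the 192.168.20.0/24 UCS subnet is a placeholder for your lab network):

```
# /etc/chrony/chrony.conf on the lab stratum-2 Linux host
server 192.168.10.1 iburst prefer   # site stratum-1
server 192.168.10.2 iburst          # backup
allow 192.168.20.0/24               # serve time to the UCS subnet
local stratum 8                     # keep serving (degraded) if upstream dies
makestep 1 3
rtcsync
```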
### Cisco Nexus 9000 as NTP server (most common path)
```
! On the Nexus:
configure terminal
ntp server 192.168.10.1 prefer     ! site stratum-1
ntp server 192.168.10.2            ! backup
ntp source-interface mgmt0
ntp authenticate                   ! optional; recommended for compliance
end
copy running-config startup-config
show ntp peer-status
show ntp peers
```
### UCS chrony pointed at the Nexus
```
# /etc/chrony/chrony.conf
server <nexus-mgmt-ip> iburst prefer minpoll 4 maxpoll 6
makestep 1 3
rtcsync
```
```shell
sudo systemctl restart chronyd
sudo chronyc waitsync 30 0.05   # wait up to 30 tries for offset < 50 ms
```
If your site does not have a stratum-1 anchor, the next-best option is a GPS-disciplined NTP appliance (~US$1,500-3,000) installed in the lab with its antenna placed for a clear sky view. This is the standard approach in classified facilities.
## What you absolutely cannot do in air-gap
- ❌ Point chrony at `pool.ntp.org` — unreachable
- ❌ Use the systemd-timesyncd default config — it also reaches public servers
- ❌ Run without NTP — RTC drift on commodity x86 is around 1 s per day; on virtualized KVM hosts it can reach 1 minute per day under load. After a week, all measurements are invalid.
## Multi-node deployments
In dual / tri / multi-node setups, every UCS host must point at the same stratum-2 source so they converge to the same time. Hosts synced to different upstreams can drift apart by tens of milliseconds or more, even when each is individually healthy.
The alert MultiHostClockDivergence fires when any two hosts in the cluster differ by > 1 s, regardless of whether each is individually synced. The fix is always: point all hosts at the same upstream.
## Verification (manual)
The bundled script `scripts/check-time-sync.sh` does this for you:
```shell
./scripts/check-time-sync.sh
# exit 0 = OK
# exit 1 = degraded (drift 1-5 s, or strict mode + drift > 100 ms)
# exit 2 = INVALID (drift > 5 s, or daemon unsynchronized)
# exit 3 = no time daemon installed
```
JSON output for the dashboard:
```shell
./scripts/check-time-sync.sh --json
# {"status":"ok","drift_ms":4,"source":"NEXUS-MGMT","reason":""}
```
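Consumers that cannot assume `jq` is installed can extract the flat fields with `sed`. A sketch against the sample line above (it relies on this exact single-line shape; anything richer should use a real JSON parser):

```shell
#!/usr/bin/env bash
# Flat single-line JSON as emitted by check-time-sync.sh --json (sample)
json='{"status":"ok","drift_ms":4,"source":"NEXUS-MGMT","reason":""}'

# Pull one string field and one numeric field; fine for this fixed shape only
status=$(sed -n 's/.*"status":"\([^"]*\)".*/\1/p' <<<"$json")
drift_ms=$(sed -n 's/.*"drift_ms":\([0-9]*\).*/\1/p' <<<"$json")
echo "status=$status drift_ms=$drift_ms"
```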
Strict mode (alerts even on small drift, useful before starting a critical run):
./scripts/check-time-sync.sh --strict
Run it on every host before a multi-node run:
```shell
for host in ucs-1 ucs-2 ucs-3; do
  # Stream the script to the remote shell so it need not be installed there;
  # "--" ensures --json is passed to the script, not to bash itself
  ssh "$host" 'bash -s -- --json' < scripts/check-time-sync.sh
done
```
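The collected per-host values map onto the MultiHostClockDivergence check as max(drift) minus min(drift). A sketch with hard-coded sample values standing in for the loop's output:

```shell
#!/usr/bin/env bash
# Sample drift_ms values as collected from three hosts (signed: fast/slow of NTP)
drifts_ms="4 -12 38"

# Max pairwise divergence == max(drift) - min(drift)
read -r max min <<<"$(tr ' ' '\n' <<<"$drifts_ms" | sort -n \
  | awk 'NR==1{min=$1} {max=$1} END{print max, min}')"
div=$((max - min))
echo "divergence_ms=$div"
[ "$div" -gt 1000 ] && echo "ALERT: MultiHostClockDivergence would fire" \
                    || echo "hosts within 1 s of each other"
```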
## In-cluster alerting
The PrometheusRule `time-sync-validity` (in `k8s/dut/17-time-sync-rules.yaml`) emits four alerts:
- `NodeClockSkewDegraded` — drift > 1 s, severity warning
- `NodeClockSkewInvalid` — drift > 5 s, severity critical, ABORT runs
- `NodeNotSyncedToNTP` — daemon does not report sync status, severity critical
- `MultiHostClockDivergence` — hosts disagree by > 1 s, severity warning
The Test Plan engine (Phase 3+, future enhancement) will refuse to start a run if any of these alerts is firing.
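Until the engine enforces this, the same gate can be scripted against Prometheus's standard `/api/v1/alerts` endpoint. A sketch: the inline sample response stands in for a live `curl -s "$PROM_URL/api/v1/alerts"`, and the grep is a rough filter rather than a strict JSON parse:

```shell
#!/usr/bin/env bash
# Sample /api/v1/alerts response (live bed: curl -s "$PROM_URL/api/v1/alerts")
alerts='{"status":"success","data":{"alerts":[{"labels":{"alertname":"NodeClockSkewDegraded"},"state":"firing"}]}}'

# Any of the four time-sync alerts blocks a run
blocking="NodeClockSkewDegraded|NodeClockSkewInvalid|NodeNotSyncedToNTP|MultiHostClockDivergence"
if grep -Eq "\"alertname\":\"($blocking)\"" <<<"$alerts"; then
  echo "time-sync alert firing; refusing to start run"
  exit_code=1   # a real gate would `exit 1` here
else
  echo "clock alerts clear; run may start"
  exit_code=0
fi
```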
## Symptoms checklist
You probably have a time-sync problem if:
- ✗ The TLS Decrypt Probe alternates between `active` and `unknown` randomly
- ✗ Grafana panels say "no data" even though Prometheus targets are UP
- ✗ Some agents register, some don't — register-retry logs show "expired" / "before NotBefore"
- ✗ `kubectl get events` shows certificate validation errors out of nowhere
- ✗ The Dashboard's Runs page shows entries with `started_at` in the future
- ✗ Tracing spans in Tempo have negative durations
- ✗ Multiple operators of the same UCS report inconsistent results
When any of these surface: stop, run `./scripts/check-time-sync.sh`, then investigate. Do not chase the symptom while the clock is wrong — you'll waste hours.
## Recovering from severe drift
If drift > 60 s on an already-running host, `makestep 1 3` no longer applies (it only steps the clock during the first 3 updates after chronyd starts), so chrony slews slowly instead, which can take hours. Force an immediate step:
```shell
sudo chronyc -a 'burst 4/4'
sudo chronyc -a makestep
sudo chronyc tracking
```
If drift > 1 hour (e.g. fresh hardware with a discharged RTC battery and no NTP), `timedatectl set-time` will refuse to act while an NTP service is active, so stop chrony first, set the time manually, then let chrony settle:

```shell
sudo systemctl stop chronyd
sudo timedatectl set-time '2026-05-06 14:35:00 UTC'
sudo systemctl start chronyd
```
## Related
- `MONITORING_TEST_VALIDITY.md` — broader test-bed validity alerting
- `AIRGAP_INSTALL.md` — internal NTP setup is mandatory there
- `RUNBOOK_FIRST_INSTALL.md` — clock check belongs in step 1.6 prerequisites
- `TLS_DECRYPT_MODE_VERIFICATION.en.md` — the probe assumes the clock is correct