
ADR 0013 — Self-Upgrade (Meraki-style firmware management)

  • Status: Accepted (formalized 2026-05-12; scaffolds shipping in v3.7.0)
  • Date: 2026-05-08
  • Deciders: TLSStress.Art project
  • Targets: v4.7 (this ADR locks the design; implementation follows in PR-2..PR-7, a ~3-4 week sprint, shipping alongside the Cloner-as-Platform expansion, since PR-3 here is the same code as Function 7 of the Cloner platform)

Context

The bench is a self-hosted, k3s-based, multi-pod system. Operators deploy it on their own infrastructure (single-node lab through multi-node prod per ADR 0011 topology axes) and own its operational lifecycle. Today there is no first-class upgrade path — operators manually pull new images, run database migrations, and hope nothing breaks.

That is unacceptable for the quality bar locked in project_quality_excellence_policy_2026_05_08:

"precisamos de profissionalismo, seriedade, perfeição em todos os aspectos (...) prevendo premiações futuras e reconhecimento do mercado global pela nossa excelência no projeto, na execução, na manutenção."

User declared 2026-05-08:

"existe uma opção chamada Firmware upgrades. Quando a engenharia da Meraki publica uma nova release, seja para correção de bugs ou evolução pra novas funcionalidades, existe um ambiente dedicado pra isso, lá as imagens são segmentadas por 'Recomendada', 'Candidata a recomendada' e 'Beta'. O Operador escolhe com qual imagem ele gostaria de aplicar, e com um simples botão de upgrade, ou schedule upgrade, nossa solução faria automaticamente seu próprio upgrade. O operador teria opção de Fallback em caso de problemas no caso de atualizações mal sucedidas (self-Healing)."

The Cisco Meraki firmware-upgrade UX is the gold standard for SaaS-managed network gear: operators don't think about which firmware is "the right one"; channel labels do the thinking; upgrades are one-click applies, or scheduled for off-hours; and automatic rollback when health checks fail means no 3 a.m. pager call.

Adopting this pattern for our self-hosted bench gives operators the same confidence: upgrade is one click; if it breaks, it un-breaks itself.

Decision

Add a first-class self-upgrade subsystem with these properties:

  1. Three release channels — Recommended (default) / Release Candidate / Beta. Operator picks the channel; channel label does the thinking.
  2. Apply now or schedule — operator clicks Upgrade or sets a future timestamp; the orchestrator handles the ceremony.
  3. Pre-upgrade snapshot — Layer 2 of the backup architecture (per project_backup_dr_strategy_2026_05_08) auto-fires before any apply; restore-on-failure uses Layer 3.
  4. Multi-signal post-upgrade health check — pod state, Postgres reachability, smoke-test plan, dashboard render, persona forwarding, BGP peer (when enabled), latency baseline.
  5. Automatic self-healing rollback — within 60 seconds of a hard-fail signal, restore the pre-upgrade snapshot and notify the operator. Soft warns are logged but do not trigger a rollback.
  6. Signed releases — sigstore / cosign signature on every artefact (reuses the supply-chain stack landed in ADR 0005); operator-side verification before any apply.
  7. HQ release feed — TLSStress.Art HQ publishes a signed manifest of available builds + channels; the Cloner (per project_cloner_platform_2026_05_08) is the download channel.
  8. Air-gap fallback — operators without HQ egress can sideload upgrade-bundle.tar.gz via an admin upload form.

The upgrade system is always on — every deployment of the bench gets it; it cannot be disabled. (Operators who want a "frozen" deployment can simply not check for updates / not click upgrade. The subsystem stays passive until invoked.)
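To make that contract concrete: the operator-controllable surface reduces to a handful of fields, with deliberately no enable/disable switch. A minimal Go sketch, with illustrative field names rather than a locked schema:

// Sketch of the operator-facing settings surface of the self-upgrade
// subsystem. All field names are illustrative, not a locked schema.
package upgrade

import "time"

type Settings struct {
	Channel        string     // "recommended" (default) | "rc" | "beta"
	ScheduledAt    *time.Time // nil = apply immediately when Upgrade is clicked
	AutoRollback   bool       // self-healing on hard-fail health signals
	TelemetryOptIn bool       // anonymous outcome telemetry to HQ; default false
	// Deliberately no Enabled flag: the subsystem is always on and
	// stays passive until the operator checks for updates or upgrades.
}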

Architecture

Three release channels

| Channel | Audience | Promotion criteria | Typical lag from main |
|---|---|---|---|
| Recommended | Production deployments, conservative operators | ≥ 4 weeks in RC; zero P0/P1 incidents reported; manual SRE sign-off | 4-8 weeks |
| Release Candidate | Operators wanting features sooner; pilot environments | All Beta criteria met; ≥ 2 weeks in Beta; CI green; 100% smoke-test pass | 2-4 weeks |
| Beta | Early adopters; community contributors who want preview features | Internal CI green; basic smoke tests pass; opt-in only | 0-1 week |

Default channel: Recommended. Operator opts into RC or Beta in Settings.

The same git tag may appear in all three channels over time, with the channel marker advancing as confidence grows. A tag never moves backwards (no demotion of a Recommended build to RC; if a Recommended build is found broken, a new build is published as the fix and the broken one is marked deprecated).
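The no-demotion rule is cheap to enforce mechanically in the publishing pipeline. A minimal Go sketch, assuming an ordered channel type (all names hypothetical):

// Sketch of the monotonic channel marker: a tag's channel only advances.
package channels

import "fmt"

// Channel values are ordered by confidence.
type Channel int

const (
	Beta Channel = iota
	ReleaseCandidate
	Recommended
)

// Promote rejects demotion: a broken Recommended build is never dropped
// back to RC; a new build ships as the fix and the broken one is marked
// deprecated instead.
func Promote(current, next Channel) (Channel, error) {
	if next < current {
		return current, fmt.Errorf("channel demotion %d -> %d forbidden: publish a fixed build and deprecate this one", current, next)
	}
	return next, nil
}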

Operator UX (Dashboard → Settings → Updates)

┌─ Software updates ───────────────────────────────────────────────┐
│                                                                  │
│ Current version: v4.5.2 (released 2026-04-22)                    │
│ Channel: ● Recommended  ◯ Release Candidate  ◯ Beta             │
│                                                                  │
│ ── Available updates ──────────────────────────────────────────  │
│                                                                  │
│ ✦ NEW v4.5.3 in Recommended channel                             │
│   Released: 2026-05-06 · Soaked 12 days in RC · 0 incidents     │
│                                                                  │
│   What's new:                                                    │
│   • Fix: BGP saturation Annex L convergence-time graph rounding │
│   • Feat: capacity-fitted mode auto-detect for 7 new vendors    │
│   • Docs: Help Center entry for routing-table stress (3 langs)  │
│                                                                  │
│   Download size: 184 MB                                          │
│   Estimated upgrade duration: 15-30 min (with health checks)    │
│                                                                  │
│   [ Upgrade now ↑ ]   [ Schedule upgrade ⏰ ]   [ Skip version ]│
│                                                                  │
│ ── Schedule (when "Schedule upgrade" clicked) ────────────────── │
│                                                                  │
│   Date: [2026-05-10] Time: [02:00 UTC]                          │
│   ☑ Run preflight health check 30 min before                    │
│   ☑ Send completion notification to operator email              │
│   ☑ Auto-fallback if post-upgrade health check fails (self-heal)│
│                                                                  │
│   [ Confirm schedule ]                                          │
│                                                                  │
│ ── History ──────────────────────────────────────────────────── │
│                                                                  │
│ 2026-04-22  v4.5.2  ✓ Healthy (manual upgrade by operator)      │
│ 2026-03-30  v4.5.1  ✓ Healthy (scheduled upgrade)               │
│ 2026-03-12  v4.5.0  ⚠ Auto-rollback (health check failed)       │
│                       reason: persona-postgres CrashLoopBackOff │
│                       → reverted to v4.4.7 in 2 min 14 sec      │
└──────────────────────────────────────────────────────────────────┘

Upgrade ceremony

End-to-end timeline (target ~15-30 min for the typical case; the health-check phase accounts for most of the variance):

T+0:00  Operator clicks Upgrade now (or scheduled time arrives)
T+0:01  Preflight health check
        - Postgres up, all pods Running
        - No active test runs (or operator confirms running ones may finish or be cancelled)
        - Disk space OK
        - Certificates valid
        - HQ reachable (or sideload bundle present)
        → if FAIL: abort, notify operator, no changes
T+0:02  Cloner downloads release artefact + signature manifest
        - Verify signature against TLSStress.Art HQ public key
        → if signature invalid: abort hard, log to audit
T+0:04  Pre-upgrade snapshot
        - Layer 2 of backup architecture
        - Snapshot ID recorded for potential rollback
T+0:06  Drain traffic gracefully
        - Block new test runs from starting
        - Existing runs allowed to finish up to 5 min (or operator cancel)
T+0:11  Apply upgrade
        - Database migrations run (forward; reverse migrations stored)
        - Rolling pod replacement in dependency order:
          dashboard → orchestrator → agents → personas
T+0:18  Post-upgrade health check (multi-signal, ~3-7 min)
        See "Self-healing — what triggers a rollback" section below
T+0:25  ✓ SUCCESS — upgrade history updated, operator notified

        OR if any HARD-FAIL signal triggers:
T+0:25  ✗ FAILURE — auto-rollback initiated
T+0:26  Restore from pre-upgrade snapshot (Layer 3 restore wizard, automated path)
T+0:28  Verify pre-upgrade state restored (smoke-test plan + persona ping)
T+0:29  ✓ Rolled back to previous version, operator notified with the
        failure reason captured for vendor support and any HQ telemetry
        (operator opt-in)

The 60-second target for "rollback decision triggered" → "rollback in progress" is a separate, narrower SLA than the full upgrade ceremony. Implementation: health-check signals stream into a decision engine that fires the rollback orchestrator the moment any HARD signal trips, even if other signals are still in flight.
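A minimal Go sketch of that decision engine, assuming probe results arrive on a channel as each signal completes (all names are hypothetical). The rollback callback only starts the restore, which is what keeps the 60-second SLA scoped to "rollback in progress":

// Sketch of the post-upgrade decision engine: fire rollback on the
// first HARD signal, without waiting for in-flight probes.
package health

import (
	"context"
	"time"
)

type Severity int

const (
	SoftWarn Severity = iota
	HardFail
)

type Signal struct {
	Name     string
	Severity Severity
	Detail   string
}

func Watch(ctx context.Context, signals <-chan Signal, rollback func(Signal), warn func(Signal)) {
	for {
		select {
		case <-ctx.Done():
			return // window elapsed with no hard fail: upgrade holds
		case s, ok := <-signals:
			if !ok {
				return // every probe reported in
			}
			switch s.Severity {
			case HardFail:
				rollback(s) // start the restore immediately
				return
			case SoftWarn:
				warn(s) // notify + audit-log; upgrade still succeeds
			}
		}
	}
}

// RunWindow bounds the ceremony's health-check phase (~3-7 min).
func RunWindow(signals <-chan Signal, rollback func(Signal), warn func(Signal)) {
	ctx, cancel := context.WithTimeout(context.Background(), 7*time.Minute)
	defer cancel()
	Watch(ctx, signals, rollback, warn)
}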

Self-healing — what triggers a rollback

Post-upgrade health check is multi-signal. Each signal has its own threshold and severity:

| Signal | Threshold | Severity |
|---|---|---|
| Any pod CrashLoopBackOff | 1 occurrence within 5 min | Hard fail → rollback |
| Postgres reachable + accepting writes | Must pass within 60 s | Hard fail → rollback |
| Smoke-test plan returns a valid result | Must pass on first attempt | Hard fail → rollback |
| Dashboard render (/dashboard returns 200) | Must pass within 60 s | Hard fail → rollback |
| Persona forwarding plane intact (3 sample personas pingable) | ≥ 2 of 3 pass | Soft warn |
| BGP peer re-establishes (if bgp_stress.enabled) | Must pass within 5 min | Hard fail → rollback |
| Latency baseline | Within 20% of pre-upgrade | Soft warn |
| Disk space available | > 10 GB free | Hard fail → rollback |
| Cert chain still valid | > 30 days remaining | Soft warn |

Hard fail = automatic rollback within 60 seconds of detection. Soft warn = operator notified, warning captured in the audit log, upgrade recorded as a success. The operator can still roll back manually if a soft-warn pattern looks problematic.

Health-check tuning is iterative — start conservative (more hard fails) with strong observability into which signals fire, then ease specific signals to soft warn once historical false-positive data justifies the change.
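That tuning loop is easiest when thresholds and severities live as data rather than being hard-coded into probe logic: easing one signal from hard fail to soft warn becomes a one-line change. A hedged Go sketch of the signal table above (names and timeout encodings are illustrative):

// The signal table as data; a zero Timeout means "single attempt".
package health

import "time"

type Severity int

const (
	SoftWarn Severity = iota
	HardFail
)

type Check struct {
	Name     string
	Timeout  time.Duration // window the probe has to pass
	Severity Severity
}

// DefaultChecks mirrors the conservative starting set in the table.
var DefaultChecks = []Check{
	{"pod-crashloop", 5 * time.Minute, HardFail},
	{"postgres-writes", 60 * time.Second, HardFail},
	{"smoke-test-plan", 0, HardFail}, // first attempt, no retry window
	{"dashboard-render", 60 * time.Second, HardFail},
	{"persona-forwarding", 0, SoftWarn},    // >= 2 of 3 sample personas
	{"bgp-peer", 5 * time.Minute, HardFail}, // only if bgp_stress.enabled
	{"latency-baseline", 0, SoftWarn},      // within 20% of pre-upgrade
	{"disk-space", 0, HardFail},            // > 10 GB free
	{"cert-chain", 0, SoftWarn},            // valid, > 30 days remaining
}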

Release artefact format

Each release ships as a signed bundle:

| Component | Purpose |
|---|---|
| Container image manifest | List of all pod images and their digests |
| Database migration scripts | Drizzle / SQL files; forward + reverse always provided |
| Configuration delta | topology.yaml schema additions; breaking changes flagged |
| Release notes (3 langs: en + pt-BR + es) | What changed and why |
| Rollback marker | Earliest version this release can roll back to |
| Channel marker | Which channel(s) this release is in (Recommended/RC/Beta) |

Signature: TLSStress.Art HQ uses sigstore / cosign (already in the supply-chain stack — see ADR 0005). Operators verify the public key via pinned-cert.pem shipped with the bench. The public key is rotated on a documented schedule with overlapping validity windows.
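For illustration, a hedged sketch of the operator-side verify-then-parse step. The manifest fields are assumptions drawn from the table above, and the cosign call is the documented key-based blob verification (exact flags vary by cosign version):

// Verify the bundle manifest before trusting anything inside it.
package release

import (
	"encoding/json"
	"fmt"
	"os"
	"os/exec"
)

type Manifest struct {
	Version        string            `json:"version"`         // e.g. "v4.5.3"
	Images         map[string]string `json:"images"`          // pod -> image digest
	Migrations     []string          `json:"migrations"`      // forward + reverse pairs
	RollbackMarker string            `json:"rollback_marker"` // earliest rollback target
	Channels       []string          `json:"channels"`        // recommended / rc / beta
}

// Verify fails closed: any error aborts the upgrade, and the caller
// writes the failure to the audit log.
func Verify(manifestPath, sigPath, pubKeyPath string) (*Manifest, error) {
	cmd := exec.Command("cosign", "verify-blob",
		"--key", pubKeyPath, // pinned key shipped with the bench
		"--signature", sigPath,
		manifestPath)
	if out, err := cmd.CombinedOutput(); err != nil {
		return nil, fmt.Errorf("signature verification failed: %v: %s", err, out)
	}
	raw, err := os.ReadFile(manifestPath)
	if err != nil {
		return nil, err
	}
	var m Manifest
	if err := json.Unmarshal(raw, &m); err != nil {
		return nil, err
	}
	return &m, nil
}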

Database migration discipline

Every migration ships forward AND reverse. CI gate: PR cannot merge without a successfully tested reverse migration.

This is the single requirement most likely to be relaxed under schedule pressure. It is not negotiable: relaxing it even once breaks rollback for every release window that contains the affected migration.

For migrations that are genuinely irreversible (e.g. column drops where the original data cannot be recovered), the release notes flag this clearly and the rollback marker is set so operators cannot roll back past it.
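The CI gate itself can be a single round-trip test: apply the forward migration, apply the reverse, and require the schema to end up exactly where it started. A sketch, with the scratch-database and migration-runner helpers stubbed as hypotheticals:

// CI gate sketch: a PR's migration pair must round-trip the schema.
// openScratchDB, applyForward, and applyReverse are hypothetical
// helpers; the fingerprint query is plain information_schema Postgres.
package migrations

import (
	"database/sql"
	"testing"
)

// schemaFingerprint reduces the public schema (tables, columns, types)
// to one hash so "schema unchanged" is a single string comparison.
func schemaFingerprint(t *testing.T, db *sql.DB) string {
	t.Helper()
	var fp string
	err := db.QueryRow(`
		SELECT md5(string_agg(table_name || ':' || column_name || ':' || data_type,
		                      ',' ORDER BY table_name, column_name))
		FROM information_schema.columns
		WHERE table_schema = 'public'`).Scan(&fp)
	if err != nil {
		t.Fatal(err)
	}
	return fp
}

func TestMigrationRoundTrip(t *testing.T) {
	db := openScratchDB(t)
	before := schemaFingerprint(t, db)

	applyForward(t, db) // the PR's forward migration
	applyReverse(t, db) // its mandatory reverse migration

	if after := schemaFingerprint(t, db); after != before {
		t.Fatalf("reverse migration does not restore the schema: %s != %s", after, before)
	}
}

// Hypothetical helpers, stubbed so the sketch compiles:
func openScratchDB(t *testing.T) *sql.DB { t.Skip("wire to a throwaway CI Postgres"); return nil }

func applyForward(t *testing.T, db *sql.DB) {}

func applyReverse(t *testing.T, db *sql.DB) {}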

Air-gap operator support

Operators without HQ egress can:

  1. Download upgrade-bundle-vX.Y.Z.tar.gz from a public mirror (or carry it over by USB stick from a connected machine)
  2. Upload it via Dashboard → Settings → Updates → "Sideload bundle"
  3. The bundle includes the same signed artefacts as the online path
  4. Verification, snapshot, drain, apply, and health check all run identically (see the sketch after this list)
  5. Telemetry on the outcome stays local (or is uploaded later when connectivity returns, operator opt-in)
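Architecturally, "runs identically" falls out of making the bundle source an interface: sideload and HQ feed differ only in where the tarball comes from. A Go sketch with illustrative names:

// Sketch of why sideload and HQ-feed upgrades share one code path.
package upgrade

import (
	"errors"
	"io"
	"os"
)

// BundleSource abstracts where the signed bundle comes from. Everything
// downstream (signature verification, snapshot, drain, apply, health
// check) consumes a BundleSource and never knows which one it got.
type BundleSource interface {
	Fetch(version string) (io.ReadCloser, error)
}

// hqFeed pulls from the TLSStress.Art HQ release feed via the Cloner.
type hqFeed struct{ /* Cloner client would live here */ }

func (hqFeed) Fetch(version string) (io.ReadCloser, error) {
	return nil, errors.New("sketch: wire to the Cloner download path")
}

// sideload reads an operator-uploaded upgrade-bundle-vX.Y.Z.tar.gz.
type sideload struct{ path string }

func (s sideload) Fetch(version string) (io.ReadCloser, error) {
	return os.Open(s.path) // bundle already on disk via the upload form
}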

Telemetry on upgrade outcomes

Operator opt-in (default OFF). When enabled, anonymous telemetry on upgrade success/failure goes to HQ. Useful for:

  • Engineering preempting bugs that hit multiple operators
  • Channel-promotion confidence (RC has no incidents in 4 weeks → safer to promote)
  • Public quality-metrics reporting (e.g. "98.7% of upgrades succeeded on first attempt over the last quarter")

What telemetry never captures:

  • Operator's data, test results, or personas
  • IP addresses or identifying information
  • Full configuration; only minimal version + outcome metadata is sent (sketched below)
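For illustration, the entire opt-in payload could be as small as the following struct. Every field name here is an assumption, but the shape shows what "minimal version + outcome metadata" means:

// Sketch of the complete opt-in telemetry payload.
package telemetry

type UpgradeOutcome struct {
	FromVersion string `json:"from_version"` // e.g. "v4.5.2"
	ToVersion   string `json:"to_version"`   // e.g. "v4.5.3"
	Channel     string `json:"channel"`      // recommended / rc / beta
	Success     bool   `json:"success"`
	RolledBack  bool   `json:"rolled_back"`
	FailReason  string `json:"fail_reason,omitempty"` // health-signal name only
	DurationSec int64  `json:"duration_seconds"`
	// Deliberately absent: operator identity, IP addresses, topology or
	// test configuration, results, personas. Anonymity is structural,
	// not a post-hoc filter.
}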

Consequences

Positive

  • Quality differentiator: self-healing rollback is a marketing-grade feature. Procurement teams recognise the pattern from Meraki and trust it.
  • Operator confidence: one-click upgrade is the difference between operators staying current and operators getting stuck on v4.5 forever because "upgrades scare us".
  • Engineering velocity: when operators upgrade promptly, engineering ships fixes that actually reach the field. When they don't, the field freezes on old versions that vendor support has to back-port fixes to indefinitely.
  • Aligned with quality policy: every principle (predictability / empathy / transparency / polish / docs first / maintenance discipline) maps directly onto a property of this design.
  • Reuses backup/DR architecture: pre-upgrade snapshot and rollback are not new infrastructure; they are calls into Layers 2 and 3 of the backup memo's design.

Negative / costs

  • Database migration discipline is expensive: forward + reverse for every migration is roughly 2x the effort of forward-only. Engineering culture has to internalise this. CI gate is the enforcement mechanism, but discipline can erode if the gate is ever bypassed.
  • HQ release-feed infrastructure: needs ongoing maintenance. Signing key rotation, channel-promotion workflow, public mirror for air-gap, telemetry ingestion. Not free; budget for it.
  • Health-check tuning is hard: false positives waste operator time + erode confidence; false negatives ship bad upgrades. Iterative tuning over many releases.
  • Multi-node upgrade is deferred: the single-node case is clean. Multi-node, with personas distributed across nodes, drain coordinated cluster-wide, and Postgres pinned to a specific pod, is genuinely harder. Marked as v4.8+ work.
  • The earlier 6-9 minute duration target was unrealistic: this ADR revises it to 15-30 minutes. Operators forgive long planned upgrades; they do not forgive short ones that overrun.

Neutral

  • Channel-promotion automation: today (per Open Question 1) we default to manual SRE sign-off for RC → Recommended. This adds ops cost but is safer.
  • Skip-version policy: per Open Question 2, this needs explicit rules. CRITICAL upgrades cannot be skipped, only delayed.

Implementation roadmap

This ADR is PR-1 of a 7-PR feature. The remaining six are queued:

| PR | Scope | Estimate |
|---|---|---|
| PR-1 (this) | ADR 0013 + project memo + topology.yaml schema additions | ~250 LoC docs |
| PR-2 | HQ release-feed publishing pipeline (CI workflow that promotes builds across channels; signs artefacts) | ~250 LoC GitHub Actions + Go signing helper |
| PR-3 | Operator-side: Cloner Function 7 (poller + download + signature verify) — same code as Cloner-as-Platform PR-7 | ~250 LoC (shared with cloner platform feature) |
| PR-4 | Upgrade orchestrator (preflight + drain + apply + health check + auto-rollback) | ~600 LoC |
| PR-5 | Dashboard UI — Settings > Updates section (channel selection + history + schedule) | ~450 LoC TSX |
| PR-6 | Help Center entry "How upgrades work" + 2-min video tutorial (3 langs) | ~250 LoC docs + Mux pipeline |
| PR-7 | E2E test — upgrade-then-rollback dry run in CI; sideload-bundle path | ~250 LoC |
| Total | ~2,300 LoC across 7 PRs (PR-3 shared with cloner) | ~3-4 week sprint |

Suggested target: v4.7 (alongside Cloner-as-Platform expansion — they ship together because PR-3 here is the same code as Function 7 of the Cloner platform memo).

Open questions

These do not block ADR acceptance — captured for resolution during PR-2..PR-7.

  1. Channel promotion automation — RC → Recommended currently requires manual SRE sign-off. Should we automate "no incidents in 4 weeks → auto-promote" or keep human gate? Default: human gate (safer, doesn't add ops cost).
  2. Skip-version policy — operator can skip a version (button shown). What if v4.5.4 fixes a critical CVE in v4.5.3 — do we force-prompt for it? CRITICAL upgrades cannot be skipped, only delayed (max 7 days). Non-critical upgrades freely skippable.
  3. Air-gap operators — without Cloner egress to HQ release feed, sideload via admin upload form. UX needs first-class coverage; not a corner case.
  4. License verification on upgrade — does each upgrade re-check operator's license / acceptance? Probably yes (gates upgrade on license_accepted_at being current).
  5. Upgrade in dual-node / multi-node deployments — per-node rolling? Drain one at a time? Default v4.7: drain whole cluster, single brief downtime, clean state. Multi-node phased rolling is v4.8+ follow-up work.
  6. Telemetry collection during upgrade — failure modes feed back to HQ (with operator opt-in) so engineering can preempt the next operator hitting the same bug. Default: opt-in, anonymous, off by default.
  7. Public-key rotation cadence — TLSStress.Art HQ signing key rotates on what schedule? Annual with 6-month overlap recommended.
  8. Build provenance — SLSA Level 3? GitHub Actions native provenance is L3-compatible; we already have cosign signing. Adopt explicitly.

References

  • ADR 0005 — Supply chain (multiarch + cosign signing) — reused for release artefact signatures
  • ADR 0011 — Topology axes — multi-node upgrade is deferred per this dimension
  • ADR 0012 — BGP saturation — adds BGP peer to the post-upgrade health-check signal set when bgp_stress.enabled
  • project_self_upgrade_meraki_style_2026_05_08 (memory) — original vision memo
  • project_cloner_platform_2026_05_08 (memory) — Function 7 (patch fetching) is the download channel for this feature; PR-3 here is the same code
  • project_backup_dr_strategy_2026_05_08 (memory) — Layer 2 snapshot + Layer 3 restore are reused for pre-upgrade snapshot and rollback
  • project_quality_excellence_policy_2026_05_08 (memory) — every principle (predictability / empathy / transparency / polish / docs first / maintenance discipline) maps onto a property of this design
  • discuss_help_center_2026_05_08 (memory) — Help Center entry "How upgrades work" lands in PR-6
  • project_marketing_site_obligation_2026_05_08 (memory) — every release needs site update + video; the upgrade feature itself needs marketing copy explaining the self-healing differentiator