feat(fleet): agent self-upgrade + auto-rollback protocol, ADR-022 (Ch4) #330
Open
johnride wants to merge 1 commit from feat/fleet-ch4-agent-upgrade into feat/fleet-ch3-log-streaming
pull from: feat/fleet-ch4-agent-upgrade
merge into: NationTech:feat/fleet-ch3-log-streaming
1 Commit
Commit 76ecf6da42:
feat(fleet): agent self-upgrade + auto-rollback protocol, ADR-022 (Ch4)
All checks were successful
Run Check Script / check (pull_request) Successful in 2m35s
Implements the full ADR-022 protocol end to end. The upgrade state machine and
the operator's commit decision are exhaustively unit-tested; OS side effects sit
behind a seam, so they are faked in tests and real on-device.
Contracts (harmony-reconciler-contracts):
- agent-upgrade marker + status KV buckets, AgentUpgradePhase, agent_version on
the heartbeat, Verb::UpgradeStop on the command protocol.
Shared (new crate harmony_downloadable_asset):
- download + SHA-256 verify, lifted from k3d's pub(crate) copy; k3d now depends
on it (DRY: the agent is the second consumer). Tested with httptest.
Agent (harmony-fleet-agent):
- `drive`: Staging -> Verifying -> CutoverReady -> wait-for-operator-stop, with
heartbeat-timeout revert. 6 unit tests incl. every failure/rollback path.
- UpgradeExecutor seam + real SystemdUpgradeExecutor: download+verify,
`--self-test`, atomic symlink swap, systemd-run transient unit, revert. The
executor self-heals the on-disk layout so first-upgrade rollback is safe even
before M1 (preserves the running binary at its versioned path).
- `--self-test` flag; Verb::UpgradeStop handling gated by an armed
UpgradeStopSignal so only the cutover-waiting old agent acts (both agents are
subscribed). The agent never self-stops.
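The `drive` flow and the executor seam described above could look roughly like the sketch below. It is an illustration only, not the code in this PR: the trait method names, the `Committed` and `RolledBack` variants, the `FakeExecutor`, and the mapping of executor steps onto phases are all assumptions made for the example. Only the phase names `Staging`, `Verifying`, and `CutoverReady` and the type names `UpgradeExecutor` and `AgentUpgradePhase` come from the commit message.

```rust
use std::time::Duration;

/// Phases named in the commit message, plus two terminal variants that are
/// assumptions of this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AgentUpgradePhase {
    Staging,
    Verifying,
    CutoverReady,
    Committed,  // assumption: the operator's stop arrived
    RolledBack, // assumption: a step failed or the wait timed out
}

/// Seam for OS side effects. Method names are hypothetical.
pub trait UpgradeExecutor {
    fn stage(&mut self) -> Result<(), String>; // download + SHA-256 verify
    fn self_test(&mut self) -> Result<(), String>; // run new binary with --self-test
    fn cutover(&mut self) -> Result<(), String>; // symlink swap + start new agent
    /// Returns false if the operator's stop did not arrive within `timeout`.
    fn wait_for_operator_stop(&mut self, timeout: Duration) -> bool;
    fn revert(&mut self);
}

/// Walks the phases in order and returns the trace of phases visited.
/// Any failure, or a timeout while waiting for the operator, triggers a revert.
pub fn drive<E: UpgradeExecutor>(exec: &mut E, stop_timeout: Duration) -> Vec<AgentUpgradePhase> {
    use AgentUpgradePhase::*;
    let mut trace = vec![Staging];
    if exec.stage().is_err() {
        return rollback(exec, trace);
    }
    trace.push(Verifying);
    if exec.self_test().is_err() {
        return rollback(exec, trace);
    }
    if exec.cutover().is_err() {
        return rollback(exec, trace);
    }
    trace.push(CutoverReady);
    // The old agent never stops itself; it only waits for the operator.
    if exec.wait_for_operator_stop(stop_timeout) {
        trace.push(Committed);
        trace
    } else {
        rollback(exec, trace)
    }
}

fn rollback<E: UpgradeExecutor>(
    exec: &mut E,
    mut trace: Vec<AgentUpgradePhase>,
) -> Vec<AgentUpgradePhase> {
    exec.revert();
    trace.push(AgentUpgradePhase::RolledBack);
    trace
}

/// Test double standing in for the real systemd-backed executor.
#[derive(Default)]
pub struct FakeExecutor {
    pub fail_stage: bool,
    pub fail_self_test: bool,
    pub stop_arrives: bool,
    pub reverted: bool,
}

impl UpgradeExecutor for FakeExecutor {
    fn stage(&mut self) -> Result<(), String> {
        if self.fail_stage { Err("checksum mismatch".into()) } else { Ok(()) }
    }
    fn self_test(&mut self) -> Result<(), String> {
        if self.fail_self_test { Err("self-test failed".into()) } else { Ok(()) }
    }
    fn cutover(&mut self) -> Result<(), String> {
        Ok(())
    }
    fn wait_for_operator_stop(&mut self, _timeout: Duration) -> bool {
        self.stop_arrives
    }
    fn revert(&mut self) {
        self.reverted = true;
    }
}
```

Keeping `drive` generic over the executor is what lets every failure and rollback path run in a unit test without touching systemd or the filesystem.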
Operator (harmony-fleet-operator):
- upgrade_coordinator: sends the stop ONLY after independently observing the new
version's heartbeat (single source of truth); reflects currentVersion + the
upgrade phase onto the Device CR. 2 unit tests on the commit decision.
- FleetCommandsClient::upgrade_stop; Device.status.{currentVersion, upgrade}.
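The two halves of the stop handshake, the operator's commit decision and the agent's armed stop signal, can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: `CommitDecision`, `commit_decision`, and the `arm` / `on_upgrade_stop` methods are hypothetical names, and the real coordinator reads heartbeats and sends commands over NATS rather than taking plain arguments.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Outcome of the operator's commit check (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum CommitDecision {
    SendStop,
    Wait,
}

/// Operator side: send the stop only once a heartbeat carrying the target
/// version has been observed. No heartbeat, or a heartbeat from the old
/// version, means keep waiting.
pub fn commit_decision(target_version: &str, observed_heartbeat: Option<&str>) -> CommitDecision {
    match observed_heartbeat {
        Some(v) if v == target_version => CommitDecision::SendStop,
        _ => CommitDecision::Wait,
    }
}

/// Agent side: both old and new agents receive Verb::UpgradeStop, so only an
/// agent that armed the signal (the old one, waiting at cutover) may act on it.
#[derive(Default)]
pub struct UpgradeStopSignal {
    armed: AtomicBool,
}

impl UpgradeStopSignal {
    /// Called by the old agent when it starts waiting for the operator.
    pub fn arm(&self) {
        self.armed.store(true, Ordering::SeqCst);
    }

    /// Returns true if this agent should act on the stop. Disarms on use,
    /// so a duplicate stop command is ignored.
    pub fn on_upgrade_stop(&self) -> bool {
        self.armed.swap(false, Ordering::SeqCst)
    }
}
```

Keeping the decision a pure function of the target version and the observed heartbeat makes it cheap to unit-test, and gating on an armed flag means the freshly started agent ignores a stop that was meant for its predecessor.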
Deviations and flagged follow-ups (M1 clean install, libvirt vX->vX+1 e2e) are
recorded in ROADMAP/fleet_platform/ch4-agent-upgrade-status.md. The marker and
status are stored in NATS KV, so they survive an operator restart (per Ch2).