feat(fleet): agent self-upgrade + auto-rollback protocol, ADR-022 (Ch4) #330
Open
johnride wants to merge 1 commit from feat/fleet-ch4-agent-upgrade into feat/fleet-ch3-log-streaming
Full ADR-022 protocol end to end. The state-machine brain and the operator's
commit decision are exhaustively unit-tested; OS side-effects sit behind a seam
so they're faked in tests and real on-device.

Contracts (harmony-reconciler-contracts):
- agent-upgrade marker + status KV buckets, AgentUpgradePhase, agent_version on
  the heartbeat, Verb::UpgradeStop on the command protocol.

Shared (new crate harmony_downloadable_asset):
- download + SHA-256 verify, lifted from k3d's pub(crate) copy; k3d now depends
  on it (DRY: the agent is the second consumer). Tested with httptest.

Agent (harmony-fleet-agent):
- `drive`: Staging -> Verifying -> CutoverReady -> wait-for-operator-stop, with
  heartbeat-timeout revert. 6 unit tests incl. every failure/rollback path.
- UpgradeExecutor seam + real SystemdUpgradeExecutor: download+verify,
  `--self-test`, atomic symlink swap, systemd-run transient unit, revert. The
  executor self-heals the on-disk layout so first-upgrade rollback is safe even
  before M1 (preserves the running binary at its versioned path).
- `--self-test` flag; Verb::UpgradeStop handling gated by an armed
  UpgradeStopSignal so only the cutover-waiting old agent acts (both agents are
  subscribed). The agent never self-stops.

Operator (harmony-fleet-operator):
- upgrade_coordinator: sends the stop ONLY after independently observing the new
  version's heartbeat (single source of truth); reflects currentVersion + the
  upgrade phase onto the Device CR. 2 unit tests on the commit decision.
- FleetCommandsClient::upgrade_stop; Device.status.{currentVersion, upgrade}.

Deviations + flagged follow-ups (M1 clean install, libvirt vX->vX+1 e2e) in
ROADMAP/fleet_platform/ch4-agent-upgrade-status.md. Marker/status ride NATS KV
(survive operator restart, per Ch2).

654b979da3 to 76ecf6da42
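For reviewers who want the shape of the protocol before opening the diff, here is a minimal sketch of the agent-side state machine and the operator's commit rule described above. It is not the code in this PR. The `Phase` variants `Committed` and `RolledBack`, the method names on the trait, the mapping of executor steps to phases, `FakeExecutor`, and `should_send_stop` are all hypothetical names chosen for illustration; only `drive`, `UpgradeExecutor`, and the Staging/Verifying/CutoverReady phases come from the description.

```rust
use std::time::Duration;

/// Upgrade phases. Staging, Verifying and CutoverReady are named in the PR
/// description; Committed and RolledBack are assumed terminal states.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Staging,
    Verifying,
    CutoverReady,
    Committed,
    RolledBack,
}

/// Seam for OS side effects (hypothetical method set). A real implementation
/// would download, swap symlinks and start a systemd unit; tests use a fake.
pub trait UpgradeExecutor {
    fn stage(&mut self, version: &str) -> Result<(), String>;
    fn self_test(&mut self, version: &str) -> Result<(), String>;
    fn cutover(&mut self, version: &str) -> Result<(), String>;
    /// Returns true if the operator's stop arrived before the timeout.
    fn wait_for_stop(&mut self, timeout: Duration) -> bool;
    fn revert(&mut self);
}

/// Walks the phases in order, recording each one in `trace`.
/// Any failure, or a missing stop from the operator, triggers a revert.
pub fn drive(
    exec: &mut dyn UpgradeExecutor,
    version: &str,
    stop_timeout: Duration,
    trace: &mut Vec<Phase>,
) -> Phase {
    trace.push(Phase::Staging);
    if exec.stage(version).is_err() {
        return rollback(exec, trace);
    }
    trace.push(Phase::Verifying);
    if exec.self_test(version).is_err() || exec.cutover(version).is_err() {
        return rollback(exec, trace);
    }
    trace.push(Phase::CutoverReady);
    // The old agent never stops itself: it waits for the operator's verdict.
    if exec.wait_for_stop(stop_timeout) {
        trace.push(Phase::Committed);
        Phase::Committed
    } else {
        rollback(exec, trace)
    }
}

fn rollback(exec: &mut dyn UpgradeExecutor, trace: &mut Vec<Phase>) -> Phase {
    exec.revert();
    trace.push(Phase::RolledBack);
    Phase::RolledBack
}

/// Operator-side commit rule (hypothetical helper): send the stop only once
/// a heartbeat reporting the target version has been observed.
pub fn should_send_stop(target: &str, observed_heartbeat_version: Option<&str>) -> bool {
    observed_heartbeat_version == Some(target)
}

/// Test double for the seam; flags select which step fails.
#[derive(Default)]
pub struct FakeExecutor {
    pub fail_self_test: bool,
    pub stop_arrives: bool,
    pub reverted: bool,
}

impl UpgradeExecutor for FakeExecutor {
    fn stage(&mut self, _version: &str) -> Result<(), String> {
        Ok(())
    }
    fn self_test(&mut self, _version: &str) -> Result<(), String> {
        if self.fail_self_test {
            Err("self-test failed".to_string())
        } else {
            Ok(())
        }
    }
    fn cutover(&mut self, _version: &str) -> Result<(), String> {
        Ok(())
    }
    fn wait_for_stop(&mut self, _timeout: Duration) -> bool {
        self.stop_arrives
    }
    fn revert(&mut self) {
        self.reverted = true;
    }
}
```

With a `FakeExecutor` whose `stop_arrives` is true, `drive` ends in `Phase::Committed` without reverting; with the default fake (no stop), it reverts and ends in `Phase::RolledBack`. Keeping the commit decision on the operator side means the old agent cannot declare success on its own evidence.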