feat(fleet): graceful roll-forward upgrade + container-ID identity (Ch5) #331

Open
johnride wants to merge 1 commit from feat/fleet-ch5-graceful-deploy-upgrade into feat/fleet-ch4-agent-upgrade

Fixes the 30s redeploy loop and adds graceful deployment upgrades — same root:
how the agent identifies a running container vs. its desired spec.

Redeploy-loop fix (container identity by id/name, not spec compare):

  • ensure_service_running is now liveness-only — a running container is a NOOP,
    no spec comparison (podman ps can't read env/volumes, which made the old
    matches_spec recreate env-bearing services every tick). The periodic tick
    adopts-or-restarts by name and never recreates a healthy container.
  • Spec changes are detected by the reconciler's existing byte-compare of the
    desired-state JSON, not by the runtime. The agent records each container's id
    (from start_service) in DeploymentState.container_ids.
  • Deleted the matches_spec FIXME and its always-drift hack.
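
The split described above (liveness in the runtime, change detection in the reconciler) can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ToyRuntime`, `Status`, `TickAction`, and `spec_changed` are hypothetical names, and the in-memory map stands in for podman.

```rust
use std::collections::HashMap;

/// Observed container state. Hypothetical: the PR does not show what
/// `container_status` actually returns.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Running,
    Stopped,
}

/// What one periodic tick did for a service (illustrative only).
#[derive(Debug, PartialEq, Eq)]
pub enum TickAction {
    Noop,
    Restarted,
    Started,
}

/// Toy in-memory stand-in for podman, keyed by container name.
#[derive(Default)]
pub struct ToyRuntime {
    pub containers: HashMap<String, Status>,
}

impl ToyRuntime {
    /// Liveness-only check: looks at whether the named container runs,
    /// never at its env, volumes, or image.
    pub fn ensure_service_running(&mut self, name: &str) -> TickAction {
        match self.containers.get(name).copied() {
            // Healthy container: adopt it as-is, never recreate.
            Some(Status::Running) => TickAction::Noop,
            // Exists but not running: restart by name.
            Some(Status::Stopped) => {
                self.containers.insert(name.to_string(), Status::Running);
                TickAction::Restarted
            }
            // No container with that name yet: start one.
            None => {
                self.containers.insert(name.to_string(), Status::Running);
                TickAction::Started
            }
        }
    }
}

/// Change detection lives outside the runtime: a byte comparison of the
/// last applied desired-state document against the new one.
pub fn spec_changed(applied: Option<&[u8]>, desired: &[u8]) -> bool {
    applied != Some(desired)
}
```

Under these assumptions a second tick on a running container always yields `TickAction::Noop`, which is the property that ends the redeploy loop; only a differing desired-state document triggers a replacement.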

Graceful roll-forward upgrade:

  • PodmanV0Score.lifecycle: Option<LifecyclePolicy> (SIGTERM, 30s grace, SIGKILL
    fallback). stop_signal baked into the container at create; grace drives
    podman stop --time.
  • On a changed score the reconciler stops the exact old container by its
    recorded id (graceful), then starts the new one; dropped services are stopped.
    Roll-forward only — a failure reports Phase::Failed, never reverts.
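
As a rough sketch of how such a policy could map onto podman arguments: the field names `stop_signal` and `grace`, the `Default` values (read off the parenthetical above), and the helper functions `create_flags` and `stop_command` are assumptions for illustration, not the definitions in this PR. The `--stop-signal` flag of `podman create` and the `--time` flag of `podman stop` are real; podman sends the stop signal and falls back to SIGKILL once the timeout elapses.

```rust
use std::time::Duration;

/// Hypothetical shape of the lifecycle policy; field names are assumed.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LifecyclePolicy {
    pub stop_signal: String,
    pub grace: Duration,
}

impl Default for LifecyclePolicy {
    /// Assumed defaults: SIGTERM with a 30 second grace period.
    fn default() -> Self {
        Self {
            stop_signal: "SIGTERM".to_string(),
            grace: Duration::from_secs(30),
        }
    }
}

/// Flags added at container creation so the stop signal is baked in.
pub fn create_flags(policy: &LifecyclePolicy) -> Vec<String> {
    vec!["--stop-signal".to_string(), policy.stop_signal.clone()]
}

/// Full argv for a graceful stop of one container, addressed by id.
pub fn stop_command(policy: &LifecyclePolicy, container_id: &str) -> Vec<String> {
    vec![
        "podman".to_string(),
        "stop".to_string(),
        "--time".to_string(),
        policy.grace.as_secs().to_string(),
        container_id.to_string(),
    ]
}
```

With the assumed defaults, `stop_command(&LifecyclePolicy::default(), "abc123")` builds the argv for `podman stop --time 30 abc123`.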

New ContainerRuntime methods: start_service (→id), container_status, stop_service
(graceful). Reconciler is now generic over dyn ContainerRuntime, unit-tested
against a FakeRuntime: tick-idempotency (loop killed), graceful-replace-by-id,
roll-forward-no-revert, unchanged-noop.
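
The testing approach can be illustrated with a reduced sketch. The trait below keeps only two of the three methods, and every signature, the `reconcile` function, the `applied_spec` field, and the shape of `FakeRuntime` are assumptions made for the example; the real definitions live in the diff.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Phase {
    Running,
    Failed,
}

/// Reduced, hypothetical version of the runtime trait.
pub trait ContainerRuntime {
    /// Starts a service and returns the new container's id.
    fn start_service(&mut self, name: &str, spec: &str) -> Result<String, String>;
    /// Gracefully stops one container, addressed by id.
    fn stop_service(&mut self, container_id: &str) -> Result<(), String>;
}

/// Fake that records calls instead of talking to podman.
#[derive(Default)]
pub struct FakeRuntime {
    pub calls: Vec<String>,
    pub fail_start: bool,
    pub next_id: u32,
}

impl ContainerRuntime for FakeRuntime {
    fn start_service(&mut self, name: &str, _spec: &str) -> Result<String, String> {
        self.calls.push(format!("start {name}"));
        if self.fail_start {
            return Err("start failed".to_string());
        }
        self.next_id += 1;
        Ok(format!("fake-{}", self.next_id))
    }

    fn stop_service(&mut self, container_id: &str) -> Result<(), String> {
        self.calls.push(format!("stop {container_id}"));
        Ok(())
    }
}

#[derive(Default)]
pub struct DeploymentState {
    /// Last successfully applied desired state (assumed field).
    pub applied_spec: Option<String>,
    /// Service name -> id returned by `start_service`.
    pub container_ids: HashMap<String, String>,
}

/// One reconcile pass for a single service, roll-forward only.
pub fn reconcile(
    rt: &mut dyn ContainerRuntime,
    state: &mut DeploymentState,
    name: &str,
    desired: &str,
) -> Phase {
    // Unchanged desired state: nothing to do.
    if state.applied_spec.as_deref() == Some(desired) {
        return Phase::Running;
    }
    // Stop the exact old container by its recorded id.
    if let Some(old_id) = state.container_ids.get(name).cloned() {
        if rt.stop_service(&old_id).is_err() {
            return Phase::Failed;
        }
        state.container_ids.remove(name);
    }
    // Start the new one; on failure report Failed and do not revert.
    match rt.start_service(name, desired) {
        Ok(id) => {
            state.container_ids.insert(name.to_string(), id);
            state.applied_spec = Some(desired.to_string());
            Phase::Running
        }
        Err(_) => Phase::Failed,
    }
}
```

Because `reconcile` takes `&mut dyn ContainerRuntime`, a test can pass the fake and assert on the recorded call sequence: one `start` for the first apply, nothing for a repeated tick, and `stop <old id>` followed by `start` for a changed spec.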

Architecture + flagged VM v1->v2->v3 e2e in
ROADMAP/fleet_platform/ch5-graceful-upgrade-status.md.

johnride added 1 commit 2026-06-05 01:35:17 +00:00
feat(fleet): graceful roll-forward upgrade + container-ID identity (Ch5)
Some checks failed
Run Check Script / check (pull_request) Failing after 1m0s
3f33e032a3
johnride force-pushed feat/fleet-ch5-graceful-deploy-upgrade from 3f33e032a3 to dedfa19380 2026-06-05 19:48:56 +00:00 Compare
All checks were successful
Run Check Script / check (pull_request) Successful in 2m36s
Reference: NationTech/harmony#331