Fixes the 30s redeploy loop and adds graceful deployment upgrades — same root: how the agent identifies a running container vs. its desired spec. Redeploy-loop fix (container identity by id/name, not spec compare): - `ensure_service_running` is now liveness-only — a running container is a NOOP, no spec comparison (podman ps can't read env/volumes, which made the old matches_spec recreate env-bearing services every tick). The periodic tick adopts-or-restarts by name and never recreates a healthy container. - Spec changes are detected by the reconciler's existing byte-compare of the desired-state JSON, not by the runtime. The agent records each container's id (from start_service) in DeploymentState.container_ids. - Deleted the matches_spec FIXME and its always-drift hack. Graceful roll-forward upgrade: - PodmanV0Score.lifecycle: Option<LifecyclePolicy> (SIGTERM, 30s grace, SIGKILL fallback). stop_signal baked into the container at create; grace drives `podman stop --time`. - On a changed score the reconciler stops the exact old container by its recorded id (graceful), then starts the new one; dropped services are stopped. Roll-forward only — a failure reports Phase::Failed, never reverts. New ContainerRuntime methods: start_service (→id), container_status, stop_service (graceful). Reconciler is now generic over `dyn ContainerRuntime`, unit-tested against a FakeRuntime: tick-idempotency (loop killed), graceful-replace-by-id, roll-forward-no-revert, unchanged-noop. Architecture + flagged VM v1->v2->v3 e2e in ROADMAP/fleet_platform/ch5-graceful-upgrade-status.md.
3.4 KiB
Ch5 — Graceful deployment upgrade (roll-forward) + redeploy-loop fix: status
Two things shipped together because they share the same root: how the agent identifies a running container vs. its desired spec.
The redeploy-loop fix (the bug JG flagged)
podman ps doesn't surface env/volumes, so the old matches_spec declared
any service with env or volumes "drifted" and re-created it every 30s tick —
flapping container ids, connectivity blips, log noise.
Fix (per JG: "use container IDs, store them in NATS"): identity is the container, by name+id, not a spec comparison.
ensure_service_runningis now liveness-only: a running container is a NOOP, no spec compare. The periodic tick adopts-or-(re)starts by name and never recreates a healthy container. The loop is gone — proven bytick_is_idempotent_no_recreate(a service with env, the old loop trigger).- Spec changes are detected where they always were — the reconciler's byte-compare of the desired-state JSON on a KV Put — not by the runtime.
- The agent records each container's id (from
start_service) in the cache and inDeploymentState.container_ids(surfaced for the dashboard).
Graceful roll-forward upgrade
PodmanV0Score.lifecycle: Option<LifecyclePolicy>(stop_signal=SIGTERM,grace_period_secs=30). The stop signal is baked into the container at create; the grace period drivespodman stop --time, after which podman always SIGKILLs — so there's nosigkill_fallbackknob to pretend otherwise (a config field nothing honors is a defect per AGENTS.md).- On a changed score the reconciler
converges: for each service, stop the exact old container by its recorded id (SIGTERM → grace → SIGKILL), then start the new one, record the new id. Services dropped from the spec are stopped. Roll-forward only: a failure reportsPhase::Failedand is never reverted (the customer edits the spec) — proven byroll_forward_failure_reports_failed_without_revert. - The transient cutover window reports
Phase::Pending(its documented "in flight" meaning) →Running, so the dashboard reflects the step without a new Phase variant.
Architecture note
The graceful-replace orchestration lives in the agent reconciler (which
holds the cross-reconcile state: cached score + container ids), driving the
ContainerRuntime capability directly. The PodmanV0Score interpret stays the
"ensure running" path for non-agent/initial converge. container_spec is the
one place both share the PodmanService → ContainerSpec mapping. The reconciler
is now generic over dyn ContainerRuntime, so the convergence logic is unit-
tested against a FakeRuntime (no podman socket).
Tested
tick_is_idempotent_no_recreate— the loop-fix.changed_score_gracefully_replaces_by_id— stop-old-by-id then start-new.unchanged_score_is_a_noop.roll_forward_failure_reports_failed_without_revert.- Plus the existing phase-transition tests, all green.
Flagged for a supervised run
- VM e2e v1→v2→v3 with controlled failures (
harmony-fleet-e2e/src/vm). Needs multiple image tags reachable from the KVM guest, like Ch4's upgrade e2e. The convergence logic is unit-proven against the fake runtime; this is the on-hardware integration proof. The existingvm_deploy_lifecycle.rscovers single-deploy. - Surfacing
container_idsin the dashboard device view (data is already onDeploymentState).