All checks were successful
Run Check Script / check (pull_request) Successful in 2m36s
Fixes the 30s redeploy loop and adds graceful deployment upgrades — same root: how the agent identifies a running container vs. its desired spec. Redeploy-loop fix (container identity by id/name, not spec compare): - `ensure_service_running` is now liveness-only — a running container is a NOOP, no spec comparison (podman ps can't read env/volumes, which made the old matches_spec recreate env-bearing services every tick). The periodic tick adopts-or-restarts by name and never recreates a healthy container. - Spec changes are detected by the reconciler's existing byte-compare of the desired-state JSON, not by the runtime. The agent records each container's id (from start_service) in DeploymentState.container_ids. - Deleted the matches_spec FIXME and its always-drift hack. Graceful roll-forward upgrade: - PodmanV0Score.lifecycle: Option<LifecyclePolicy> (SIGTERM, 30s grace, SIGKILL fallback). stop_signal baked into the container at create; grace drives `podman stop --time`. - On a changed score the reconciler stops the exact old container by its recorded id (graceful), then starts the new one; dropped services are stopped. Roll-forward only — a failure reports Phase::Failed, never reverts. New ContainerRuntime methods: start_service (→id), container_status, stop_service (graceful). Reconciler is now generic over `dyn ContainerRuntime`, unit-tested against a FakeRuntime: tick-idempotency (loop killed), graceful-replace-by-id, roll-forward-no-revert, unchanged-noop. Architecture + flagged VM v1->v2->v3 e2e in ROADMAP/fleet_platform/ch5-graceful-upgrade-status.md.
67 lines
3.4 KiB
Markdown
67 lines
3.4 KiB
Markdown
# Ch5 — Graceful deployment upgrade (roll-forward) + redeploy-loop fix: status
|
|
|
|
Two things shipped together because they share the same root: **how the agent
|
|
identifies a running container vs. its desired spec**.
|
|
|
|
## The redeploy-loop fix (the bug JG flagged)
|
|
|
|
`podman ps` doesn't surface env/volumes, so the old `matches_spec` declared
|
|
*any* service with env or volumes "drifted" and re-created it **every 30s tick** —
|
|
flapping container ids, connectivity blips, log noise.
|
|
|
|
Fix (per JG: "use container IDs, store them in NATS"): identity is the
|
|
**container, by name+id**, not a spec comparison.
|
|
- `ensure_service_running` is now **liveness-only**: a running container is a
|
|
NOOP, no spec compare. The periodic tick adopts-or-(re)starts by name and
|
|
never recreates a healthy container. The loop is gone — proven by
|
|
`tick_is_idempotent_no_recreate` (a service *with env*, the old loop trigger).
|
|
- Spec changes are detected where they always were — the reconciler's
|
|
byte-compare of the desired-state JSON on a KV Put — not by the runtime.
|
|
- The agent records each container's **id** (from `start_service`) in the cache
|
|
and in `DeploymentState.container_ids` (surfaced for the dashboard).
|
|
|
|
## Graceful roll-forward upgrade
|
|
|
|
- `PodmanV0Score.lifecycle: Option<LifecyclePolicy>` (`stop_signal=SIGTERM`,
|
|
`grace_period_secs=30`). The stop signal is baked into the container at
|
|
create; the grace period drives `podman stop --time`, after which podman
|
|
always SIGKILLs — so there's no `sigkill_fallback` knob to pretend otherwise
|
|
(a config field nothing honors is a defect per AGENTS.md).
|
|
- On a changed score the reconciler `converge`s: for each service, **stop the
|
|
exact old container by its recorded id** (SIGTERM → grace → SIGKILL), then
|
|
start the new one, record the new id. Services dropped from the spec are
|
|
stopped. **Roll-forward only**: a failure reports `Phase::Failed` and is never
|
|
reverted (the customer edits the spec) — proven by
|
|
`roll_forward_failure_reports_failed_without_revert`.
|
|
- The transient cutover window reports `Phase::Pending` (its documented "in
|
|
flight" meaning) → `Running`, so the dashboard reflects the step without a new
|
|
Phase variant.
|
|
|
|
## Architecture note
|
|
|
|
The graceful-replace orchestration lives in the **agent reconciler** (which
|
|
holds the cross-reconcile state: cached score + container ids), driving the
|
|
`ContainerRuntime` capability directly. The `PodmanV0Score` interpret stays the
|
|
"ensure running" path for non-agent/initial converge. `container_spec` is the
|
|
one place both share the `PodmanService → ContainerSpec` mapping. The reconciler
|
|
is now generic over `dyn ContainerRuntime`, so the convergence logic is unit-
|
|
tested against a `FakeRuntime` (no podman socket).
|
|
|
|
## Tested
|
|
|
|
- `tick_is_idempotent_no_recreate` — the loop-fix.
|
|
- `changed_score_gracefully_replaces_by_id` — stop-old-by-id then start-new.
|
|
- `unchanged_score_is_a_noop`.
|
|
- `roll_forward_failure_reports_failed_without_revert`.
|
|
- Plus the existing phase-transition tests, all green.
|
|
|
|
## Flagged for a supervised run
|
|
|
|
- **VM e2e v1→v2→v3 with controlled failures** (`harmony-fleet-e2e/src/vm`).
|
|
Needs multiple image tags reachable from the KVM guest, like Ch4's upgrade
|
|
e2e. The convergence logic is unit-proven against the fake runtime; this is
|
|
the on-hardware integration proof. The existing `vm_deploy_lifecycle.rs`
|
|
covers single-deploy.
|
|
- Surfacing `container_ids` in the dashboard device view (data is already on
|
|
`DeploymentState`).
|