Jean-Gabriel Gill-Couture 51b39505bb

docs(adr): reframe desired-state ADR as proposal and explore alternatives

Re-frame ADR-021 from an accepted shell-executor decision into an
explicit problem statement plus one candidate proposal (Alternative A),
with an Open Questions section capturing the concerns raised during
review: wrong abstraction level, no idempotency, no resource model, no
typed status, incoherence with the Score-Topology-Interpret pattern,
and weak security posture.

Add ADR-022 enumerating four alternatives:
- A: shell command executor (current scaffold)
- B: mini-kubelet with typed resource manifests and reconcilers
- C: embedded Score interpreter on the agent
- D: hybrid — typed manifests now, Scores later

Recommends Alternative D: ship typed AgentManifest/AgentStatus with a
small fixed reconciler set for the IoT MVP, keeping an explicit
migration seam to the Score-based end state once Scores become
uniformly wire-serializable.

Also documents what specifically is wrong with the happy-path shell
executor in harmony_agent/src/desired_state.rs and clarifies that the
NATS KV watch + typed CAS write skeleton is reusable, while the
execute_command shell-out should be gated behind an audited ShellJob
variant or deleted once real reconcilers land.

2026-04-10 07:13:38 -04:00

ADR-021: Agent Desired-State Convergence — Problem Statement and Initial Proposal

Status: Proposed (under review; see ADR-022 for alternatives)
Date: 2026-04-09

This document was originally drafted as an "Accepted" ADR describing a shell-command executor. On review, the team was not convinced that the shell-executor shape is the right one. It has been re-framed as a problem statement + one candidate proposal (Alternative A). Alternative designs — including a mini-kubelet model and an embedded-Score model — are explored in ADR-022. A final decision has not been made.

Context

The Harmony Agent (ADR-016) currently handles a single use case: PostgreSQL HA failover via DeploymentConfig::FailoverPostgreSQL. For the IoT fleet management platform (Raspberry Pi clusters deployed in homes, offices, and community spaces), we need the agent to become a general-purpose desired-state convergence engine.

Concretely, the central Harmony control plane must be able to:

1. Express the desired state of an individual Pi (or a class of Pis) in a typed, serializable form.
2. Ship that desired state to the device over the existing NATS JetStream mesh (ADR-017-1).
3. Have the on-device agent reconcile toward it idempotently, observably, and without manual intervention.
4. Read back an authoritative, typed actual state so the control plane can report convergence, surface errors, and drive a fleet dashboard.

The existing heartbeat / failover machinery (ADR-017-3) remains valuable — it proves the agent can maintain persistent NATS connections, do CAS writes against KV, and react to state changes. Whatever desired-state mechanism we add extends that foundation rather than replacing it.
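For readers unfamiliar with the CAS idiom mentioned above, the following is a minimal in-memory sketch of revision-based compare-and-set. It does not model the NATS KV API; `InMemoryKv`, `CasError`, and `compare_and_set` are hypothetical names, and the convention that revision 0 means "key must not exist yet" is an assumption made for this illustration.

```rust
use std::collections::HashMap;

/// Error returned when the caller's view of the key is stale.
#[derive(Debug, PartialEq, Eq)]
pub enum CasError {
    RevisionMismatch { expected: u64, actual: u64 },
}

/// Hypothetical in-memory stand-in for a revisioned KV bucket.
#[derive(Default)]
pub struct InMemoryKv {
    // key -> (revision, value)
    entries: HashMap<String, (u64, Vec<u8>)>,
}

impl InMemoryKv {
    /// Returns the current revision and value for `key`, if present.
    pub fn get(&self, key: &str) -> Option<(u64, &[u8])> {
        self.entries.get(key).map(|(rev, v)| (*rev, v.as_slice()))
    }

    /// Writes `value` only if the stored revision equals `expected_revision`.
    /// Assumption for this sketch: revision 0 means "the key does not exist".
    pub fn compare_and_set(
        &mut self,
        key: &str,
        expected_revision: u64,
        value: Vec<u8>,
    ) -> Result<u64, CasError> {
        let actual = self.entries.get(key).map(|(rev, _)| *rev).unwrap_or(0);
        if actual != expected_revision {
            return Err(CasError::RevisionMismatch {
                expected: expected_revision,
                actual,
            });
        }
        let next = actual + 1;
        self.entries.insert(key.to_string(), (next, value));
        Ok(next)
    }
}
```

A writer that loses the race gets `RevisionMismatch`, re-reads the key, and decides whether its write is still relevant. This prevents two writers from silently overwriting each other.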

Design forces

Coherence with the rest of Harmony. Harmony's entire identity is Score-Topology-Interpret with compile-time safety. A desired-state mechanism that reintroduces stringly-typed, runtime-validated blobs on the edge would be a regression from our own design rules (see CLAUDE.md: "Capabilities are industry concepts, not tools", "Scores encapsulate operational complexity", "Scores must be idempotent").
The "mini-kubelet" framing. The team is converging on a mental model where the agent is a stripped-down kubelet: it owns a set of local reconcilers, maintains a PLEG-like state machine per managed resource, and converges toward a declarative manifest. ADR-017-3 is already explicitly Kubernetes-inspired for staleness detection. This framing should inform the desired-state design, not fight it.
Speed to IoT MVP. We need something shippable soon enough that real Pi fleets can be demoed. Over-engineering the v1 risks never shipping; under-engineering it risks a rewrite once the wrong abstraction is entrenched on hundreds of devices in the field.
Security. Whatever lands on the device is, by construction, running with the agent's privileges. A mechanism that reduces to "run this shell string as root" is a very wide blast radius.
Serializability. Today, Harmony Scores are not uniformly serializable across the wire — many hold trait objects, closures, or references to live topologies. Any design that assumes "just send a Score" needs to confront this.

Initial Proposal (Alternative A — Shell Command Executor)

This is the first-pass design, implemented as a happy-path scaffold on this branch. It is presented here for critique, not as a settled decision.

Desired-State Model

Each agent watches a NATS KV key desired-state.<agent-id> for its workload definition. When the value changes, the agent executes the workload and reports the result to actual-state.<agent-id>. This is a pull-based convergence loop: the control plane writes intent, the agent converges, the control plane reads the result.

A DesiredState is a serializable description of what should be running on the device. For this first iteration, it is a shell command plus a monotonic generation counter.

enum DeploymentConfig {
    FailoverPostgreSQL(FailoverCNPGConfig),  // existing
    DesiredState(DesiredStateConfig),         // new
}

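struct DesiredStateConfig {
    command: String,   // shell command the agent executes
    generation: u64,   // monotonic counter identifying this revision of the desired state
}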
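One way the agent might use the generation counter is to skip values it has already acted on, for example when a watch redelivers the same entry. The sketch below assumes that generations only ever increase and that the agent remembers the last one it executed. The ADR does not state either of these, and `GenerationGate` and `should_execute` are hypothetical names.

```rust
/// Mirrors the fields of the ADR's DesiredStateConfig.
#[derive(Debug, Clone, PartialEq)]
pub struct DesiredStateConfig {
    pub command: String,
    pub generation: u64,
}

/// Hypothetical helper: remembers the newest generation the agent has acted on.
#[derive(Debug, Default)]
pub struct GenerationGate {
    last_seen: Option<u64>,
}

impl GenerationGate {
    /// Returns true only when `desired` is strictly newer than anything
    /// seen before, and records it as the new high-water mark.
    pub fn should_execute(&mut self, desired: &DesiredStateConfig) -> bool {
        match self.last_seen {
            Some(seen) if desired.generation <= seen => false,
            _ => {
                self.last_seen = Some(desired.generation);
                true
            }
        }
    }
}
```

A gate like this only suppresses duplicate or out-of-order deliveries. It does not make the command itself idempotent.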

Config Flow

  Central Platform                    NATS JetStream                  Agent (Pi)
  ================                    ==============                  ==========
  1. Write desired state  ------->  KV: desired-state.<agent-id>
                                                                    2. Watch detects change
                                                                    3. Execute workload
                                                                    4. Write result  -------->  KV: actual-state.<agent-id>
  5. Read actual state    <-------  KV: actual-state.<agent-id>

The agent's heartbeat loop continues independently. The desired-state watcher runs as a separate async task, sharing the same NATS connection. This separation means a slow command execution does not block heartbeats.
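The separation described above can be illustrated without any NATS dependency. The sketch below uses OS threads rather than async tasks (a deliberate substitution to keep it self-contained) and simulates both the heartbeat and the slow command with sleeps. The function name and its parameters are hypothetical.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Runs a simulated slow command on one thread while a heartbeat loop ticks
/// on another. Returns how many heartbeats were emitted while the command
/// was still running.
pub fn heartbeats_during_slow_command(
    command_time: Duration,
    heartbeat_interval: Duration,
) -> usize {
    let beats = Arc::new(AtomicUsize::new(0));
    let stop = Arc::new(AtomicBool::new(false));

    let heartbeat = {
        let beats = Arc::clone(&beats);
        let stop = Arc::clone(&stop);
        thread::spawn(move || {
            while !stop.load(Ordering::SeqCst) {
                // Stand-in for publishing a heartbeat.
                beats.fetch_add(1, Ordering::SeqCst);
                thread::sleep(heartbeat_interval);
            }
        })
    };

    // Stand-in for the desired-state watcher executing a workload.
    let watcher = thread::spawn(move || thread::sleep(command_time));

    watcher.join().expect("watcher thread panicked");
    let observed = beats.load(Ordering::SeqCst);
    stop.store(true, Ordering::SeqCst);
    heartbeat.join().expect("heartbeat thread panicked");
    observed
}
```

If the command ran inline in the heartbeat loop, no heartbeat would be emitted until it returned. With separate tasks, the count observed during the command is greater than zero.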

State Reporting

struct ActualState {
    agent_id: Id,
    generation: u64,          // mirrors the desired-state generation
    status: ExecutionStatus,   // Success, Failed, Running
    stdout: String,
    stderr: String,
    exit_code: Option<i32>,
    executed_at: u64,
}

The control plane reads this key to determine convergence. If actual_state.generation == desired_state.generation and status == Success, the device has converged.
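The convergence rule above can be written as a single predicate. This is a sketch only: the struct fields mirror the ADR's definitions, `agent_id` is a plain `String` standing in for the `Id` type (whose definition is not shown here), and `has_converged` is a hypothetical function name.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExecutionStatus {
    Success,
    Failed,
    Running,
}

#[derive(Debug, Clone)]
pub struct DesiredStateConfig {
    pub command: String,
    pub generation: u64,
}

#[derive(Debug, Clone)]
pub struct ActualState {
    pub agent_id: String, // stand-in for the ADR's `Id` type (assumption)
    pub generation: u64,
    pub status: ExecutionStatus,
    pub stdout: String,
    pub stderr: String,
    pub exit_code: Option<i32>,
    pub executed_at: u64,
}

/// True when the agent reports success for exactly the generation
/// the control plane asked for.
pub fn has_converged(desired: &DesiredStateConfig, actual: &ActualState) -> bool {
    actual.generation == desired.generation && actual.status == ExecutionStatus::Success
}
```

A `Success` reported for an older generation does not count as converged, so a dashboard built on this predicate shows a device as pending from the moment new intent is written.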

Why this shape was chosen first

- Dirt cheap to implement (≈200 lines, done on this branch).
- Works for literally any task a human would type into a Pi shell.
- Reuses the existing NATS KV infrastructure and CAS write idiom already proven by the heartbeat loop.
- Provides an end-to-end demo path in under a day.

Open Questions and Concerns

The following concerns block promoting this to an "Accepted" decision:

Wrong abstraction level. sh -c "<string>" is the opposite of what Harmony stands for. Harmony exists because IaC tools drown in stringly-typed, runtime-validated config. Shipping arbitrary shell to the edge recreates that problem inside our own agent — at the worst possible place (the device).
No idempotency. systemctl start foo and apt install foo are not idempotent by themselves. Every Score in Harmony is required to be idempotent. A shell executor pushes that burden onto whoever writes the commands, where we cannot check it.
No resource model. There is no notion of "this manifest owns this systemd unit". When desired state changes, we cannot compute a diff, we cannot garbage-collect the old resource, and we cannot surface "drift" meaningfully. We know generation N was "run"; we do not know what it left behind.
No typed status. stdout/stderr/exit_code is not enough to drive a fleet dashboard. We want typed Status { container: Running { since, restarts }, unit: Active, file: PresentAt(sha256) }.
No lifecycle. Shell commands are fire-and-forget. A kubelet-shaped agent needs to know whether a resource is still healthy after it was created — liveness and readiness are first-class concerns, not a post-hoc exit_code check.
Security. The original draft of this ADR hand-waved "NATS ACLs + future signing". In practice, v1 lets anyone with write access to the KV bucket execute anything as the agent user. Even with NATS ACLs, the shape of the API invites abuse; a typed manifest with an allowlist of resource types has a much narrower attack surface by construction.
Generational model is too coarse. A single generation: u64 per agent means we can only describe one monolithic "job". Real fleet state is a set of resources (this container, this unit, this file). We need per-resource generations, or a manifest-level generation with a sub-resource status map.
Incoherent with ADR-017-3's kubelet framing. That ADR deliberately borrowed K8s vocabulary (staleness, fencing, leader promotion) because kubelet-like semantics are the right ones for resilient edge workloads. Shell-exec abandons that lineage at the first opportunity.
Coherence with the Score-Topology-Interpret pattern. Today's proposal introduces a parallel concept ("DesiredStateConfig") that has nothing to do with Score or Topology. If a Pi is just "a topology with a small capability set" (systemd, podman, files, network), then the right thing to ship is a Score, not a shell string.
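To make the "no resource model", "no typed status", and "generational model" concerns concrete, here is a rough sketch of a manifest-level generation with a per-resource status map. `AgentStatus` is the name used in the commit message for ADR-022. Every field, variant, and function below is an illustrative assumption, not the design ADR-022 specifies.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Typed per-resource status, loosely following the shape suggested above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ResourceStatus {
    ContainerRunning { since_unix_secs: u64, restarts: u32 },
    UnitActive,
    FilePresentAt { sha256: String },
    Failed { reason: String },
}

impl ResourceStatus {
    pub fn is_healthy(&self) -> bool {
        !matches!(self, ResourceStatus::Failed { .. })
    }
}

/// Manifest-level generation plus one status entry per named resource.
#[derive(Debug, Clone, Default)]
pub struct AgentStatus {
    pub manifest_generation: u64,
    pub resources: BTreeMap<String, ResourceStatus>,
}

impl AgentStatus {
    /// Converged when the generation matches and every desired resource
    /// is present in the status map and reported healthy.
    pub fn converged(&self, desired_generation: u64, desired: &BTreeSet<String>) -> bool {
        self.manifest_generation == desired_generation
            && desired.iter().all(|name| {
                self.resources
                    .get(name)
                    .map_or(false, ResourceStatus::is_healthy)
            })
    }
}

/// Resources owned by the previous manifest but absent from the next one:
/// these are the candidates for garbage collection.
pub fn orphaned(previous: &BTreeSet<String>, next: &BTreeSet<String>) -> Vec<String> {
    previous.difference(next).cloned().collect()
}
```

With named resources, computing what a manifest change leaves behind is a set difference, which a bare shell string does not allow.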

Status of the Implementation on this Branch

The happy-path code in harmony_agent/src/desired_state.rs (≈250 lines, fully tested) implements Alternative A. It is scaffolding, not a committed design:

- It is useful as a vehicle to prove out the NATS KV watch + typed ActualState CAS write pattern, both of which are reusable regardless of which alternative we pick.
- It should not be wired into user-facing tooling until the architectural decision in ADR-022 is made.
- If we adopt Alternative B (mini-kubelet) or C (embedded Scores), the shell executor either becomes one variant of a typed Resource enum (a ShellJob resource, clearly labeled as an escape hatch) or is deleted outright.
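As an illustration of the "ShellJob as an escape hatch" idea, the sketch below models a typed `Resource` enum whose variants are admitted only if their kind is on a per-device allowlist. Only the names `Resource` and `ShellJob` come from the text above; the other variants, `ResourceKind`, `Rejected`, and `admit` are hypothetical.

```rust
use std::collections::BTreeSet;

/// Coarse resource category used for allowlisting.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum ResourceKind {
    SystemdUnit,
    Container,
    File,
    ShellJob,
}

/// Typed resources an agent could reconcile (illustrative set).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Resource {
    SystemdUnit { name: String },
    Container { name: String, image: String },
    File { path: String, sha256: String },
    /// Escape hatch: arbitrary shell, expected to be disabled by default.
    ShellJob { command: String },
}

impl Resource {
    pub fn kind(&self) -> ResourceKind {
        match self {
            Resource::SystemdUnit { .. } => ResourceKind::SystemdUnit,
            Resource::Container { .. } => ResourceKind::Container,
            Resource::File { .. } => ResourceKind::File,
            Resource::ShellJob { .. } => ResourceKind::ShellJob,
        }
    }
}

/// Returned when a resource's kind is not on the allowlist.
#[derive(Debug, PartialEq, Eq)]
pub struct Rejected(pub ResourceKind);

/// Admits a resource only if its kind is explicitly allowed.
pub fn admit(allowed: &BTreeSet<ResourceKind>, resource: &Resource) -> Result<(), Rejected> {
    let kind = resource.kind();
    if allowed.contains(&kind) {
        Ok(())
    } else {
        Err(Rejected(kind))
    }
}
```

Under this shape, running shell on a device requires an explicit opt-in on that device instead of being the default behavior of the API.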

Next Steps

1. Review ADR-022 (alternatives + recommendation).
2. Pick a target design.
3. Either:
   - Rework desired_state.rs to match the chosen target, or
   - Keep it behind a feature flag as a demo fallback while the real design is built.
4. Re-file this ADR as "Superseded by ADR-022" or update it in place with the accepted design.
