# ADR-021: Agent Desired-State Convergence — Problem Statement and Initial Proposal

**Status:** Proposed (under review — see ADR-022 for alternatives)

**Date:** 2026-04-09

> This document was originally drafted as an "Accepted" ADR describing a shell-command executor. On review, the team was not convinced that the shell-executor shape is the right one. It has been re-framed as a **problem statement + one candidate proposal (Alternative A)**. Alternative designs — including a mini-kubelet model and an embedded-Score model — are explored in [ADR-022](./022-agent-desired-state-alternatives.md). A final decision has **not** been made.

## Context
The Harmony Agent (ADR-016) currently handles a single use case: PostgreSQL HA failover via `DeploymentConfig::FailoverPostgreSQL`. For the IoT fleet management platform (Raspberry Pi clusters deployed in homes, offices, and community spaces), we need the agent to become a general-purpose desired-state convergence engine.
Concretely, the central Harmony control plane must be able to:
1. Express the *desired state* of an individual Pi (or a class of Pis) in a typed, serializable form.
2. Ship that desired state to the device over the existing NATS JetStream mesh (ADR-017-1).
3. Have the on-device agent reconcile toward it — idempotently, observably, and without manual intervention.
4. Read back an authoritative, typed *actual state* so the control plane can report convergence, surface errors, and drive a fleet dashboard.

The existing heartbeat / failover machinery (ADR-017-3) remains valuable — it proves the agent can maintain persistent NATS connections, do CAS writes against KV, and react to state changes. Whatever desired-state mechanism we add **extends** that foundation rather than replacing it.
### Design forces
- **Coherence with the rest of Harmony.** Harmony's entire identity is Score-Topology-Interpret with compile-time safety. A desired-state mechanism that reintroduces stringly-typed, runtime-validated blobs on the edge would be a regression from our own design rules (see `CLAUDE.md`: "Capabilities are industry concepts, not tools", "Scores encapsulate operational complexity", "Scores must be idempotent").

- **The "mini-kubelet" framing.** The team is converging on a mental model where the agent is a stripped-down kubelet: it owns a set of local reconcilers, maintains a PLEG-like state machine per managed resource, and converges toward a declarative manifest. ADR-017-3 is already explicitly Kubernetes-inspired for staleness detection. This framing should inform the desired-state design, not fight it.

- **Speed to IoT MVP.** We need something shippable soon enough that real Pi fleets can be demoed. Over-engineering the v1 risks never shipping; under-engineering it risks a rewrite once the wrong abstraction is entrenched on hundreds of devices in the field.

- **Security.** Whatever lands on the device is, by construction, running with the agent's privileges. A mechanism that reduces to "run this shell string as root" is a very wide blast radius.

- **Serializability.** Today, Harmony Scores are *not* uniformly serializable across the wire — many hold trait objects, closures, or references to live topologies. Any design that assumes "just send a Score" needs to confront this.

## Initial Proposal (Alternative A — Shell Command Executor)
This is the first-pass design, implemented as a happy-path scaffold on this branch. **It is presented here for critique, not as a settled decision.**
### Desired-State Model
Each agent watches a NATS KV key `desired-state.<agent-id>` for its workload definition. When the value changes, the agent executes the workload and reports the result to `actual-state.<agent-id>`. This is a pull-based convergence loop: the control plane writes intent, the agent converges, the control plane reads the result.
A `DesiredState` is a serializable description of what should be running on the device. For this first iteration, it is a shell command plus a monotonic generation counter.
```rust
enum DeploymentConfig {
    FailoverPostgreSQL(FailoverCNPGConfig), // existing
    DesiredState(DesiredStateConfig),       // new
}

struct DesiredStateConfig {
    command: String,
    generation: u64,
}
```
### Config Flow
```
Central Platform               NATS JetStream                  Agent (Pi)
================               ==============                  ==========
1. Write desired state ------> KV: desired-state.<agent-id>
                                                               2. Watch detects change
                                                               3. Execute workload
                               KV: actual-state.<agent-id> <-- 4. Write result
5. Read actual state <-------- KV: actual-state.<agent-id>
```
The agent's heartbeat loop continues independently. The desired-state watcher runs as a separate async task, sharing the same NATS connection. This separation means a slow command execution does not block heartbeats.
### State Reporting
```rust
struct ActualState {
    agent_id: Id,
    generation: u64,         // mirrors the desired-state generation
    status: ExecutionStatus, // Success, Failed, Running
    stdout: String,
    stderr: String,
    exit_code: Option<i32>,
    executed_at: u64,
}
```
The control plane reads this key to determine convergence. If `actual_state.generation == desired_state.generation` and `status == Success`, the device has converged.
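That check is, in essence, a pure predicate. A minimal sketch, using a version of `ActualState` simplified to only the fields the check needs:

```rust
// Simplified from the ActualState struct above: only the fields the
// convergence check needs.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum ExecutionStatus {
    Success,
    Failed,
    Running,
}

struct ActualState {
    generation: u64,
    status: ExecutionStatus,
}

// Converged = the agent has applied the latest desired generation and
// reported success; anything else is pending, stale, or failed.
fn has_converged(actual: &ActualState, desired_generation: u64) -> bool {
    actual.generation == desired_generation && actual.status == ExecutionStatus::Success
}

fn main() {
    let actual = ActualState { generation: 4, status: ExecutionStatus::Success };
    assert!(has_converged(&actual, 4));
    assert!(!has_converged(&actual, 5)); // control plane already wrote generation 5
}
```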
### Why this shape was chosen first
- Dirt cheap to implement (≈200 lines, done on this branch).
- Works for literally any task a human would type into a Pi shell.
- Reuses the existing NATS KV infrastructure and CAS write idiom already proven by the heartbeat loop.
- Provides an end-to-end demo path in under a day.

## Open Questions and Concerns
The following concerns block promoting this to an "Accepted" decision:
1. **Wrong abstraction level.** `sh -c "<string>"` is the *opposite* of what Harmony stands for. Harmony exists because IaC tools drown in stringly-typed, runtime-validated config. Shipping arbitrary shell to the edge recreates that problem inside our own agent — at the worst possible place (the device).

2. **No idempotency.** `systemctl start foo` and `apt install foo` are not idempotent by themselves. Every Score in Harmony is required to be idempotent. A shell executor pushes that burden onto whoever writes the commands, where we cannot check it.

3. **No resource model.** There is no notion of "this manifest owns this systemd unit". When desired state changes, we cannot compute a diff, we cannot garbage-collect the old resource, and we cannot surface "drift" meaningfully. We know generation N was "run"; we do not know what it left behind.

4. **No typed status.** `stdout`/`stderr`/`exit_code` is not enough to drive a fleet dashboard. We want typed `Status { container: Running { since, restarts }, unit: Active, file: PresentAt(sha256) }`.

5. **No lifecycle.** Shell commands are fire-and-forget. A kubelet-shaped agent needs to know whether a resource is *still* healthy after it was created — liveness and readiness are first-class concerns, not a post-hoc `exit_code` check.

6. **Security.** The ADR hand-waves "NATS ACLs + future signing". In practice, v1 lets anyone with write access to the KV bucket execute anything as the agent user. Even with NATS ACLs, the *shape* of the API invites abuse; a typed manifest with an allowlist of resource types has a much narrower attack surface by construction.

7. **Generational model is too coarse.** A single `generation: u64` per agent means we can only describe one monolithic "job". Real fleet state is a *set* of resources (this container, this unit, this file). We need per-resource generations, or a manifest-level generation with a sub-resource status map.

8. **Incoherent with ADR-017-3's kubelet framing.** That ADR deliberately borrowed K8s vocabulary (staleness, fencing, leader promotion) because kubelet-like semantics are the right ones for resilient edge workloads. Shell-exec abandons that lineage at the first opportunity.

9. **Coherence with the Score-Topology-Interpret pattern.** Today's proposal introduces a parallel concept ("DesiredStateConfig") that has nothing to do with Score or Topology. If a Pi is just "a topology with a small capability set" (systemd, podman, files, network), then the right thing to ship is a Score, not a shell string.

## Status of the Implementation on this Branch
The happy-path code in `harmony_agent/src/desired_state.rs` (≈250 lines, fully tested) implements Alternative A. It is **scaffolding**, not a committed design:
- It is useful as a vehicle to prove out the NATS KV watch + typed `ActualState` CAS write pattern, both of which are reusable regardless of which alternative we pick.

- It should **not** be wired into user-facing tooling until the architectural decision in ADR-022 is made.

- If we adopt Alternative B (mini-kubelet) or C (embedded Scores), the shell executor either becomes one *variant* of a typed `Resource` enum (a `ShellJob` resource, clearly labeled as an escape hatch) or is deleted outright.

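If the escape-hatch route is taken, the shell executor could shrink from the whole API surface to one clearly labeled variant of a typed resource enum. A sketch, with all names hypothetical:

```rust
// Hypothetical: the shell executor demoted to one audited variant of a
// typed Resource enum, rather than being the entire API.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum Resource {
    SystemdUnit { name: String },
    File { path: String, sha256: String },
    // Escape hatch: explicit, easy to audit, easy to deny by policy.
    ShellJob { command: String },
}

// Policy code can single out the escape hatch without string matching.
fn is_escape_hatch(resource: &Resource) -> bool {
    matches!(resource, Resource::ShellJob { .. })
}

fn main() {
    let job = Resource::ShellJob { command: "apt-get update".to_string() };
    assert!(is_escape_hatch(&job));
    assert!(!is_escape_hatch(&Resource::SystemdUnit { name: "ssh.service".to_string() }));
}
```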
## Next Steps
1. Review ADR-022 (alternatives + recommendation).
2. Pick a target design.
3. Either:
   - Rework `desired_state.rs` to match the chosen target, **or**
   - Keep it behind a feature flag as a demo fallback while the real design is built.
4. Re-file this ADR as "Superseded by ADR-022" or update it in place with the accepted design.