docs(fleet): v0.3 last-mile roadmap #296
ROADMAP/fleet_platform/v0_3_plan.md
# Fleet Platform v0.3 — last-mile plan

Authoritative plan for the last mile before the fleet ships to a real
customer. Picks up where `v0_2_plan.md` left the chapter structure.
Written 2026-05-24, after `feat/iot-walking-skeleton` (#264) merged
and `feat/smoke-test-contract` landed the Phase 0 smoke companion.

**The frame:**

- **v0.1 proved the shape.**
- **v0.2 locked the brick design.**
- **v0.3 makes the brick safe to hand to a customer** running production workloads on Pis in their basement.

## State coming in

- **IoT walking skeleton merged.** Operator + agent + NATS + Zitadel + auth callout running end-to-end against an OKD staging cluster.
- **Smoke-test contract Phase 0 merged** (`feat/smoke-test-contract`).
  - `Probe` / `SmokeSuite` / `SmokeTest` companion + `deploy_with_smoke` in `harmony-fleet-deploy/src/companion/smoke/`.
  - One concrete probe today: `TcpReachable`.
  - No fleet Score wired to a real smoke test yet — Phase 1 is in this roadmap.
- **Agent runs as a systemd user unit on devices** (see `harmony/src/modules/fleet/setup_score.rs:263–283`).
  - No on-device containerized agent path.
  - The Dockerfile in `fleet/harmony-fleet-agent/Dockerfile` is k8s-only today.
- **Dashboard has no role enforcement — security gap.**
  - Maud/htmx frontend at `fleet/harmony-fleet-operator/src/frontend/server.rs`.
  - Verifies Zitadel JWT signature + expiry only.
  - `JwksCache::verify` (`harmony_zitadel_auth/src/jwks.rs:74`) extracts `sub`/`exp`/`email`/`name`/`nonce` — no roles.
  - `VerifiedSession` has no `roles` field.
  - Any logged-in Zitadel user gets full dashboard access. Fix immediately (Chapter 1).
- **NATS callout already has the role-extraction logic we need.**
  - `ZitadelValidator::extract_roles` at `nats/callout/src/zitadel.rs:203`.
  - Handles both array shape (`["fleet-admin"]`) and Zitadel's object-map shape (`{"fleet-admin": {org_id: org_name}}`).
  - `roles::resolve` maps role names to `ResolvedRole::Admin`/`::Device` with admin-wins precedence (a token carrying both roles resolves to `Admin`).
  - Chapter 1 reuses the extractor, not the role-to-NATS-permission half.
- **System upgrade ADR drafted** at `docs/adr/drafts/Fleet-IoT-Device-System-Upgrade-With-Rollback.md`.
  - Header says Accepted 2026-05-24 but lives under `drafts/`.
  - Authoritative status: approach agreed, rollback half deferred (Chapter 7).

## Customer constraints baked into this plan

- **Deployments are roll-forward only.** No auto-rollback when a new Deployment version fails. Dashboard surfaces the failure; customer edits the spec and rolls forward. Customer ask; may change later, not in v0.3.
- **System rollback is deferred to v0.4.** v0.3 implements *upgrade* per the ADR; the LVM-snapshot rollback half waits until we've shipped something to production.
- **Secrets need Zitadel + OpenBao.** No plaintext-env-var shortcut. `harmony_secret` + OpenBao work is on the critical path for any Deployment that needs credentials.

## Feature checklist

Status legend: ✅ shipped · 🟡 in flight · 🔴 not started · ⏸ deferred (target version in note).

| #  | Feature                                                  | Status  | Owner / branch                  | Notes |
|----|----------------------------------------------------------|---------|---------------------------------|-------|
| 1  | Dashboard role enforcement (`fleet-admin` required)      | 🔴      | next branch                     | Reuse `ZitadelValidator::extract_roles`. Do this **right now** — security gap. |
| 2  | Operator restart / aggregator cold-rebuild               | 🔴      | next branch                     | More critical than smoke wiring; ship before any customer. |
| 3  | Deployment `getLogs` companion + dashboard log view      | 🔴      | next branch                     | "Makes dashboard useful rather than a toy." Score companion shape. |
| 4  | Agent self-upgrade (NATS-coordinated, systemd-resident)  | 🔴      | new branch                      | Marker lives in NATS, not on disk. Systemd stays. |
| 5  | Graceful deployment upgrade (roll-forward only)          | 🔴      | new branch                      | SIGTERM → grace → SIGKILL fallback → start new. No rollback. |
| 6  | Init containers in `PodmanV0Score`                       | 🔴      | new branch                      | Ordered, run-to-completion, customer guarantees idempotency. |
| 7  | System upgrade (no rollback yet)                         | 🔴      | new branch                      | Per drafted ADR, minus the LVM-snapshot rollback half. |
| 8  | Secrets via Zitadel + OpenBao for Deployments            | ⏸ v0.3.x | blocked on `harmony_secret`    | Required for production; does not block the first customer unless their Deployment needs credentials. |
| 9  | Agent time-drift verification                            | 🔴      | new branch                      | Periodic NTP check; refuse JWT operations if skewed. |
| 10 | Phase 1 smoke wiring (HTTP / K8sPodReady / NatsKv probes) | 🔴     | new branch                      | After required features land. Not a functional blocker. |
| 11 | CI yaml minimization (logic into `harmony-ci` scripts)   | ⏸ v0.4 | longer-term                     | Yaml stays for discovery + parallel viz; scripts move. |
| 12 | NATS callout CI hardening                                | ⏸      | low-churn crate                 | Already covered by workspace `cargo test`. Run ignored tests when CI has podman + NATS image. |
| 13 | Application log streaming through NATS                   | ⏸ v0.4 | follow-on to #3                 | #3 is the synchronous `getLogs`; this is the live tail. |
| 14 | Deployment dependencies (`after: [...]`)                 | ⏸      | not chosen                      | Init containers (#6) cover the in-deployment case; defer until customers ask. |

## Sequencing

| Order | Item                                              | Why                                                                                              |
|-------|---------------------------------------------------|--------------------------------------------------------------------------------------------------|
| 1     | #1 Dashboard role enforcement                     | Security gap, do right now.                                                                      |
| 2     | #2 Operator restart recovery                      | More critical than smoke wiring. Customer can't tolerate "operator restarted, state unknown."    |
| 3     | #3 Log forwarding companion                       | Turns the dashboard from a toy into a thing customers actually use.                              |
| 4     | #4 Agent self-upgrade                             | Parallel-safe with #2/#3 — different code paths.                                                 |
| 5     | #5 + #6 Graceful upgrade + init containers        | Paired Deployment-layer features; ship together.                                                 |
| 6     | #9 Time-drift verification                        | Small, isolated; slot between heavier items.                                                     |
| 7     | #7 System upgrade                                 | Builds on agent-upgrade pattern from #4 — #4 lands first.                                        |
| 8     | #10 Phase 1 smoke wiring                          | After required features so probes verify real customer-facing surfaces.                          |
| 9     | #8 Secrets                                        | Blocks any customer Deployment that needs credentials. Promote if first customer needs them.     |
| 10    | #11 / #12 CI                                      | Opportunistic, doesn't block customer.                                                           |

---

## Chapter 1 — Dashboard role enforcement (#1)

**Goal:** every dashboard page requires a valid Zitadel session **and** a `fleet-admin` role on the token.
- Users without the role get a 403 with a clear message.
- Users without a session get the existing login redirect.

### Current state

- **JWKS verify only extracts identity claims.** `JwksCache::verify` (`harmony_zitadel_auth/src/jwks.rs:74`) parses the JWT and returns a `VerifiedSession` with `sub`/`exp`/`email`/`name`/`nonce`. Roles not extracted.
- **`VerifiedSession` has no `roles` field** (`harmony_zitadel_auth/src/session.rs:5`).
- **Middleware checks JWT validity only.** `require_auth` (`fleet/harmony-fleet-operator/src/frontend/server.rs:136–157`). Every authenticated user gets all pages.
- **Role extraction logic already exists and is correct** in the callout: `ZitadelValidator::extract_roles` at `nats/callout/src/zitadel.rs:203`. Handles both shapes:
  - array — `["fleet-admin"]`
  - object-map — `{"fleet-admin": {org_id: org_name}}`

### Plan

1. **Extract a shared role-extraction helper into `harmony_zitadel_auth`** so dashboard and callout import from one place. Callout keeps its API but its body delegates.
2. **Extend `VerifiedSession`** with `roles: Vec<String>`.
3. **Extend the JWKS `Claims` decode struct** to capture the configured roles claim. Pull the claim name from existing callout config so the two systems agree (Zitadel ships `urn:zitadel:iam:org:project:roles` or similar).
4. **Add `require_role(role: &'static str)` middleware** to the dashboard. Compose with `require_auth`. Use on every `Router::route(..., post|get(...).layer(...))`.
5. **403 response renders a maud page** — "fleet-admin role required; ask your administrator." Not a JSON error; dashboard is human-facing.
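
A minimal sketch of the shared helper and the gate decision, assuming the roles claim has already been decoded into one of the two shapes named above. `RolesClaim`, `extract_roles` as a free function, `Gate`, and `gate` are hypothetical names for illustration; they are not the existing `ZitadelValidator::extract_roles` API, and a real implementation would decode from the JWT's JSON payload.

```rust
use std::collections::BTreeMap;

/// The two claim shapes described in the plan (hypothetical stand-in type).
#[derive(Debug, Clone)]
pub enum RolesClaim {
    /// `["fleet-admin"]`
    Array(Vec<String>),
    /// `{"fleet-admin": {org_id: org_name}}`
    ObjectMap(BTreeMap<String, BTreeMap<String, String>>),
}

/// Normalize either shape into a sorted, de-duplicated list of role names.
/// A missing claim yields an empty list rather than an error.
pub fn extract_roles(claim: Option<&RolesClaim>) -> Vec<String> {
    let mut roles: Vec<String> = match claim {
        None => Vec::new(),
        Some(RolesClaim::Array(names)) => names.clone(),
        Some(RolesClaim::ObjectMap(map)) => map.keys().cloned().collect(),
    };
    roles.sort();
    roles.dedup();
    roles
}

/// Outcome of the dashboard gate for a single request.
#[derive(Debug, PartialEq, Eq)]
pub enum Gate {
    /// No verified session: fall through to the existing login redirect.
    LoginRedirect,
    /// Session exists but lacks the required role: render the 403 page.
    Forbidden,
    /// Session carries the required role.
    Allow,
}

/// `session_roles` is `None` when the request has no verified session.
pub fn gate(session_roles: Option<&[String]>, required: &str) -> Gate {
    match session_roles {
        None => Gate::LoginRedirect,
        Some(roles) if roles.iter().any(|r| r == required) => Gate::Allow,
        Some(_) => Gate::Forbidden,
    }
}
```

Keeping the decision as a pure function means the middleware only has to map `Gate` onto a redirect, a 403 page, or the wrapped handler, and each row of the test list can be checked without an HTTP server.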

### Tests

Security code — heavy unit tests are non-negotiable.

- **Array-shape claim → fleet-admin in session.** JWT with array-shape role claim.
- **Object-map shape → identical resolution.** Same role, Zitadel's other claim shape.
- **No role claim → empty roles.** Token with no `roles` claim.
- **Wrong role doesn't elevate.** JWT with only `device` role does NOT carry `fleet-admin`.
- **No session → 401/redirect.**
- **Session but no `fleet-admin` → 403.**
- **Session + `fleet-admin` → 200.**

### Done when

- Branch merged.
- All dashboard handlers gated by `require_role("fleet-admin")`.
- Every test green.
- Manual smoke against staging Zitadel: user without role sees 403.

---

## Chapter 2 — Operator restart + aggregator recovery (#2)

**Goal:** the operator pod can be killed, upgraded, or rescheduled at any time and the system converges back to correct state from NATS KV alone. No "unknown state" window visible to customers.

### Current state

- **Aggregator rebuilds from scratch on startup.** `fleet_aggregator.rs` (833 LOC, in `harmony-fleet-operator/src/`) watches the KV buckets to materialize state. JG confirmed: "rebuilt from scratch, yes."
- **Failure modes not exercised yet:**
  - Partial KV — device offline during operator reset, never re-published its info.
  - Two operator pods racing during a rolling deploy of the operator.
  - NATS stream loss between operator restart and rebuild completing.
  - Stale KV — Deployment CR deleted in kube while operator was down.

### Plan

**Scenario-driven.** Enumerate failure shapes, then handle one at a time. Discipline: each scenario gets a regression test in `harmony-fleet-e2e`, then the fix.

1. **Scenario inventory pass.** Write `docs/fleet-operator-recovery-scenarios.md` listing every failure shape we can think of. Cross-reference smoke-a* tests to identify what's already covered.
2. **Cold-start rebuild as the baseline.** Confirm and test that after `kubectl delete pod` on the operator, the replacement pod converges to the pre-kill aggregate in < 30s. Gate on convergence time at a fixed device count N.
3. **Stale-KV reconciliation.** Define the rule for "KV says device D has Deployment X, but Deployment X no longer exists in kube." Operator cleans up; agents observe the deletion.
4. **Leader election decision.** Ship with leader election (one writer at a time) or design for idempotent multi-writer? Score-Topology-Interpret leans idempotent; confirm + assert operator writes are byte-deterministic.
5. **Liveness signaling for the dashboard.** Surface "operator converged" / "operator recovering" as states the frontend renders. Customer sees a loading banner, not a blank dashboard, during rebuild.
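
The stale-KV rule in step 3 can be sketched as a pure set difference, assuming the operator can list both the KV assignments and the Deployments that still exist in kube. `KvAssignments` and `stale_entries` are hypothetical names; the real KV layout is not specified here.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical KV view: device id -> Deployment names KV says it runs.
pub type KvAssignments = BTreeMap<String, BTreeSet<String>>;

/// Return every (device, deployment) pair that KV still lists but whose
/// Deployment no longer exists in kube. Sorted containers keep the output
/// order deterministic, which matters if two operator pods ever compute
/// the same cleanup and must produce identical writes.
pub fn stale_entries(
    kv: &KvAssignments,
    live_deployments: &BTreeSet<String>,
) -> Vec<(String, String)> {
    let mut stale = Vec::new();
    for (device, deployments) in kv {
        for deployment in deployments {
            if !live_deployments.contains(deployment) {
                stale.push((device.clone(), deployment.clone()));
            }
        }
    }
    stale
}
```

The operator would then delete each returned pair from KV so agents observe the deletion; how that delete is issued is left out of this sketch.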

### Open questions

- **Warm-restart snapshot?** Keep a per-operator-pod "last known aggregate" snapshot in a KV bucket so warm restarts skip cold rebuild? Probably yes for >1000-device fleets; adds an invalidation problem.
- **One pod or active/passive?** Customer's fleet size answers this. Ask before starting.

### Done when

- Scenario inventory exists.
- Each scenario has a regression test, all green.
- Manual chaos: kill operator pod during high write load → convergence + dashboard liveness banner observed.

---

## Chapter 3 — Application log forwarding companion (#3)

**Goal:** when a customer's Deployment is misbehaving on a Pi in the field, the dashboard shows last-N-lines of container logs without anyone SSH-ing the device.

### Design

- **Logs attach as a Score companion** — same pattern as the smoke-test contract.
- **The companion is optional** — Scores without one render "this deployment doesn't expose logs". Acceptable.
- **Sync `getLogs` ships in v0.3; live tail (streaming) waits for v0.4.** Sync `getLogs` is the minimum useful UX.

Shape:

```rust
// new in harmony-fleet-deploy/src/companion/logs/
pub trait LogQuery<T: Topology>: Send + Sync {
    type Score: Score<T>;
    async fn last_lines(
        &self,
        score: &Self::Score,
        topology: &T,
        n: usize,
    ) -> Result<LogChunk, LogQueryError>;
}

pub struct LogChunk {
    pub source: ProbeName, // reuse the validated newtype
    pub captured_at: chrono::DateTime<chrono::Utc>,
    pub lines: Vec<String>,
    pub truncated: bool,
}
```

For `PodmanV0Score`:
- **Transport: NATS request/reply.** Subject `device-commands.<device_id>.logs.<deployment>`.
- **Agent side:** runs `podman logs --tail N <container>` and replies with a `LogChunk`.
- **Dashboard side:** one async call from the logs handler.
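
A rough sketch of the agent-side plumbing for the `PodmanV0Score` path above. The subject format and the `podman logs --tail` invocation come from the text; `logs_subject`, `podman_logs_args`, and `fit_to_budget` are hypothetical helper names. The meaning of `truncated` is an assumption here: this sketch sets it when older lines were dropped to keep the reply under a byte budget, since a request/reply message has to fit within the NATS server's maximum payload.

```rust
/// NATS subject for a log request, following the shape quoted in the design.
pub fn logs_subject(device_id: &str, deployment: &str) -> String {
    format!("device-commands.{device_id}.logs.{deployment}")
}

/// Argument vector the agent would pass to the `podman` binary.
pub fn podman_logs_args(container: &str, n: usize) -> Vec<String> {
    vec![
        "logs".into(),
        "--tail".into(),
        n.to_string(),
        container.into(),
    ]
}

/// Keep the newest lines that fit in `max_bytes` (counting one newline per
/// line). Returns the kept lines in original order plus a flag that is true
/// when at least one older line was dropped. ASSUMPTION: this is one possible
/// meaning of `LogChunk::truncated`; the roadmap does not define it.
pub fn fit_to_budget(lines: Vec<String>, max_bytes: usize) -> (Vec<String>, bool) {
    let mut kept = Vec::new();
    let mut used = 0usize;
    // Walk newest-to-oldest so the most recent output survives.
    for line in lines.iter().rev() {
        let cost = line.len() + 1;
        if used + cost > max_bytes {
            break;
        }
        used += cost;
        kept.push(line.clone());
    }
    kept.reverse();
    let truncated = kept.len() < lines.len();
    (kept, truncated)
}
```

Dropping from the old end keeps the lines closest to the failure, which is usually what someone debugging a misbehaving Deployment wants first.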

### Plan

1. **Define `LogQuery` companion trait** in a new `harmony-fleet-deploy/src/companion/logs/` module.
2. **`PodmanLogQuery`** implementing `LogQuery<…>` with `type Score = PodmanV0Score`.
3. **Agent-side command handler** — extend the existing request/reply command dispatcher.
4. **Dashboard handler** at `/deployments/<name>/devices/<id>/logs?lines=N` returning rendered maud.
5. **Tests:** unit on `PodmanLogQuery`; integration in `harmony-fleet-e2e` driving end-to-end.

### Done when

- Customer clicks "View logs" on the dashboard.
- Sees the last 200 lines.
- Call returns in < 2s on a 3-device fleet.

---

## Chapter 4 — Agent self-upgrade, NATS-coordinated (#4)

**Goal:** the agent can upgrade itself in place. If NATS is unavailable, the upgrade does not start. The operator sees every step.

### Design (per JG's direction)

- **Stay on systemd for v0.3.** Switching the agent runtime to podman is its own risk; defer until self-upgrade protocol matures.
- **Upgrade marker lives in NATS, not on disk.** New KV bucket `agent-upgrade` keyed by `device_id`, carrying `start_timestamp`, `invoker_version`, `target_version`, `phase`.
- **No NATS → no upgrade.** Feature, not limitation: operator can't observe an upgrade it can't see, so refusing without NATS prevents silent half-upgrades.

### Protocol

1. **Operator writes `Requested`.** `agent-upgrade/<device_id>` with `phase: Requested, target_version: vX`.
2. **Old agent observes + writes `Suspending`.** Verifies NATS liveness with a round-trip first.
3. **Old agent suspends + writes `Suspended`.** Reconcile loop paused; heartbeat continues so the operator knows it's alive.
4. **Old agent fetches new binary + writes `Fetched`.** Mechanism TBD (see open questions). `target_path: /usr/local/bin/fleet-agent.new`.
5. **Old agent launches new binary as a separate process + writes `NewLaunched`.** Not via systemd unit update yet. Includes `new_pid: N`.
6. **New agent self-checks + writes `NewHealthy`.** Connects to NATS, verifies permissions, one-shot smoke (KV read, command channel echo).
7. **Old agent writes `HandingOff` and exits.** Tells systemd to swap the binary: `systemctl daemon-reload` + `systemctl restart fleet-agent.service` with the new binary in place.
8. **Systemd starts the unit pointing at the new binary.** Final state `phase: Complete, completed_at: T`.

**On stall (configurable, default 5 min):**
- Marker records `phase: Failed` along with the last successful step.
- Operator surfaces this on the dashboard.
- Customer / operator intervenes manually — **no auto-rollback in v0.3**, consistent with the deployment roll-forward-only rule.
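
The protocol above is effectively a linear state machine with one failure exit. A minimal sketch, assuming the phase names map one-to-one onto enum variants; `Phase::next`, `can_transition_to`, `is_stalled`, and `DEFAULT_STALL` are hypothetical names, and the rule that any non-terminal phase may move to `Failed` is an interpretation of the stall handling, not a confirmed design.

```rust
use std::time::Duration;

/// Phases written to the `agent-upgrade` marker, in protocol order.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Requested,
    Suspending,
    Suspended,
    Fetched,
    NewLaunched,
    NewHealthy,
    HandingOff,
    Complete,
    Failed,
}

impl Phase {
    /// The single happy-path successor, or `None` for terminal phases.
    pub fn next(self) -> Option<Phase> {
        use Phase::*;
        match self {
            Requested => Some(Suspending),
            Suspending => Some(Suspended),
            Suspended => Some(Fetched),
            Fetched => Some(NewLaunched),
            NewLaunched => Some(NewHealthy),
            NewHealthy => Some(HandingOff),
            HandingOff => Some(Complete),
            Complete | Failed => None,
        }
    }

    /// A marker write is legal if it is the next step, or a move to
    /// `Failed` from any phase that is not already terminal.
    pub fn can_transition_to(self, to: Phase) -> bool {
        match (self, to) {
            (Phase::Complete | Phase::Failed, _) => false,
            (_, Phase::Failed) => true,
            (from, to) => from.next() == Some(to),
        }
    }
}

/// Default stall limit from the text: 5 minutes.
pub const DEFAULT_STALL: Duration = Duration::from_secs(5 * 60);

/// True when a non-terminal phase has gone longer than `limit` without
/// a new marker write.
pub fn is_stalled(phase: Phase, since_last_write: Duration, limit: Duration) -> bool {
    phase.next().is_some() && since_last_write > limit
}
```

Rejecting out-of-order writes at the marker layer gives the operator a cheap way to notice a confused agent, independent of the stall timer.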

### Open questions

- **Q1.1 Binary distribution.** Gitea release asset? Signed OCI artifact? Existing `arm-agents.yaml` uploads aarch64 binaries to releases — start with that.
- **Q1.2 Verification.** Hash signature? GPG? Minimum: SHA-256 pinned in the upgrade-request payload.
- **Q1.3 Atomic systemd swap.** `systemctl restart` is not atomic across binary-on-disk and process. Acceptable? Or `systemd-run --transient` shim?
- **Q1.4 Cross-arch.** Fetch URL has to know the device's arch. KV `device-info` already carries this; confirm the agent reads its own arch correctly.

### Done when

- Branch contains the protocol implementation + e2e test driving v0.3.0 → v0.3.1 upgrade against a libvirt VM.
- Operator sees every phase.
- Failure path tested: deliberately corrupt the new binary → marker reads `Failed`, old agent stays running.

---

## Chapter 5 — Graceful deployment upgrade, roll-forward only (#5)

**Goal:** upgrading a Deployment's image/config replaces the old container without dropping traffic mid-request. If the new container won't start, the customer sees the failure clearly and fixes the spec.

### Design

Extend `PodmanV0Score` with a `lifecycle` block:

```rust
pub struct PodmanV0Score {
    // ... existing fields ...
    pub lifecycle: Option<LifecyclePolicy>,
}

pub struct LifecyclePolicy {
    pub stop_signal: StopSignal, // SIGTERM (default), SIGINT, SIGUSR1
    pub grace_period: Duration,  // default 30s
    pub sigkill_fallback: bool,  // default true
}
```

Agent's reconcile when image/config changes:

1. **Write `Upgrading` phase.** New `DeploymentState::Phase::Upgrading` variant. Dashboard shows the in-progress upgrade.
2. **Send `stop_signal` to the old container.**
3. **Wait up to `grace_period` for clean exit.**
4. **SIGKILL fallback** if still running and `sigkill_fallback`.
5. **Start new container.**
6. **On startup failure: write `Failed` and stop.** Image pull error, exec error, crash within 5s. No revert to old image.
7. **On success: write `Running`.** Optionally gated by a Phase-1 smoke test (Chapter 10) when that lands.
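
Steps 2 to 4 reduce to one decision the agent re-evaluates while it waits. A minimal sketch under stated assumptions: the types below are local stand-ins for the `LifecyclePolicy` shown earlier, `StopAction` and `stop_action` are hypothetical names, and `FailStuck` is an assumption, because the plan does not say what happens when the container outlives the grace period and `sigkill_fallback` is off.

```rust
use std::time::Duration;

/// Local stand-in for the signal choices listed in the design.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StopSignal {
    Sigterm,
    Sigint,
    Sigusr1,
}

/// Local stand-in mirroring the `LifecyclePolicy` fields and defaults.
#[derive(Debug, Clone, Copy)]
pub struct LifecyclePolicy {
    pub stop_signal: StopSignal,
    pub grace_period: Duration,
    pub sigkill_fallback: bool,
}

impl Default for LifecyclePolicy {
    fn default() -> Self {
        Self {
            stop_signal: StopSignal::Sigterm,
            grace_period: Duration::from_secs(30),
            sigkill_fallback: true,
        }
    }
}

/// What the reconcile loop should do next after sending `stop_signal`.
#[derive(Debug, PartialEq, Eq)]
pub enum StopAction {
    /// Old container still running, grace period not yet exhausted.
    KeepWaiting,
    /// Old container has exited: start the new one (step 5).
    ProceedToStart,
    /// Grace period exhausted and fallback enabled (step 4).
    SendSigkill,
    /// ASSUMPTION: grace exhausted, fallback disabled; surface a failure.
    FailStuck,
}

pub fn stop_action(
    policy: &LifecyclePolicy,
    waited: Duration,
    still_running: bool,
) -> StopAction {
    if !still_running {
        return StopAction::ProceedToStart;
    }
    if waited < policy.grace_period {
        return StopAction::KeepWaiting;
    }
    if policy.sigkill_fallback {
        StopAction::SendSigkill
    } else {
        StopAction::FailStuck
    }
}
```

Because the function takes elapsed time as an argument instead of reading a clock, the e2e and unit tests can walk every branch without sleeping.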

### Explicit non-goals

- **No auto-rollback.** Customer-asked constraint. Step 6 firing → dashboard shows "Deployment failed; previous version stopped" and the customer edits the spec.
- **No "stale + new" window.** Single container per Deployment per device; short downtime during cutover is accepted.

### Done when

- Upgrade test in `harmony-fleet-e2e` walks v1 → v2 → v3 image swap with controlled failures.
- Dashboard reflects every step.

---

## Chapter 6 — Init containers (#6)

**Goal:** customer can declare init containers that run to completion before the main container starts. Mirror Kubernetes semantics so customer mental model transfers.

### Design

Extend `PodmanV0Score` with `init_containers: Vec<InitContainer>`:
- **Ordered** — declaration order = run order.
- **Run-to-completion** — each one must exit zero before the next starts.
- **Fail-the-Deployment on init failure** — non-zero exit or timeout exceeded.

```rust
pub struct InitContainer {
    pub name: String,
    pub image: String,
    pub args: Vec<String>,
    pub env: Vec<EnvVar>,
    pub volumes: Vec<VolumeMount>,
    pub timeout: Duration, // default 5 min
}
```
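
The three rules above (ordered, run-to-completion, fail on first error) can be sketched as a short loop. This is illustrative only: `InitOutcome`, `InitFailure`, and `run_init_sequence` are hypothetical names, and the `run_one` closure stands in for whatever the agent uses to run a container and wait for it with a timeout.

```rust
/// Result of running one init container to completion.
#[derive(Debug, PartialEq, Eq)]
pub enum InitOutcome {
    /// Container exited with this status code.
    Exited(i32),
    /// Container was still running when its timeout elapsed.
    TimedOut,
}

/// Which init container failed the Deployment, and how.
#[derive(Debug, PartialEq, Eq)]
pub struct InitFailure {
    pub name: String,
    pub outcome: InitOutcome,
}

/// Run init containers in declaration order. Stops at the first one that
/// does not exit zero; later containers are never started.
pub fn run_init_sequence<F>(names: &[&str], mut run_one: F) -> Result<(), InitFailure>
where
    F: FnMut(&str) -> InitOutcome,
{
    for &name in names {
        match run_one(name) {
            InitOutcome::Exited(0) => continue,
            outcome => {
                return Err(InitFailure {
                    name: name.to_string(),
                    outcome,
                })
            }
        }
    }
    Ok(())
}
```

Only after this returns `Ok(())` would the agent start the main container; an `Err` maps onto the Deployment's `Failed` state with the offending init container named.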

### Customer contract (document loudly)

**Init containers must be idempotent.** They run on every reconcile that requires a fresh main container — power-cycle recovery, graceful upgrade, etc.
- Customer-side migration scripts that aren't idempotent will misbehave.
- Document with examples.
- Add a Score-builder lint that warns on common non-idempotent patterns (e.g. `INSERT` without `ON CONFLICT`).

### Done when

- `harmony-fleet-e2e` deploys a Deployment with one init container (`mkdir -p /data && touch /data/initialized`) followed by a main container that asserts the file exists.
- Two-step ordering sequence tested.

---

## Chapter 7 — System upgrade, rollback deferred (#7)

**Goal:** the device can apt-upgrade its base OS without bricking. Implements the upgrade flow per the drafted ADR; **the LVM-snapshot rollback half is deferred to v0.4.**

### What ships in v0.3

- **Pre-upgrade snapshot creation** (LVM thin snapshot of root LV). Created but **not used for revert** in v0.3.
- **Boot-attempt counter on FAT `/boot` partition** (per ADR design).
- **Userspace control-plane check-in timer.**
- **Idempotent provisioning conversion script** (partition → PV/VG/LV, initramfs regen, cmdline.txt update, watchdog config).
- **Canary hardware test of the upgrade-succeeds path.**

### What's explicitly NOT in v0.3

- **Initramfs `local-top` boot-attempt hook** that triggers rollback.
- **Userspace soft-failure path** that merge-reverts the snapshot.
- **Any rollback wiring.**

The snapshot exists so v0.4 can flip on the rollback half without re-provisioning devices.

### Done when

- Canary Pi successfully upgrades from a known-good base image to a later one.
- Snapshot exists post-upgrade.
- No customer-visible regression.
- Per "Full Verification Before Done" rule: green on both aarch64 and x86_64 device classes.

---

## Chapter 8 — Secrets via Zitadel + OpenBao (#8, deferred)

- **Lands when `harmony_secret` is ready.**
- **Out of scope for v0.3 chapter-by-chapter work**, but **required before any production customer deploys an app that needs credentials**.
- **Track as a separate item.** Surface to the customer as: "your first Deployments should use plain environment variables only, with no credentials in them, until v0.3.x."

---

## Chapter 9 — Agent time-drift verification (#9)

**Goal:** agent refuses to operate (or warns loudly) when its clock is skewed enough to break JWT validation.

### Design

- **Startup NTP-style query** against a configurable server list (default: `time.cloudflare.com`, `pool.ntp.org`).
- **Refuse to start on |drift| > 30s.** Typical JWT skew tolerance — past it, every NATS callout request fails with a cryptic `exp invalid`.
- **Periodic re-check every 6 hours.** Mid-run drift past threshold → agent publishes a `DeviceInfo` health flag, dashboard surfaces it.
- **Specific customer-facing error message:** "system clock skew is 14m32s; JWT validation will fail. Enable `systemd-timesyncd` or `chrony`."
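
The threshold check and the error message can be sketched without any network code, assuming the NTP-style query has already produced a reference time. `MAX_DRIFT_SECS`, `format_skew`, and `check_drift` are hypothetical names; both timestamps are Unix seconds.

```rust
/// Startup threshold from the design: refuse when |drift| > 30s.
pub const MAX_DRIFT_SECS: i64 = 30;

/// Render a skew magnitude like `14m32s`, or `45s` when under a minute.
pub fn format_skew(secs: i64) -> String {
    let abs = secs.unsigned_abs();
    let (minutes, seconds) = (abs / 60, abs % 60);
    if minutes > 0 {
        format!("{minutes}m{seconds}s")
    } else {
        format!("{seconds}s")
    }
}

/// Compare the local clock against a reference clock.
/// Returns the customer-facing error text when the skew is too large.
pub fn check_drift(local_unix: i64, reference_unix: i64) -> Result<(), String> {
    let drift = local_unix - reference_unix;
    if drift.abs() > MAX_DRIFT_SECS {
        Err(format!(
            "system clock skew is {}; JWT validation will fail. \
             Enable `systemd-timesyncd` or `chrony`.",
            format_skew(drift)
        ))
    } else {
        Ok(())
    }
}
```

The same function could serve the startup check and the 6-hour re-check; only the reaction differs (refuse to start versus publish a health flag).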

### Done when

- Test in `harmony-fleet-e2e` runs against a libvirt VM with clock forced 5 minutes off.
- Agent refuses to start with the expected error message.
- Recovery: fix the clock → agent comes up clean.

---

## Chapter 10 — Phase 1 smoke wiring (#10)

**Goal:** real fleet Scores carry real smoke tests. The Phase 0 contract becomes load-bearing.

### Scope

- **`HttpHealthy` probe** — GET a URL, expect 2xx, optional response-body-contains assertion.
- **`K8sPodReady` probe** — kube client lookup for pod readiness condition.
- **`NatsKvKeyExists` probe** — KV bucket + key, optional value-deserializes-to-T assertion.
- **`FleetOperatorSmokeTest`** — pairs with `FleetOperatorScore`. Operator pod ready + `/healthz` returns 200 + can write to `device-info` KV.
- **`FleetAgentSmokeTest`** — pairs with `FleetAgentScore`. Agent pod ready + heartbeat published to KV within 30s.
- **`HarmonyEvent::SmokeStage{Started,Finished,Skipped}` variants** (additive) so the dashboard can render the live pipeline.
- **Dashboard pipeline view** — maud renderer subscribing to instrumentation events.
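
As an illustration of the `HttpHealthy` rule (2xx plus an optional body-contains check), here is a sketch of just the evaluation step, assuming the HTTP request has already been made. `http_healthy` is a hypothetical function name and is not the `Probe` trait from the Phase 0 contract, whose signature is not shown in this plan.

```rust
/// Evaluate an HTTP response against the `HttpHealthy` rule described in
/// the scope list: status must be 2xx, and if `must_contain` is set, the
/// body must include it. The error string names what failed so the
/// dashboard can show it next to the probe.
pub fn http_healthy(
    status: u16,
    body: &str,
    must_contain: Option<&str>,
) -> Result<(), String> {
    if !(200..300).contains(&status) {
        return Err(format!("expected 2xx, got {status}"));
    }
    match must_contain {
        Some(needle) if !body.contains(needle) => {
            Err(format!("response body does not contain {needle:?}"))
        }
        _ => Ok(()),
    }
}
```

Separating evaluation from transport keeps the unit tests free of network access; the integration test against staging then only has to cover the request itself.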

### Sequencing within this chapter (strict order)

1. **`HarmonyEvent` variants** — one-line additive change to `harmony/src/domain/instrumentation.rs`.
2. **Probes one at a time** — HTTP, K8sPodReady, NatsKvKeyExists. Each: unit tests + an integration test against the staging cluster.
3. **`FleetOperatorSmokeTest`** composing the above.
4. **`FleetAgentSmokeTest`.**
5. **Dashboard renderer last** — once the events are flowing, UI is mostly maud + htmx polling.

### Done when

- `deploy_with_smoke(FleetOperatorScore, FleetOperatorSmokeTest, ...)` returns successfully against staging.
- Dashboard shows the live pipeline.
- Deliberate breakage (point the operator's helm chart at a bad image) → smoke fails visibly, failing probe named on dashboard.

---

## Chapter 11 — CI yaml minimization (#11, longer-term)

Pulled out of the chapter-by-chapter v0.3 work.

- **Frame:** workflow yaml files in `.gitea/workflows/` (4 files, ~235 LOC) should hold only what Gitea Actions needs for job discovery + parallel viz. Job *bodies* are one-line calls into portable scripts.

### Direction

- **Build out a `harmony-ci` Rust CLI crate.** Commands like `harmony-ci build composer-linux`, `harmony-ci publish operator-image`, `harmony-ci check`.
- **Each workflow yaml job becomes** `run: cargo run -p harmony-ci -- <command>`.
- **Scripts must run identically from a developer's laptop.**

### Not in v0.3

- Multi-day effort; doesn't block the customer.
- Slot when bandwidth allows.
- Opportunistically convert when touching a workflow file for other reasons.

---

## Chapter 12 — NATS callout CI hardening (#12, minimal)

- **`nats/callout` is a low-churn crate that works today.**
- **Workspace-wide `cargo test`** in `.gitea/workflows/check.yml` covers the non-ignored tests.
- **Four `#[ignore]`'d integration tests** in `nats/integration-test-callout/tests/callout_e2e.rs` need podman + a NATS image pull in the runner.

### Direction

- **Don't add CI infra in v0.3 just to run these.**
- **When a runner with podman + image pull exists for other reasons** (e2e harness, system upgrade test matrix), add the callout integration tests to it.
- **Until then:** keep current workspace-wide coverage.

---

## Out of scope for v0.3 (deferred deliberately)

| Item                                            | Target      | Why deferred                                                       |
|-------------------------------------------------|-------------|--------------------------------------------------------------------|
| Deployment-level auto-rollback                  | maybe never | Customer asked for roll-forward only.                              |
| System-upgrade LVM-snapshot rollback half       | v0.4        | Push to prod first; widen scope after.                             |
| Live log tailing (streaming)                    | v0.4        | Chapter 3 ships sync `getLogs`; live tail builds on it.            |
| Deployment dependencies (cross-deploy ordering) | TBD         | Init containers cover the common case; wait for customer ask.      |
| Secrets via Zitadel + OpenBao                   | v0.3.x      | Blocked on `harmony_secret` work.                                  |
| Containerized agent (podman instead of systemd) | v0.4+       | Self-upgrade protocol matures first on systemd.                    |
| Operator HA (active/active or active/passive)   | TBD         | One pod sufficient for v0.3; scale-out when fleet size demands.    |
| Multi-tenant fleet isolation tests              | v0.4        | Callout permissions cover the mechanism; cross-tenant smoke later. |

---

## Open questions

These don't block starting v0.3 work but need resolution before the relevant chapter completes.

- **Q1 (Chapter 4): Binary distribution mechanism for agent upgrades.** Gitea releases vs OCI artifacts vs something else.
- **Q2 (Chapter 2): Snapshot the aggregate to KV?** Faster recovery vs invalidation complexity.
- **Q3 (Chapter 7): Canary test matrix?** Concretely: which Pi models, which base images, which apt sources.
- **Q4 (Chapters 5 + 10): Sequencing of Chapter 10 vs Chapter 5.** Both benefit from smoke; right answer might be to ship Phase 1 smoke *during* Chapter 5 so upgrade gates on it. Decide when starting Chapter 5.
- **Q5 (cross-cutting): One operator pod or active/passive?** Customer's fleet size answers this; ask before Chapter 2 starts.

---

## When v0.3 is done

- **Chapters 1–7, 9, and 10 merged.** Chapter 8 (secrets) is deferred and tracked separately for v0.3.x.
- **A real customer Deployment runs on a real Pi in a real basement.**
- **The dashboard shows live status and logs.**
- **An agent upgrade has been driven through the full protocol successfully** (and a failure path tested).
- **A system upgrade has been driven through the full protocol on a canary.**

v0.4 picks up the deferred items in priority order.