feat(fleet-deploy): smoke-test contract as a Score companion #292
491
ROADMAP/fleet_platform/v0_3_plan.md
Normal file
491
ROADMAP/fleet_platform/v0_3_plan.md
Normal file
@@ -0,0 +1,491 @@
|
||||
# Fleet Platform v0.3 — last-mile plan
|
||||
|
||||
Authoritative plan for the last mile before the fleet ships to a real
|
||||
customer. Picks up where `v0_2_plan.md` left the chapter structure.
|
||||
Written 2026-05-24, after `feat/iot-walking-skeleton` (#264) merged
|
||||
and `feat/smoke-test-contract` landed the Phase 0 smoke companion.
|
||||
|
||||
**The frame:**
|
||||
|
||||
- **v0.1 proved the shape.**
|
||||
- **v0.2 locked the brick design.**
|
||||
- **v0.3 makes the brick safe to hand to a customer** running production workloads on Pis in their basement.
|
||||
|
||||
## State coming in
|
||||
|
||||
- **IoT walking skeleton merged.** Operator + agent + NATS + Zitadel + auth callout running end-to-end against an OKD staging cluster.
|
||||
- **Smoke-test contract Phase 0 merged** (`feat/smoke-test-contract`).
|
||||
- `Probe` / `SmokeSuite` / `SmokeTest` companion + `deploy_with_smoke` in `harmony-fleet-deploy/src/companion/smoke/`.
|
||||
- One concrete probe today: `TcpReachable`.
|
||||
- No fleet Score wired to a real smoke test yet — Phase 1 is in this roadmap.
|
||||
- **Agent runs as a systemd user unit on devices** (see `harmony/src/modules/fleet/setup_score.rs:263–283`).
|
||||
- No on-device containerized agent path.
|
||||
- The Dockerfile in `fleet/harmony-fleet-agent/Dockerfile` is k8s-only today.
|
||||
- **Dashboard has no role enforcement — security gap.**
|
||||
- Maud/htmx frontend at `fleet/harmony-fleet-operator/src/frontend/server.rs`.
|
||||
- Verifies Zitadel JWT signature + expiry only.
|
||||
- `JwksCache::verify` (`harmony_zitadel_auth/src/jwks.rs:74`) extracts `sub`/`exp`/`email`/`name`/`nonce` — no roles.
|
||||
- `VerifiedSession` has no `roles` field.
|
||||
- Any logged-in Zitadel user gets full dashboard access. Fix immediately (Chapter 1).
|
||||
- **NATS callout already has the role-extraction logic we need.**
|
||||
- `ZitadelValidator::extract_roles` at `nats/callout/src/zitadel.rs:203`.
|
||||
- Handles both array shape (`["fleet-admin"]`) and Zitadel's object-map shape (`{"fleet-admin": {org_id: org_name}}`).
|
||||
- `roles::resolve` maps role names to `ResolvedRole::Admin`/`::Device` with admin-wins privilege escalation.
|
||||
- Chapter 1 reuses the extractor, not the role-to-NATS-permission half.
|
||||
- **System upgrade ADR drafted** at `docs/adr/drafts/Fleet-IoT-Device-System-Upgrade-With-Rollback.md`.
|
||||
- Header says Accepted 2026-05-24 but lives under `drafts/`.
|
||||
- Authoritative status: approach agreed, rollback half deferred (Chapter 7).
|
||||
|
||||
## Customer constraints baked into this plan
|
||||
|
||||
- **Deployments are roll-forward only.** No auto-rollback when a new Deployment version fails. Dashboard surfaces the failure; customer edits the spec and rolls forward. Customer ask; may change later, not in v0.3.
|
||||
- **System rollback is deferred to v0.4.** v0.3 implements *upgrade* per the ADR; the LVM-snapshot rollback half waits until we've shipped something to production.
|
||||
- **Secrets need Zitadel + OpenBao.** No plaintext-env-var shortcut. `harmony_secret` + OpenBao work is on the critical path for any Deployment that needs credentials.
|
||||
|
||||
## Feature checklist
|
||||
|
||||
Status legend: ✅ shipped · 🟡 in flight · 🔴 not started · ⏸ deferred (target version in note).
|
||||
|
||||
| # | Feature | Status | Owner / branch | Notes |
|
||||
|----|----------------------------------------------------------|---------|---------------------------------|-------|
|
||||
| 1 | Dashboard role enforcement (`fleet-admin` required) | 🔴 | next branch | Reuse `ZitadelValidator::extract_roles`. Do this **right now** — security gap. |
|
||||
| 2 | Operator restart / aggregator cold-rebuild | 🔴 | next branch | More critical than smoke wiring; ship before any customer. |
|
||||
| 3 | Deployment `getLogs` companion + dashboard log view | 🔴 | next branch | "Makes dashboard useful rather than a toy." Score companion shape. |
|
||||
| 4 | Agent self-upgrade (NATS-coordinated, systemd-resident) | 🔴 | new branch | Marker lives in NATS, not on disk. Systemd stays. |
|
||||
| 5 | Graceful deployment upgrade (roll-forward only) | 🔴 | new branch | SIGTERM → grace → SIGKILL fallback → start new. No rollback. |
|
||||
| 6 | Init containers in `PodmanV0Score` | 🔴 | new branch | Ordered, run-to-completion, customer guarantees idempotency. |
|
||||
| 7 | System upgrade (no rollback yet) | 🔴 | new branch | Per drafted ADR, minus the LVM-snapshot rollback half. |
|
||||
| 8 | Secrets via Zitadel + OpenBao for Deployments | ⏸ v0.3+ | blocked on `harmony_secret` | Required for production but not blocking the first customer. |
|
||||
| 9 | Agent time-drift verification | 🔴 | new branch | Periodic NTP check; refuse JWT operations if skewed. |
|
||||
| 10 | Phase 1 smoke wiring (HTTP / K8sPodReady / NatsKv probes) | 🔴 | new branch | After required features land. Not a functional blocker. |
|
||||
| 11 | CI yaml minimization (logic into `harmony-ci` scripts) | ⏸ v0.4 | longer-term | Yaml stays for discovery + parallel viz; scripts move. |
|
||||
| 12 | NATS callout CI hardening | ⏸ | low-churn crate | Already covered by workspace `cargo test`. Run ignored tests when CI has podman + NATS image. |
|
||||
| 13 | Application log streaming through NATS | ⏸ v0.4 | follow-on to #3 | #3 is the synchronous `getLogs`; this is the live tail. |
|
||||
| 14 | Deployment dependencies (`after: [...]`) | ⏸ | not chosen | Init containers (#6) cover the in-deployment case; defer until customers ask. |
|
||||
|
||||
## Sequencing
|
||||
|
||||
| Order | Item | Why |
|
||||
|-------|---------------------------------------------------|--------------------------------------------------------------------------------------------------|
|
||||
| 1 | #1 Dashboard role enforcement | Security gap, do right now. |
|
||||
| 2 | #2 Operator restart recovery | More critical than smoke wiring. Customer can't tolerate "operator restarted, state unknown." |
|
||||
| 3 | #3 Log forwarding companion | Turns the dashboard from a toy into a thing customers actually use. |
|
||||
| 4 | #4 Agent self-upgrade | Parallel-safe with #2/#3 — different code paths. |
|
||||
| 5 | #5 + #6 Graceful upgrade + init containers | Paired Deployment-layer features; ship together. |
|
||||
| 6 | #9 Time-drift verification | Small, isolated; slot between heavier items. |
|
||||
| 7 | #7 System upgrade | Builds on agent-upgrade pattern from #4 — #4 lands first. |
|
||||
| 8 | #10 Phase 1 smoke wiring | After required features so probes verify real customer-facing surfaces. |
|
||||
| 9 | #8 Secrets | Blocks any customer Deployment that needs credentials. Promote if first customer needs them. |
|
||||
| 10 | #11 / #12 CI | Opportunistic, doesn't block customer. |
|
||||
|
||||
---
|
||||
|
||||
## Chapter 1 — Dashboard role enforcement (#1)
|
||||
|
||||
**Goal:** every dashboard page requires a valid Zitadel session **and** a `fleet-admin` role on the token.
|
||||
- Users without the role get a 403 with a clear message.
|
||||
- Users without a session get the existing login redirect.
|
||||
|
||||
### Current state
|
||||
|
||||
- **JWKS verify only extracts identity claims.** `JwksCache::verify` (`harmony_zitadel_auth/src/jwks.rs:74`) parses the JWT and returns a `VerifiedSession` with `sub`/`exp`/`email`/`name`/`nonce`. Roles not extracted.
|
||||
- **`VerifiedSession` has no `roles` field** (`harmony_zitadel_auth/src/session.rs:5`).
|
||||
- **Middleware checks JWT validity only.** `require_auth` (`fleet/harmony-fleet-operator/src/frontend/server.rs:136–157`). Every authenticated user gets all pages.
|
||||
- **Role extraction logic already exists and is correct** in the callout: `ZitadelValidator::extract_roles` at `nats/callout/src/zitadel.rs:203`. Handles both shapes:
|
||||
- array — `["fleet-admin"]`
|
||||
- object-map — `{"fleet-admin": {org_id: org_name}}`
|
||||
|
||||
### Plan
|
||||
|
||||
1. **Extract a shared role-extraction helper into `harmony_zitadel_auth`** so dashboard and callout import from one place. Callout keeps its API but its body delegates.
|
||||
2. **Extend `VerifiedSession`** with `roles: Vec<String>`.
|
||||
3. **Extend the JWKS `Claims` decode struct** to capture the configured roles claim. Pull the claim name from existing callout config so the two systems agree (Zitadel ships `urn:zitadel:iam:org:project:roles` or similar).
|
||||
4. **Add `require_role(role: &'static str)` middleware** to the dashboard. Compose with `require_auth`. Use on every `Router::route(..., post|get(...).layer(...))`.
|
||||
5. **403 response renders a maud page** — "fleet-admin role required; ask your administrator." Not a JSON error; dashboard is human-facing.
|
||||
|
||||
### Tests
|
||||
|
||||
Security code — heavy unit tests are non-negotiable.
|
||||
|
||||
- **Array-shape claim → fleet-admin in session.** JWT with array-shape role claim.
|
||||
- **Object-map shape → identical resolution.** Same role, Zitadel's other claim shape.
|
||||
- **No role claim → empty roles.** Token with no `roles` claim.
|
||||
- **Wrong role doesn't elevate.** JWT with only `device` role does NOT carry `fleet-admin`.
|
||||
- **No session → 401/redirect.**
|
||||
- **Session but no `fleet-admin` → 403.**
|
||||
- **Session + `fleet-admin` → 200.**
|
||||
|
||||
### Done when
|
||||
|
||||
- Branch merged.
|
||||
- All dashboard handlers gated by `require_role("fleet-admin")`.
|
||||
- Every test green.
|
||||
- Manual smoke against staging Zitadel: user without role sees 403.
|
||||
|
||||
---
|
||||
|
||||
## Chapter 2 — Operator restart + aggregator recovery (#2)
|
||||
|
||||
**Goal:** the operator pod can be killed, upgraded, or rescheduled at any time and the system converges back to correct state from NATS KV alone. No "unknown state" window visible to customers.
|
||||
|
||||
### Current state
|
||||
|
||||
- **Aggregator rebuilds from scratch on startup.** `fleet_aggregator.rs` (833 LOC, in `harmony-fleet-operator/src/`) watches the KV buckets to materialize state. JG confirmed: "rebuilt from scratch, yes."
|
||||
- **Failure modes not exercised yet:**
|
||||
- Partial KV — device offline during operator reset, never re-published its info.
|
||||
- Two operator pods racing during a rolling deploy of the operator.
|
||||
- NATS stream loss between operator restart and rebuild completing.
|
||||
- Stale KV — Deployment CR deleted in kube while operator was down.
|
||||
|
||||
### Plan
|
||||
|
||||
**Scenario-driven.** Enumerate failure shapes, then handle one at a time. Discipline: each scenario gets a regression test in `harmony-fleet-e2e`, then the fix.
|
||||
|
||||
1. **Scenario inventory pass.** Write `docs/fleet-operator-recovery-scenarios.md` listing every failure shape we can think of. Cross-reference smoke-a* tests to identify what's already covered.
|
||||
2. **Cold-start rebuild as the baseline.** Confirm + test that `kubectl delete pod` of the operator and waiting for the replacement converges to pre-kill aggregate in < 30s. Gate on convergence time at N device count.
|
||||
3. **Stale-KV reconciliation.** Define the rule for "KV says device D has Deployment X, but Deployment X no longer exists in kube." Operator cleans up; agents observe the deletion.
|
||||
4. **Leader election decision.** Ship with leader election (one writer at a time) or design for idempotent multi-writer? Score-Topology-Interpret leans idempotent; confirm + assert operator writes are byte-deterministic.
|
||||
5. **Liveness signaling for the dashboard.** Surface "operator converged" / "operator recovering" as states the frontend renders. Customer sees a loading banner, not a blank dashboard, during rebuild.
|
||||
|
||||
### Open questions
|
||||
|
||||
- **Warm-restart snapshot?** Keep a per-operator-pod "last known aggregate" snapshot in a KV bucket so warm restarts skip cold rebuild? Probably yes for >1000-device fleets; adds an invalidation problem.
|
||||
- **One pod or active/passive?** Customer's fleet size answers this. Ask before starting.
|
||||
|
||||
### Done when
|
||||
|
||||
- Scenario inventory exists.
|
||||
- Each scenario has a regression test, all green.
|
||||
- Manual chaos: kill operator pod during high write load → convergence + dashboard liveness banner observed.
|
||||
|
||||
---
|
||||
|
||||
## Chapter 3 — Application log forwarding companion (#3)
|
||||
|
||||
**Goal:** when a customer's Deployment is misbehaving on a Pi in the field, the dashboard shows last-N-lines of container logs without anyone SSH-ing the device.
|
||||
|
||||
### Design
|
||||
|
||||
- **Logs attach as a Score companion** — same pattern as the smoke-test contract.
|
||||
- **The companion is optional** — Scores without one render "this deployment doesn't expose logs". Acceptable.
|
||||
- **Sync `getLogs` ships in v0.3; live tail (streaming) waits for v0.4** — that's the minimum useful UX.
|
||||
|
||||
Shape:
|
||||
|
||||
```rust
|
||||
// new in harmony-fleet-deploy/src/companion/logs/
|
||||
pub trait LogQuery<T: Topology>: Send + Sync {
|
||||
type Score: Score<T>;
|
||||
async fn last_lines(
|
||||
&self,
|
||||
score: &Self::Score,
|
||||
topology: &T,
|
||||
n: usize,
|
||||
) -> Result<LogChunk, LogQueryError>;
|
||||
}
|
||||
|
||||
pub struct LogChunk {
|
||||
pub source: ProbeName, // reuse the validated newtype
|
||||
pub captured_at: chrono::DateTime<chrono::Utc>,
|
||||
pub lines: Vec<String>,
|
||||
pub truncated: bool,
|
||||
}
|
||||
```
|
||||
|
||||
For `PodmanV0Score`:
|
||||
- **Transport: NATS request/reply.** Subject `device-commands.<device_id>.logs.<deployment>`.
|
||||
- **Agent side:** runs `podman logs --tail N <container>` and replies with a `LogChunk`.
|
||||
- **Dashboard side:** one async call from the logs handler.
|
||||
|
||||
### Plan
|
||||
|
||||
1. **Define `LogQuery` companion trait** in a new `harmony-fleet-deploy/src/companion/logs/` module.
|
||||
2. **`PodmanLogQuery`** implementing `LogQuery<…> for PodmanV0Score`.
|
||||
3. **Agent-side command handler** — extend the existing request/reply command dispatcher.
|
||||
4. **Dashboard handler** at `/deployments/<name>/devices/<id>/logs?lines=N` returning rendered maud.
|
||||
5. **Tests:** unit on `PodmanLogQuery`; integration in `harmony-fleet-e2e` driving end-to-end.
|
||||
|
||||
### Done when
|
||||
|
||||
- Customer clicks "View logs" on the dashboard.
|
||||
- Sees the last 200 lines.
|
||||
- Call returns in < 2s on a 3-device fleet.
|
||||
|
||||
---
|
||||
|
||||
## Chapter 4 — Agent self-upgrade, NATS-coordinated (#4)
|
||||
|
||||
**Goal:** the agent can upgrade itself in place. If NATS is unavailable, the upgrade does not start. The operator sees every step.
|
||||
|
||||
### Design (per JG's direction)
|
||||
|
||||
- **Stay on systemd for v0.3.** Switching the agent runtime to podman is its own risk; defer until self-upgrade protocol matures.
|
||||
- **Upgrade marker lives in NATS, not on disk.** New KV bucket `agent-upgrade` keyed by `device_id`, carrying `start_timestamp`, `invoker_version`, `target_version`, `phase`.
|
||||
- **No NATS → no upgrade.** Feature, not limitation: operator can't observe an upgrade it can't see, so refusing without NATS prevents silent half-upgrades.
|
||||
|
||||
### Protocol
|
||||
|
||||
1. **Operator writes `Requested`.** `agent-upgrade/<device_id>` with `phase: Requested, target_version: vX`.
|
||||
2. **Old agent observes + writes `Suspending`.** Verifies NATS liveness with a round-trip first.
|
||||
3. **Old agent suspends + writes `Suspended`.** Reconcile loop paused; heartbeat continues so the operator knows it's alive.
|
||||
4. **Old agent fetches new binary + writes `Fetched`.** Mechanism TBD (see open questions). `target_path: /usr/local/bin/fleet-agent.new`.
|
||||
5. **Old agent launches new binary as a separate process + writes `NewLaunched`.** Not via systemd unit update yet. Includes `new_pid: N`.
|
||||
6. **New agent self-checks + writes `NewHealthy`.** Connects to NATS, verifies permissions, one-shot smoke (KV read, command channel echo).
|
||||
7. **Old agent writes `HandingOff` and exits.** Tells systemd to swap the binary: `systemctl daemon-reload` + `systemctl restart fleet-agent.service` with the new binary in place.
|
||||
8. **Systemd starts the unit pointing at the new binary.** Final state `phase: Complete, completed_at: T`.
|
||||
|
||||
**On stall (configurable, default 5 min):**
|
||||
- Marker writes `phase: Failed` with last successful step.
|
||||
- Operator surfaces this on the dashboard.
|
||||
- Customer / operator intervenes manually — **no auto-rollback in v0.3**, consistent with the deployment roll-forward-only rule.
|
||||
|
||||
### Open questions
|
||||
|
||||
- **Q1.1 Binary distribution.** Gitea release asset? Signed OCI artifact? Existing `arm-agents.yaml` uploads aarch64 binaries to releases — start with that.
|
||||
- **Q1.2 Verification.** Hash signature? GPG? Minimum: SHA-256 pinned in the upgrade-request payload.
|
||||
- **Q1.3 Atomic systemd swap.** `systemctl restart` is not atomic across binary-on-disk and process. Acceptable? Or `systemd-run --transient` shim?
|
||||
- **Q1.4 Cross-arch.** Fetch URL has to know the device's arch. KV `device-info` already carries this; confirm the agent reads its own arch correctly.
|
||||
|
||||
### Done when
|
||||
|
||||
- Branch contains the protocol implementation + e2e test driving v0.3.0 → v0.3.1 upgrade against a libvirt VM.
|
||||
- Operator sees every phase.
|
||||
- Failure path tested: deliberately corrupt the new binary → marker reads `Failed`, old agent stays running.
|
||||
|
||||
---
|
||||
|
||||
## Chapter 5 — Graceful deployment upgrade, roll-forward only (#5)
|
||||
|
||||
**Goal:** upgrading a Deployment's image/config replaces the old container without dropping traffic mid-request. If the new container won't start, the customer sees the failure clearly and fixes the spec.
|
||||
|
||||
### Design
|
||||
|
||||
Extend `PodmanV0Score` with a `lifecycle` block:
|
||||
|
||||
```rust
|
||||
pub struct PodmanV0Score {
|
||||
// ... existing fields ...
|
||||
pub lifecycle: Option<LifecyclePolicy>,
|
||||
}
|
||||
|
||||
pub struct LifecyclePolicy {
|
||||
pub stop_signal: StopSignal, // SIGTERM (default), SIGINT, SIGUSR1
|
||||
pub grace_period: Duration, // default 30s
|
||||
pub sigkill_fallback: bool, // default true
|
||||
}
|
||||
```
|
||||
|
||||
Agent's reconcile when image/config changes:
|
||||
|
||||
1. **Write `Upgrading` phase.** New `DeploymentState::Phase::Upgrading` variant. Dashboard shows the in-progress upgrade.
|
||||
2. **Send `stop_signal` to the old container.**
|
||||
3. **Wait up to `grace_period` for clean exit.**
|
||||
4. **SIGKILL fallback** if still running and `sigkill_fallback`.
|
||||
5. **Start new container.**
|
||||
6. **On startup failure: write `Failed` and stop.** Image pull error, exec error, crash within 5s. No revert to old image.
|
||||
7. **On success: write `Running`.** Optionally gated by a Phase-1 smoke test (Chapter 10) when that lands.
|
||||
|
||||
### Explicit non-goals
|
||||
|
||||
- **No auto-rollback.** Customer-asked constraint. Step 6 firing → dashboard shows "Deployment failed; previous version stopped" and the customer edits the spec.
|
||||
- **No "stale + new" window.** Single container per Deployment per device; short downtime during cutover is accepted.
|
||||
|
||||
### Done when
|
||||
|
||||
- Upgrade test in `harmony-fleet-e2e` walks v1 → v2 → v3 image swap with controlled failures.
|
||||
- Dashboard reflects every step.
|
||||
|
||||
---
|
||||
|
||||
## Chapter 6 — Init containers (#6)
|
||||
|
||||
**Goal:** customer can declare init containers that run to completion before the main container starts. Mirror Kubernetes semantics so customer mental model transfers.
|
||||
|
||||
### Design
|
||||
|
||||
Extend `PodmanV0Score` with `init_containers: Vec<InitContainer>`:
|
||||
- **Ordered** — declaration order = run order.
|
||||
- **Run-to-completion** — each one must exit zero before the next starts.
|
||||
- **Fail-the-Deployment on init failure** — non-zero exit or timeout exceeded.
|
||||
|
||||
```rust
|
||||
pub struct InitContainer {
|
||||
pub name: String,
|
||||
pub image: String,
|
||||
pub args: Vec<String>,
|
||||
pub env: Vec<EnvVar>,
|
||||
pub volumes: Vec<VolumeMount>,
|
||||
pub timeout: Duration, // default 5 min
|
||||
}
|
||||
```
|
||||
|
||||
### Customer contract (document loudly)
|
||||
|
||||
**Init containers must be idempotent.** They run on every reconcile that requires a fresh main container — power-cycle recovery, graceful upgrade, etc.
|
||||
- Customer-side migration scripts that aren't idempotent will misbehave.
|
||||
- Document with examples.
|
||||
- Add a Score-builder lint that warns on common non-idempotent patterns (e.g. `INSERT` without `ON CONFLICT`).
|
||||
|
||||
### Done when
|
||||
|
||||
- `harmony-fleet-e2e` deploys a Deployment with one init container (`mkdir -p /data && touch /data/initialized`) followed by a main container that asserts the file exists.
|
||||
- Two-step ordering sequence tested.
|
||||
|
||||
---
|
||||
|
||||
## Chapter 7 — System upgrade, rollback deferred (#7)
|
||||
|
||||
**Goal:** the device can apt-upgrade its base OS without bricking. Implements the upgrade flow per the drafted ADR; **the LVM-snapshot rollback half is deferred to v0.4.**
|
||||
|
||||
### What ships in v0.3
|
||||
|
||||
- **Pre-upgrade snapshot creation** (LVM thin snapshot of root LV). Created but **not used for revert** in v0.3.
|
||||
- **Boot-attempt counter on FAT `/boot` partition** (per ADR design).
|
||||
- **Userspace control-plane check-in timer.**
|
||||
- **Idempotent provisioning conversion script** (partition → PV/VG/LV, initramfs regen, cmdline.txt update, watchdog config).
|
||||
- **Canary hardware test of the upgrade-succeeds path.**
|
||||
|
||||
### What's explicitly NOT in v0.3
|
||||
|
||||
- **Initramfs `local-top` boot-attempt hook** that triggers rollback.
|
||||
- **Userspace soft-failure path** that merge-reverts the snapshot.
|
||||
- **Any rollback wiring.**
|
||||
|
||||
The snapshot exists so v0.4 can flip on the rollback half without re-provisioning devices.
|
||||
|
||||
### Done when
|
||||
|
||||
- Canary Pi successfully upgrades from a known-good base image to a later one.
|
||||
- Snapshot exists post-upgrade.
|
||||
- No customer-visible regression.
|
||||
- Per "Full Verification Before Done" rule: green on both aarch64 and x86_64 device classes.
|
||||
|
||||
---
|
||||
|
||||
## Chapter 8 — Secrets via Zitadel + OpenBao (#8, deferred)
|
||||
|
||||
- **Lands when `harmony_secret` is ready.**
|
||||
- **Out of scope for v0.3 chapter-by-chapter work**, but **required before any production customer deploys an app that needs credentials**.
|
||||
- **Track as a separate item.** Surface to the customer as: "your first Deployments should use environment variables only until v0.3.x."
|
||||
|
||||
---
|
||||
|
||||
## Chapter 9 — Agent time-drift verification (#9)
|
||||
|
||||
**Goal:** agent refuses to operate (or warns loudly) when its clock is skewed enough to break JWT validation.
|
||||
|
||||
### Design
|
||||
|
||||
- **Startup NTP-style query** against a configurable server list (default: `time.cloudflare.com`, `pool.ntp.org`).
|
||||
- **Refuse to start on |drift| > 30s.** Typical JWT skew tolerance — past it, every NATS callout request fails with a cryptic `exp invalid`.
|
||||
- **Periodic re-check every 6 hours.** Mid-run drift past threshold → agent publishes a `DeviceInfo` health flag, dashboard surfaces it.
|
||||
- **Specific customer-facing error message:** "system clock skew is 14m32s; JWT validation will fail. Enable `systemd-timesyncd` or `chrony`."
|
||||
|
||||
### Done when
|
||||
|
||||
- Test in `harmony-fleet-e2e` runs against a libvirt VM with clock forced 5 minutes off.
|
||||
- Agent refuses to start with the expected error message.
|
||||
- Recovery: fix the clock → agent comes up clean.
|
||||
|
||||
---
|
||||
|
||||
## Chapter 10 — Phase 1 smoke wiring (#10)
|
||||
|
||||
**Goal:** real fleet Scores carry real smoke tests. The Phase 0 contract becomes load-bearing.
|
||||
|
||||
### Scope
|
||||
|
||||
- **`HttpHealthy` probe** — GET a URL, expect 2xx, optional response-body-contains assertion.
|
||||
- **`K8sPodReady` probe** — kube client lookup for pod readiness condition.
|
||||
- **`NatsKvKeyExists` probe** — KV bucket + key, optional value-deserializes-to-T assertion.
|
||||
- **`FleetOperatorSmokeTest`** — pairs with `FleetOperatorScore`. Operator pod ready + `/healthz` returns 200 + can write to `device-info` KV.
|
||||
- **`FleetAgentSmokeTest`** — pairs with `FleetAgentScore`. Agent pod ready + heartbeat published to KV within 30s.
|
||||
- **`HarmonyEvent::SmokeStage{Started,Finished,Skipped}` variants** (additive) so the dashboard can render the live pipeline.
|
||||
- **Dashboard pipeline view** — maud renderer subscribing to instrumentation events.
|
||||
|
||||
### Sequencing within this chapter (strict order)
|
||||
|
||||
1. **`HarmonyEvent` variants** — one-line additive change to `harmony/src/domain/instrumentation.rs`.
|
||||
2. **Probes one at a time** — HTTP, K8sPodReady, NatsKvKeyExists. Each: unit tests + an integration test against the staging cluster.
|
||||
3. **`FleetOperatorSmokeTest`** composing the above.
|
||||
4. **`FleetAgentSmokeTest`.**
|
||||
5. **Dashboard renderer last** — once the events are flowing, UI is mostly maud + htmx polling.
|
||||
|
||||
### Done when
|
||||
|
||||
- `deploy_with_smoke(FleetOperatorScore, FleetOperatorSmokeTest, ...)` returns successfully against staging.
|
||||
- Dashboard shows the live pipeline.
|
||||
- Deliberate breakage (point the operator's helm chart at a bad image) → smoke fails visibly, failing probe named on dashboard.
|
||||
|
||||
---
|
||||
|
||||
## Chapter 11 — CI yaml minimization (#11, longer-term)
|
||||
|
||||
Pulled out of the chapter-by-chapter v0.3 work.
|
||||
|
||||
- **Frame:** workflow yaml files in `.gitea/workflows/` (4 files, ~235 LOC) should hold only what Gitea Actions needs for job discovery + parallel viz. Job *bodies* are one-line calls into portable scripts.
|
||||
|
||||
### Direction
|
||||
|
||||
- **Build out a `harmony-ci` Rust CLI crate.** Commands like `harmony-ci build composer-linux`, `harmony-ci publish operator-image`, `harmony-ci check`.
|
||||
- **Each workflow yaml job becomes** `run: cargo run -p harmony-ci -- <command>`.
|
||||
- **Scripts must run identically from a developer's laptop.**
|
||||
|
||||
### Not in v0.3
|
||||
|
||||
- Multi-day effort; doesn't block the customer.
|
||||
- Slot when bandwidth allows.
|
||||
- Opportunistically convert when touching a workflow file for other reasons.
|
||||
|
||||
---
|
||||
|
||||
## Chapter 12 — NATS callout CI hardening (#12, minimal)
|
||||
|
||||
- **`nats/callout` is a low-churn crate that works today.**
|
||||
- **Workspace-wide `cargo test`** in `.gitea/workflows/check.yml` covers the non-ignored tests.
|
||||
- **Four `#[ignore]`'d integration tests** in `nats/integration-test-callout/tests/callout_e2e.rs` need podman + a NATS image pull in the runner.
|
||||
|
||||
### Direction
|
||||
|
||||
- **Don't add CI infra in v0.3 just to run these.**
|
||||
- **When a runner with podman + image pull exists for other reasons** (e2e harness, system upgrade test matrix), add the callout integration tests to it.
|
||||
- **Until then:** keep current workspace-wide coverage.
|
||||
|
||||
---
|
||||
|
||||
## Out of scope for v0.3 (deferred deliberately)
|
||||
|
||||
| Item | Target | Why deferred |
|
||||
|-------------------------------------------------|-------------|--------------------------------------------------------------------|
|
||||
| Deployment-level auto-rollback | maybe never | Customer asked for roll-forward only. |
|
||||
| System-upgrade LVM-snapshot rollback half | v0.4 | Push to prod first; widen scope after. |
|
||||
| Live log tailing (streaming) | v0.4 | Chapter 3 ships sync `getLogs`; live tail builds on it. |
|
||||
| Deployment dependencies (cross-deploy ordering) | TBD | Init containers cover the common case; wait for customer ask. |
|
||||
| Secrets via Zitadel + OpenBao | v0.3.x | Blocked on `harmony_secret` work. |
|
||||
| Containerized agent (podman instead of systemd) | v0.4+ | Self-upgrade protocol matures first on systemd. |
|
||||
| Operator HA (active/active or active/passive) | TBD | One pod sufficient for v0.3; scale-out when fleet size demands. |
|
||||
| Multi-tenant fleet isolation tests | v0.4 | Callout permissions cover the mechanism; cross-tenant smoke later. |
|
||||
|
||||
---
|
||||
|
||||
## Open questions
|
||||
|
||||
These don't block starting v0.3 work but need resolution before the relevant chapter completes.
|
||||
|
||||
- **Q1 (Chapter 4): Binary distribution mechanism for agent upgrades.** Gitea releases vs OCI artifacts vs something else.
|
||||
- **Q2 (Chapter 2): Snapshot the aggregate to KV?** Faster recovery vs invalidation complexity.
|
||||
- **Q3 (Chapter 7): Canary test matrix?** Concretely: which Pi models, which base images, which apt sources.
|
||||
- **Q4 (Chapters 5 + 10): Sequencing of Chapter 10 vs Chapter 5.** Both benefit from smoke; right answer might be to ship Phase 1 smoke *during* Chapter 5 so upgrade gates on it. Decide when starting Chapter 5.
|
||||
- **Q5 (cross-cutting): One operator pod or active/passive?** Customer's fleet size answers this; ask before Chapter 2 starts.
|
||||
|
||||
---
|
||||
|
||||
## When v0.3 is done
|
||||
|
||||
- **All chapters 1–10 merged.**
|
||||
- **A real customer Deployment runs on a real Pi in a real basement.**
|
||||
- **The dashboard shows live status and logs.**
|
||||
- **An agent upgrade has been driven through the full protocol successfully** (and a failure path tested).
|
||||
- **A system upgrade has been driven through the full protocol on a canary.**
|
||||
|
||||
v0.4 picks up the deferred items in priority order.
|
||||
@@ -8,12 +8,18 @@
|
||||
//! checks. Companions never call into the framework or topology;
|
||||
//! they're pure functions over data the Score already exposes.
|
||||
//!
|
||||
//! Today: one companion, [`AgentObservation`], used by the e2e
|
||||
//! harness to derive the device-scoped KV watch filter from a typed
|
||||
//! [`harmony_fleet_auth::AgentConfig`]. Future companions for the
|
||||
//! operator's CRD ownership + the readiness/smoke-test contract
|
||||
//! land alongside this module.
|
||||
//! Today:
|
||||
//!
|
||||
//! - [`AgentObservation`] — derived view of a fleet agent's KV watch
|
||||
//! filter, computed from a typed `AgentConfig`. Used by the e2e
|
||||
//! harness to write desired-state keys the agent will pick up.
|
||||
//! - [`smoke`] — the smoke-test contract. `Probe` trait, `SmokeSuite`
|
||||
//! composition, `SmokeTest` companion, and the `deploy`/
|
||||
//! `deploy_with_smoke` wrappers that bind a Score's interpret to
|
||||
//! an optional smoke run. Implements ADR-023 P4 ("deploy returns
|
||||
//! only after smoke-test success") via the companion seam from P7.
|
||||
|
||||
pub mod agent;
|
||||
pub mod smoke;
|
||||
|
||||
pub use agent::AgentObservation;
|
||||
|
||||
82
fleet/harmony-fleet-deploy/src/companion/smoke/contract.rs
Normal file
82
fleet/harmony-fleet-deploy/src/companion/smoke/contract.rs
Normal file
@@ -0,0 +1,82 @@
|
||||
//! The Score → SmokeTest pairing contract.
|
||||
//!
|
||||
//! [`SmokeTest`] is the companion trait that pairs a [`Score`] with
|
||||
//! the probes that verify *that specific* Score's deployment worked.
|
||||
//! The pairing is type-level via the `Score` associated type — the
|
||||
//! compiler refuses to call
|
||||
//! [`deploy_with_smoke`](super::deploy::deploy_with_smoke) with a
|
||||
//! `FleetOperatorSmokeTest` for a `DnsScore`, because their
|
||||
//! associated types don't match. This is the same trick that makes
|
||||
//! `K8sBareTopology: PostgreSQL` an unsatisfied trait bound when you
|
||||
//! accidentally try to run PostgreSQL on a topology that doesn't
|
||||
//! advertise it — see ADR-024 §2 and JG's *For the Love of
|
||||
//! Compilers* talk.
|
||||
//!
|
||||
//! ## Why not a method on `Score`?
|
||||
//!
|
||||
//! Per ADR-023 P7, framework capabilities attach as companions, not
|
||||
//! as new methods on `Score`/`Interpret`. The base trait surface
|
||||
//! stays one method (`create_interpret`). Smoke is *something paired
|
||||
//! with* a Score, not *something on* a Score. The orthogonal
|
||||
//! benefit: a Score can be deployed without smoke, can be deployed
|
||||
//! with one of several smoke configurations (dev, prod, e2e), or
|
||||
//! can have its smoke evolved independently of its interpret logic.
|
||||
//!
|
||||
//! ## Why `assemble` returns a value, not a future-of-checks?
|
||||
//!
|
||||
//! The probes a smoke test wants to run are usually known up front
|
||||
//! from the Score's data — "the service name we just deployed,
|
||||
//! check it on the cluster's default DNS". Resolving that into a
|
||||
//! concrete `SmokeSuite` *before* the probes run means the dashboard
|
||||
//! can render the pipeline (probe names, stage count) the moment
|
||||
//! smoke starts. If it returned a `Stream<Item = ProbeReport>`, the
|
||||
//! pipeline shape wouldn't be knowable until the run completed.
|
||||
|
||||
use async_trait::async_trait;
|
||||
use thiserror::Error;
|
||||
|
||||
use harmony::score::Score;
|
||||
use harmony::topology::Topology;
|
||||
|
||||
use super::suite::SmokeSuite;
|
||||
|
||||
/// Pair a [`Score`] with the [`SmokeSuite`] that proves its deploy
|
||||
/// converged.
|
||||
///
|
||||
/// Implementors live next to the Score they pair with — the
|
||||
/// `FleetOperatorSmokeTest` ships from this crate alongside
|
||||
/// `FleetOperatorScore`, the same way [`crate::AgentObservation`]
|
||||
/// lives alongside `FleetAgentScore`. This is the companion
|
||||
/// placement rule from ADR-023 §"Extend Scores with companions".
|
||||
#[async_trait]
|
||||
pub trait SmokeTest<T: Topology>: Send + Sync {
|
||||
/// The Score this smoke test verifies. The type lock means
|
||||
/// `SM::Score = S` is enforced at every call site.
|
||||
type Score: Score<T>;
|
||||
|
|
||||
|
||||
/// Build the [`SmokeSuite`] that will run after the Score's
|
||||
/// `interpret` succeeds. Has access to the Score so probes can
|
||||
/// be parameterized on the deploy's actual inputs (service
|
||||
/// names, namespaces, ports), and to the topology so probes can
|
||||
/// pull credentials / endpoints from declared capabilities.
|
||||
async fn assemble(
|
||||
&self,
|
||||
score: &Self::Score,
|
||||
topology: &T,
|
||||
) -> Result<SmokeSuite<T>, SmokeAssemblyError>;
|
||||
}
|
||||
|
||||
/// Why a smoke test failed to *build* (before any probe runs).
|
||||
///
|
||||
/// Distinct from probe failure — assembly failure means the Score
|
||||
/// is in an unexpected shape (missing endpoint, can't resolve a
|
||||
/// reference). The dashboard shows this differently from a probe
|
||||
/// failure: it's an authoring / configuration error, not a deployed-
|
||||
/// system error.
|
||||
#[derive(Debug, Clone, Error, PartialEq, Eq)]
|
||||
pub enum SmokeAssemblyError {
|
||||
#[error("smoke test could not derive endpoint for {what}: {detail}")]
|
||||
MissingEndpoint { what: String, detail: String },
|
||||
#[error("smoke test assembly failed: {0}")]
|
||||
Other(String),
|
||||
}
|
||||
387
fleet/harmony-fleet-deploy/src/companion/smoke/deploy.rs
Normal file
387
fleet/harmony-fleet-deploy/src/companion/smoke/deploy.rs
Normal file
@@ -0,0 +1,387 @@
|
||||
//! `deploy` and `deploy_with_smoke` — the framework-glue wrappers
|
||||
//! that bind a [`Score`]'s interpret to an optional [`SmokeTest`].
|
||||
//!
|
||||
//! These are free functions, not methods on [`Score`]/[`Maestro`]
|
||||
//! /[`Interpret`]. The framework's domain directory is left
|
||||
//! untouched in this PR (ADR-023 P7 — "extend Scores with companions,
|
||||
//! not API changes"). If a later PR proves this shape, promoting
|
||||
//! `deploy_with_smoke` to a `Maestro` convenience is a one-line
|
||||
//! additive change.
|
||||
//!
|
||||
//! ## Opting in or out
|
||||
//!
|
||||
//! Two entry points, one decision at the call site:
|
||||
//!
|
||||
//! - [`deploy`] — run the Score's interpret, no smoke.
|
||||
//! - [`deploy_with_smoke`] — run the Score's interpret, then the
|
||||
//! paired [`SmokeTest`]. Returns only after the smoke run
|
||||
//! completes; a failed smoke is a deploy error per the project
|
||||
//! rule "deploy blocks on smoke-test".
|
||||
//!
|
||||
//! Callers (CLI binaries, the e2e harness, the operator's
|
||||
//! reconciler) choose which to call. There is no global config knob
|
||||
//! and no per-Score override — the choice is one function call.
|
||||
|
||||
use harmony::interpret::{InterpretError, Outcome};
|
||||
use harmony::inventory::Inventory;
|
||||
use harmony::score::Score;
|
||||
use harmony::topology::Topology;
|
||||
use thiserror::Error;
|
||||
use tracing::{info, info_span};
|
||||
|
||||
use super::contract::{SmokeAssemblyError, SmokeTest};
|
||||
use super::suite::SmokeReport;
|
||||
|
||||
/// What [`deploy`]/[`deploy_with_smoke`] return on success.
|
||||
///
|
||||
/// `interpret` is the underlying `Score::interpret` result —
|
||||
/// preserved verbatim so callers that previously consumed `Outcome`
|
||||
/// directly continue to work after switching to these wrappers.
|
||||
/// `smoke` is `None` when the no-smoke variant was used; `Some`
|
||||
/// otherwise, with `passed() == true` guaranteed (a failed smoke is
|
||||
/// a [`DeployError::SmokeFailed`]).
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct DeployOutcome {
|
||||
pub interpret: Outcome,
|
||||
pub smoke: Option<SmokeReport>,
|
||||
}
|
||||
|
||||
/// All the ways a deploy can fail. Three arms, each pointing at a
|
||||
/// concrete cause the operator can act on:
|
||||
///
|
||||
/// - `Interpret` — the Score's own work failed (helm error, kube
|
||||
/// apply error, etc). Same shape callers got before smoke existed.
|
||||
/// - `SmokeAssembly` — the Score deployed, but its `SmokeTest`
|
||||
/// couldn't even *build* a suite. Typically means the test's
|
||||
/// author is missing an endpoint reference; an authoring bug, not
|
||||
/// a runtime bug.
|
||||
/// - `SmokeFailed` — the Score deployed and the suite built, but
|
||||
/// one or more probes failed. The full [`SmokeReport`] is
|
||||
/// attached so the dashboard can render which stages passed and
|
||||
/// which didn't, instead of a single opaque "deploy failed".
|
||||
#[derive(Debug, Error)]
|
||||
pub enum DeployError {
|
||||
#[error("score interpret failed: {0}")]
|
||||
Interpret(#[from] InterpretError),
|
||||
#[error("smoke assembly failed: {0}")]
|
||||
SmokeAssembly(#[from] SmokeAssemblyError),
|
||||
#[error("smoke test failed ({} of {} probe(s) did not pass)", .report.failed().count(), .report.probes.len())]
|
||||
SmokeFailed {
|
||||
/// The Score deployed cleanly — `interpret` is preserved so
|
||||
/// callers (and the dashboard) can render the underlying
|
||||
/// success message alongside the smoke failure.
|
||||
interpret: Outcome,
|
||||
report: SmokeReport,
|
||||
},
|
||||
}
|
||||
|
||||
/// Deploy a Score without running any smoke checks.
|
||||
///
|
||||
/// Returns immediately after `score.interpret` returns. Equivalent
|
||||
/// to calling `score.interpret(...)` directly — provided here so
|
||||
/// callers can switch between smoke / no-smoke without changing
|
||||
/// the surrounding control flow.
|
||||
pub async fn deploy<T, S>(
|
||||
score: S,
|
||||
inventory: &Inventory,
|
||||
topology: &T,
|
||||
) -> Result<DeployOutcome, DeployError>
|
||||
where
|
||||
T: Topology,
|
||||
S: Score<T>,
|
||||
{
|
||||
let span = info_span!("deploy", score = score.name());
|
||||
let _g = span.enter();
|
||||
info!("deploy: starting interpret (no smoke)");
|
||||
let interpret = score.interpret(inventory, topology).await?;
|
||||
info!(status = %interpret.status, "deploy: interpret returned");
|
||||
Ok(DeployOutcome {
|
||||
interpret,
|
||||
smoke: None,
|
||||
})
|
||||
}
|
||||
|
||||
/// Deploy a Score, then run its paired SmokeTest. Block on the
|
||||
/// suite — see project rule "deploy blocks on smoke-test".
|
||||
///
|
||||
/// The type bound `SM: SmokeTest<T, Score = S>` is where the
|
||||
/// compile-time pairing lives: the `SmokeTest` you pass must
|
||||
/// declare *this* Score type as its associated `Score`. A mismatch
|
||||
/// is rejected at the call site, not at runtime. (See ADR-024 §2
|
||||
/// and JG's compile-time-feedback principle.)
|
||||
pub async fn deploy_with_smoke<T, S, SM>(
|
||||
score: S,
|
||||
smoke: SM,
|
||||
inventory: &Inventory,
|
||||
topology: &T,
|
||||
) -> Result<DeployOutcome, DeployError>
|
||||
where
|
||||
T: Topology,
|
||||
S: Score<T>,
|
||||
SM: SmokeTest<T, Score = S>,
|
||||
{
|
||||
let span = info_span!("deploy", score = score.name(), smoke = true);
|
||||
let _g = span.enter();
|
||||
|
||||
info!("deploy: starting interpret");
|
||||
let interpret = score.interpret(inventory, topology).await?;
|
||||
info!(status = %interpret.status, "deploy: interpret returned, assembling smoke");
|
||||
|
||||
let suite = smoke.assemble(&score, topology).await?;
|
||||
info!(stages = suite.len(), "deploy: smoke suite assembled");
|
||||
|
||||
let report = suite.run(topology).await;
|
||||
info!(
|
||||
passed = report.passed(),
|
||||
elapsed_ms = report.elapsed.as_millis() as u64,
|
||||
"deploy: smoke finished",
|
||||
);
|
||||
|
||||
if !report.passed() {
|
||||
return Err(DeployError::SmokeFailed { interpret, report });
|
||||
}
|
||||
|
||||
Ok(DeployOutcome {
|
||||
interpret,
|
||||
smoke: Some(report),
|
||||
})
|
||||
}
|
||||
|
||||
// ---- tests -----------------------------------------------------------------
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use crate::companion::smoke::probe::{Probe, ProbeAttempt, ProbeName, RetryPolicy};
|
||||
use crate::companion::smoke::suite::SmokeSuite;
|
||||
use async_trait::async_trait;
|
||||
use harmony::data::Version;
|
||||
use harmony::interpret::{Interpret, InterpretName, InterpretStatus};
|
||||
use harmony::topology::PreparationOutcome;
|
||||
use harmony_types::id::Id;
|
||||
use serde::Serialize;
|
||||
use std::sync::Mutex;
|
||||
|
||||
// -- test fixtures: a NoopTopology + minimal Score + minimal Probe --------
|
||||
|
||||
#[derive(Debug)]
|
||||
struct NoopTopology;
|
||||
|
||||
#[async_trait]
|
||||
impl Topology for NoopTopology {
|
||||
fn name(&self) -> &str {
|
||||
"noop"
|
||||
}
|
||||
async fn ensure_ready(
|
||||
&self,
|
||||
) -> Result<PreparationOutcome, harmony::topology::PreparationError> {
|
||||
Ok(PreparationOutcome::Noop)
|
||||
}
|
||||
}
|
||||
|
||||
/// Minimal Score whose interpret can be scripted to succeed or
|
||||
/// fail. Demonstrates that `deploy_with_smoke` short-circuits
|
||||
/// correctly when interpret fails.
|
||||
#[derive(Debug, Clone, Serialize)]
|
||||
struct TrivialScore {
|
||||
label: String,
|
||||
should_fail: bool,
|
||||
}
|
||||
|
||||
impl<T: Topology> Score<T> for TrivialScore {
|
||||
fn create_interpret(&self) -> Box<dyn Interpret<T>> {
|
||||
Box::new(TrivialInterpret {
|
||||
should_fail: self.should_fail,
|
||||
label: self.label.clone(),
|
||||
})
|
||||
}
|
||||
fn name(&self) -> String {
|
||||
format!("TrivialScore({})", self.label)
|
||||
}
|
||||
}
|
||||
|
||||
#[derive(Debug)]
|
||||
struct TrivialInterpret {
|
||||
should_fail: bool,
|
||||
label: String,
|
||||
}
|
||||
|
||||
#[async_trait]
|
||||
impl<T: Topology> Interpret<T> for TrivialInterpret {
|
||||
async fn execute(
|
||||
&self,
|
||||
_inventory: &Inventory,
|
||||
_topology: &T,
|
||||
) -> Result<Outcome, InterpretError> {
|
||||
if self.should_fail {
|
||||
Err(InterpretError::new(format!(
|
||||
"scripted failure: {}",
|
||||
self.label
|
||||
)))
|
||||
} else {
|
||||
Ok(Outcome::success(format!("deployed {}", self.label)))
|
||||
}
|
||||
}
|
||||
fn get_name(&self) -> InterpretName {
|
||||
InterpretName::Custom("TrivialInterpret")
|
||||
}
|
||||
fn get_version(&self) -> Version {
|
||||
Version::from("0.0.0").unwrap()
|
||||
}
|
||||
fn get_status(&self) -> InterpretStatus {
|
||||
InterpretStatus::QUEUED
|
||||
}
|
||||
fn get_children(&self) -> Vec<Id> {
|
||||
vec![]
|
||||
}
|
||||
}
|
||||
|
||||
/// A probe whose every attempt is hardcoded. Lets us script
|
||||
/// "this stage passes" or "this stage rejects" in tests.
|
||||
#[derive(Debug)]
|
||||
struct Always {
|
||||
name: ProbeName,
|
||||
attempt: ProbeAttempt,
|
||||
}
|
||||
impl Always {
|
||||
fn new(name: &str, attempt: ProbeAttempt) -> Self {
|
||||
Self {
|
||||
name: ProbeName::try_new(name).unwrap(),
|
||||
attempt,
|
||||
}
|
||||
}
|
||||
}
|
||||
#[async_trait]
|
||||
impl<T: Topology> Probe<T> for Always {
|
||||
fn name(&self) -> &ProbeName {
|
||||
&self.name
|
||||
}
|
||||
async fn check(&self, _t: &T) -> ProbeAttempt {
|
||||
self.attempt.clone()
|
||||
}
|
||||
}
|
||||
|
||||
/// The shape we expect every fleet smoke test to follow: it
|
||||
/// owns no state beyond what the Score gives it.
|
||||
#[derive(Debug)]
|
||||
struct TrivialSmoke {
|
||||
outcomes: Mutex<Vec<ProbeAttempt>>,
|
||||
}
|
||||
impl TrivialSmoke {
|
||||
fn new(outcomes: Vec<ProbeAttempt>) -> Self {
|
||||
Self {
|
||||
outcomes: Mutex::new(outcomes),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#[async_trait]
|
||||
impl<T: Topology> SmokeTest<T> for TrivialSmoke {
|
||||
type Score = TrivialScore;
|
||||
async fn assemble(
|
||||
&self,
|
||||
score: &Self::Score,
|
||||
_topology: &T,
|
||||
) -> Result<SmokeSuite<T>, SmokeAssemblyError> {
|
||||
let mut suite = SmokeSuite::new();
|
||||
for (i, attempt) in self.outcomes.lock().unwrap().iter().enumerate() {
|
||||
let name = format!("{}-probe-{i}", score.label);
|
||||
suite = suite.stage(Always::new(&name, attempt.clone()), RetryPolicy::once());
|
||||
}
|
||||
Ok(suite)
|
||||
}
|
||||
}
|
||||
|
||||
// -- the actual contract tests --------------------------------------------
|
||||
|
||||
fn inv() -> Inventory {
|
||||
Inventory::empty()
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn deploy_without_smoke_returns_outcome_and_no_report() {
|
||||
let s = TrivialScore {
|
||||
label: "alpha".to_string(),
|
||||
should_fail: false,
|
||||
};
|
||||
let r = deploy(s, &inv(), &NoopTopology).await.expect("deploy ok");
|
||||
assert_eq!(r.interpret.status, InterpretStatus::SUCCESS);
|
||||
assert!(
|
||||
r.smoke.is_none(),
|
||||
"no-smoke variant must not synthesize a report"
|
||||
);
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn interpret_failure_short_circuits_before_smoke_assembly() {
|
||||
// If interpret fails, we MUST NOT call `smoke.assemble` —
|
||||
// there is nothing deployed to verify, and running smoke
|
||||
// would produce confusing cascade failures in the report.
|
||||
let s = TrivialScore {
|
||||
label: "beta".to_string(),
|
||||
should_fail: true,
|
||||
};
|
||||
// Use a smoke that would panic if assembled — proves we
|
||||
// never got there.
|
||||
struct PanicSmoke;
|
||||
#[async_trait]
|
||||
impl<T: Topology> SmokeTest<T> for PanicSmoke {
|
||||
type Score = TrivialScore;
|
||||
async fn assemble(
|
||||
&self,
|
||||
_: &Self::Score,
|
||||
_: &T,
|
||||
) -> Result<SmokeSuite<T>, SmokeAssemblyError> {
|
||||
panic!("assemble must not be called when interpret fails");
|
||||
}
|
||||
}
|
||||
match deploy_with_smoke(s, PanicSmoke, &inv(), &NoopTopology).await {
|
||||
Err(DeployError::Interpret(e)) => {
|
||||
assert!(format!("{e}").contains("scripted failure"));
|
||||
}
|
||||
other => panic!("expected Interpret error, got {other:?}"),
|
||||
}
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn all_probes_pass_yields_deployoutcome_with_report() {
|
||||
let s = TrivialScore {
|
||||
label: "gamma".to_string(),
|
||||
should_fail: false,
|
||||
};
|
||||
let smoke = TrivialSmoke::new(vec![
|
||||
ProbeAttempt::Ok(Some("first stage ok".to_string())),
|
||||
ProbeAttempt::Ok(None),
|
||||
]);
|
||||
let r = deploy_with_smoke(s, smoke, &inv(), &NoopTopology)
|
||||
.await
|
||||
.expect("deploy + smoke ok");
|
||||
let report = r.smoke.expect("smoke report present when smoke runs");
|
||||
assert!(report.passed());
|
||||
assert_eq!(report.probes.len(), 2);
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn probe_failure_blocks_deploy() {
|
||||
let s = TrivialScore {
|
||||
label: "delta".to_string(),
|
||||
should_fail: false,
|
||||
};
|
||||
let smoke = TrivialSmoke::new(vec![
|
||||
ProbeAttempt::Ok(None),
|
||||
ProbeAttempt::Fatal("definitely broken".to_string()),
|
||||
]);
|
||||
match deploy_with_smoke(s, smoke, &inv(), &NoopTopology).await {
|
||||
Err(DeployError::SmokeFailed { interpret, report }) => {
|
||||
// The Score itself succeeded — that's preserved, so
|
||||
// the dashboard can show "deploy converged, smoke
|
||||
// disagrees".
|
||||
assert_eq!(interpret.status, InterpretStatus::SUCCESS);
|
||||
assert!(!report.passed());
|
||||
assert_eq!(report.failed().count(), 1);
|
||||
assert_eq!(report.probes.last().unwrap().name.as_str(), "delta-probe-1");
|
||||
}
|
||||
other => panic!("expected SmokeFailed, got {other:?}"),
|
||||
}
|
||||
}
|
||||
}
|
||||
62
fleet/harmony-fleet-deploy/src/companion/smoke/mod.rs
Normal file
62
fleet/harmony-fleet-deploy/src/companion/smoke/mod.rs
Normal file
@@ -0,0 +1,62 @@
|
||||
//! Smoke-test contract — the second Score companion (ADR-023 P7,
|
||||
//! ADR-023 P4 "deploy returns only after smoke-test success").
|
||||
//!
|
||||
//! ## The big picture
|
||||
//!
|
||||
//! ```text
|
||||
//! ┌────────────┐ pair ┌──────────────┐ assemble ┌────────────┐
|
||||
//! │ *Score │◀─type─────│ SmokeTest │─────────────▶│ SmokeSuite │
|
||||
//! └────────────┘ locked └──────────────┘ └────────────┘
|
||||
//! ▲ │
|
||||
//! │ │
|
||||
//! deploy_with_smoke ─── interpret ─── then ── run probes ──▶ SmokeReport
|
||||
//! ```
|
||||
//!
|
||||
//! - [`Probe`] is the single-attempt unit, classifying its
|
||||
//! observation as [`Ok`/`Retry`/`Fatal`](probe::ProbeAttempt). It's
|
||||
//! also the stage the dashboard renders.
|
||||
//! - [`SmokeSuite`] is an ordered sequence of probes with per-stage
|
||||
//! [`RetryPolicy`]. Sequential by design; parallel stages deferred.
|
||||
//! - [`SmokeTest`] pairs a [`Score`](harmony::score::Score) with a
|
||||
//! suite by associated type — a `SmokeTest<Score = FleetOperatorScore>`
|
||||
//! cannot be passed to `deploy_with_smoke(DnsScore, …)` because the
|
||||
//! compiler refuses the mismatch.
|
||||
//! - [`deploy`] / [`deploy_with_smoke`] are free functions that
|
||||
//! bind a Score's interpret to an optional SmokeTest. Returns
|
||||
//! only after the smoke run completes — see
|
||||
//! [`DeployError::SmokeFailed`] for the failure shape.
|
||||
//!
|
||||
//! ## Where this lives
|
||||
//!
|
||||
//! Inside `harmony-fleet-deploy` for Phase 0, alongside the
|
||||
//! existing [`AgentObservation`](super::AgentObservation) companion.
|
||||
//! When a second consumer (PostgreSQL deploy, monitoring stack)
|
||||
//! adopts this contract, the module promotes to a top-level
|
||||
//! `harmony-smoke` crate with no API churn — the move is mechanical
|
||||
//! because nothing in here depends on fleet types.
|
||||
//!
|
||||
//! ## What this does NOT do
|
||||
//!
|
||||
//! - **No new `HarmonyEvent` variants in Phase 0.** Stage progress
|
||||
//! is emitted via `tracing::info!` events inside `info_span!`s.
|
||||
//! A later PR can add `HarmonyEvent::SmokeStage{Started,Finished}`
|
||||
//! if the dashboard wants the typed seam.
|
||||
//! - **No `Score`/`Interpret` trait edits.** Per ADR-023 P7. The
|
||||
//! smoke seam is purely companion-shaped.
|
||||
//! - **No CLI flag yet.** Phase 1 wires `harmony-fleet-deploy <cmd>
|
||||
//! --no-smoke` once a real `FleetOperatorSmokeTest` exists.
|
||||
|
||||
pub mod contract;
|
||||
pub mod deploy;
|
||||
pub mod probe;
|
||||
pub mod probes;
|
||||
pub mod suite;
|
||||
|
||||
pub use contract::{SmokeAssemblyError, SmokeTest};
|
||||
pub use deploy::{DeployError, DeployOutcome, deploy, deploy_with_smoke};
|
||||
pub use probe::{
|
||||
InvalidProbeName, Probe, ProbeAttempt, ProbeFailure, ProbeName, ProbeOutcome, RetryPolicy,
|
||||
run_probe,
|
||||
};
|
||||
pub use probes::TcpReachable;
|
||||
pub use suite::{ProbeReport, SmokeReport, SmokeSuite};
|
||||
515
fleet/harmony-fleet-deploy/src/companion/smoke/probe.rs
Normal file
515
fleet/harmony-fleet-deploy/src/companion/smoke/probe.rs
Normal file
@@ -0,0 +1,515 @@
|
||||
//! Probe contract — the single-attempt unit of a smoke run.
|
||||
//!
|
||||
//! A [`Probe`] is what gets visualized as one stage of the smoke
|
||||
//! pipeline. Each implementation answers one question against a
|
||||
//! deployed system: "is this port reachable", "did this pod become
|
||||
//! ready", "is this KV key set to the value we expect".
|
||||
//!
|
||||
//! The retry loop is **not** a probe responsibility. A probe's
|
||||
//! [`check`](Probe::check) returns one of three classifications via
|
||||
//! [`ProbeAttempt`] — `Ok`, `Retry`, or `Fatal` — and [`run_probe`]
|
||||
//! drives the polling timer + timeout budget around it. This keeps
|
||||
//! probes single-responsibility and makes their unit tests
|
||||
//! deterministic.
|
||||
//!
|
||||
//! Each public type here is cardinality-matched to the concept it
|
||||
//! represents (see ADR-024 §2 and JG's *For the Love of Compilers*
|
||||
//! talk). `ProbeName` is a validated newtype, not `String`, because
|
||||
//! "an empty probe name" or "a probe name with a newline in it"
|
||||
//! cannot mean anything useful and would corrupt the dashboard
|
||||
//! rendering. `ProbeAttempt` is a three-arm sum, not `Result<(), Error>`,
|
||||
//! because "not ready yet" and "definitively no" must drive different
|
||||
//! orchestration paths.
|
||||
|
||||
use std::fmt;
|
||||
use std::time::{Duration, Instant};
|
||||
|
||||
use async_trait::async_trait;
|
||||
use thiserror::Error;
|
||||
use tracing::{debug, warn};
|
||||
|
||||
use harmony::topology::Topology;
|
||||
|
||||
/// One attempted check by a probe.
|
||||
///
|
||||
/// The probe classifies its own observation rather than letting the
|
||||
/// orchestrator guess: a 5xx response is fundamentally different from
|
||||
/// a connection refused, even if both look like "failure" to a naive
|
||||
/// caller. The orchestrator ([`run_probe`]) uses this classification
|
||||
/// to decide whether to keep polling or bail out.
|
||||
#[derive(Debug, Clone, PartialEq, Eq)]
|
||||
pub enum ProbeAttempt {
|
||||
/// The probe's invariant holds *right now*. Optional detail is
|
||||
/// surfaced in the smoke report (e.g. "HTTP 200 in 12 ms",
|
||||
/// "found 3 ready pods").
|
||||
Ok(Option<String>),
|
||||
/// The probe didn't observe what it's looking for yet, but the
|
||||
/// state is consistent with eventual success. Keep polling until
|
||||
/// the timeout. The string is the most recent observation, kept
|
||||
/// so that a final timeout can quote the last thing we saw.
|
||||
Retry(String),
|
||||
/// The probe got a definitive negative answer — no amount of
|
||||
/// further polling will change it. Skip the rest of the budget
|
||||
/// and report failure now. Example: HTTP 404 from a URL that
|
||||
/// must exist after deploy, or a `Service` with the wrong
|
||||
/// selector.
|
||||
Fatal(String),
|
||||
}
|
||||
|
||||
/// Result of a full probe run after [`run_probe`] has consumed the
|
||||
/// retry budget (or short-circuited).
|
||||
///
|
||||
/// This is the per-probe shape that lands in
|
||||
/// [`crate::companion::smoke::SmokeReport`]. The orchestrator wraps
|
||||
/// it with the [`ProbeName`] and the elapsed time so the dashboard
|
||||
/// can render each stage independently.
|
||||
#[derive(Debug, Clone, PartialEq, Eq)]
|
||||
pub enum ProbeOutcome {
|
||||
/// At least one attempt returned [`ProbeAttempt::Ok`] within the
|
||||
/// retry budget.
|
||||
Pass {
|
||||
elapsed: Duration,
|
||||
attempts: u32,
|
||||
detail: Option<String>,
|
||||
},
|
||||
/// Either we exhausted the budget with only `Retry`s (timeout)
|
||||
/// or a single `Fatal` short-circuited us (rejected).
|
||||
Fail {
|
||||
elapsed: Duration,
|
||||
attempts: u32,
|
||||
failure: ProbeFailure,
|
||||
},
|
||||
}
|
||||
|
||||
impl ProbeOutcome {
|
||||
pub fn passed(&self) -> bool {
|
||||
matches!(self, Self::Pass { .. })
|
||||
}
|
||||
}
|
||||
|
||||
/// Why a probe failed. Two arms, deliberately — they map to two
|
||||
/// different operator actions: "I waited and nothing happened" needs
|
||||
/// a longer budget or a stuck-component investigation; "something
|
||||
/// gave me a hard no" needs a configuration fix.
|
||||
#[derive(Debug, Clone, PartialEq, Eq, Error)]
|
||||
pub enum ProbeFailure {
|
||||
#[error("timed out after {budget:?} (last observation: {last_observation})")]
|
||||
Timeout {
|
||||
budget: Duration,
|
||||
last_observation: String,
|
||||
},
|
||||
#[error("rejected: {detail}")]
|
||||
Rejected { detail: String },
|
||||
}
|
||||
|
||||
/// Polling parameters for one probe.
|
||||
///
|
||||
/// `interval` is the wait between attempts. `timeout` is the total
|
||||
/// wall-clock budget from the first attempt to the last; the loop
|
||||
/// will not start a new attempt past this budget but will let an
|
||||
/// in-flight one finish.
|
||||
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
|
||||
pub struct RetryPolicy {
|
||||
interval: Duration,
|
||||
timeout: Duration,
|
||||
}
|
||||
|
||||
impl RetryPolicy {
|
||||
/// Standard polling shape: try, wait `interval`, try again, until
|
||||
/// either an `Ok`/`Fatal` happens or `timeout` elapses.
|
||||
///
|
||||
/// Panics if `interval` is zero — a zero interval combined with
|
||||
/// any non-trivial check would spin the executor and produce no
|
||||
/// useful retry semantics. Express "run once" via [`Self::once`]
|
||||
/// instead.
|
||||
pub fn polling(interval: Duration, timeout: Duration) -> Self {
|
||||
assert!(
|
||||
!interval.is_zero(),
|
||||
"RetryPolicy::polling requires a non-zero interval; use RetryPolicy::once for a single attempt",
|
||||
);
|
||||
Self { interval, timeout }
|
||||
}
|
||||
|
||||
/// Single-shot: one attempt, no retries. Useful for probes whose
|
||||
/// success criterion is binary the moment we check (e.g. "is
|
||||
/// this string equal to that string").
|
||||
pub fn once() -> Self {
|
||||
Self {
|
||||
interval: Duration::from_millis(1),
|
||||
timeout: Duration::ZERO,
|
||||
}
|
||||
}
|
||||
|
||||
pub fn interval(&self) -> Duration {
|
||||
self.interval
|
||||
}
|
||||
|
||||
pub fn timeout(&self) -> Duration {
|
||||
self.timeout
|
||||
}
|
||||
}
|
||||
|
||||
/// A validated probe name.
|
||||
///
|
||||
/// Probe names appear in `tracing` events, error messages, the smoke
|
||||
/// report, and the eventual dashboard. Constraining them at
|
||||
/// construction means the dashboard never has to defend against
|
||||
/// surprising bytes (control chars, newlines, etc).
|
||||
#[derive(Debug, Clone, Hash, PartialEq, Eq, PartialOrd, Ord)]
|
||||
pub struct ProbeName(String);
|
||||
|
||||
#[derive(Debug, Error, PartialEq, Eq)]
|
||||
pub enum InvalidProbeName {
|
||||
#[error("probe name must not be empty")]
|
||||
Empty,
|
||||
#[error("probe name must not exceed 128 bytes (got {0})")]
|
||||
TooLong(usize),
|
||||
#[error("probe name must not contain control characters")]
|
||||
ControlChar,
|
||||
}
|
||||
|
||||
impl ProbeName {
|
||||
pub fn try_new(s: impl Into<String>) -> Result<Self, InvalidProbeName> {
|
||||
let s = s.into();
|
||||
if s.is_empty() {
|
||||
return Err(InvalidProbeName::Empty);
|
||||
}
|
||||
if s.len() > 128 {
|
||||
return Err(InvalidProbeName::TooLong(s.len()));
|
||||
}
|
||||
if s.chars().any(|c| c.is_control()) {
|
||||
return Err(InvalidProbeName::ControlChar);
|
||||
}
|
||||
Ok(Self(s))
|
||||
}
|
||||
|
||||
pub fn as_str(&self) -> &str {
|
||||
&self.0
|
||||
}
|
||||
}
|
||||
|
||||
impl fmt::Display for ProbeName {
|
||||
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
|
||||
f.write_str(&self.0)
|
||||
}
|
||||
}
|
||||
|
||||
/// One pipeline stage of a [`SmokeSuite`](super::SmokeSuite).
|
||||
///
|
||||
/// Object-safe so a suite can hold a heterogeneous `Vec<Box<dyn
|
||||
/// Probe<T>>>`. The associated retry behavior is supplied separately
|
||||
/// by the suite (via [`RetryPolicy`]) so that the same probe type
|
||||
/// can be reused with different budgets in different suites.
|
||||
#[async_trait]
|
||||
pub trait Probe<T: Topology>: Send + Sync + std::fmt::Debug {
|
||||
fn name(&self) -> &ProbeName;
|
||||
async fn check(&self, topology: &T) -> ProbeAttempt;
|
||||
}
|
||||
|
||||
/// Drive a probe through its retry policy and return the consolidated
|
||||
/// outcome.
|
||||
///
|
||||
/// Behavior:
|
||||
///
|
||||
/// - Every `interval` we run `probe.check(...)`. A successful attempt
|
||||
/// wins immediately; a `Fatal` short-circuits to `Rejected`; a
|
||||
/// `Retry` records its observation and we wait.
|
||||
/// - If `timeout` is `Duration::ZERO` we run exactly one attempt
|
||||
/// (the "once" shape).
|
||||
/// - Otherwise we keep attempting until the elapsed wall time
|
||||
/// exceeds `timeout`. The last `Retry` observation becomes the
|
||||
/// `Timeout::last_observation` field so the failure message tells
|
||||
/// the operator what we kept seeing.
|
||||
///
|
||||
/// The function emits a `tracing` event per attempt at `debug` level
|
||||
/// and one at `warn` for the final failure (so a Phase-1 dashboard
|
||||
/// can subscribe via a `tracing` layer without us needing to add a
|
||||
/// new `HarmonyEvent` variant yet).
|
||||
pub async fn run_probe<T, P>(probe: &P, topology: &T, policy: RetryPolicy) -> ProbeOutcome
|
||||
where
|
||||
T: Topology,
|
||||
P: Probe<T> + ?Sized,
|
||||
{
|
||||
let start = Instant::now();
|
||||
let mut attempts: u32 = 0;
|
||||
|
||||
loop {
|
||||
attempts += 1;
|
||||
// Each iteration produces either an early return (Ok / Fatal)
|
||||
// or the most recent Retry observation. Threading that
|
||||
// observation back into the next iteration only happens
|
||||
// implicitly via the timeout check below, so there's no
|
||||
// long-lived `last_observation` to lint over.
|
||||
let last_observation = match probe.check(topology).await {
|
||||
ProbeAttempt::Ok(detail) => {
|
||||
debug!(
|
||||
probe = %probe.name(),
|
||||
attempts,
|
||||
elapsed_ms = start.elapsed().as_millis() as u64,
|
||||
"probe passed",
|
||||
);
|
||||
return ProbeOutcome::Pass {
|
||||
elapsed: start.elapsed(),
|
||||
attempts,
|
||||
detail,
|
||||
};
|
||||
}
|
||||
ProbeAttempt::Fatal(detail) => {
|
||||
warn!(
|
||||
probe = %probe.name(),
|
||||
attempts,
|
||||
%detail,
|
||||
"probe rejected (fatal)",
|
||||
);
|
||||
return ProbeOutcome::Fail {
|
||||
elapsed: start.elapsed(),
|
||||
attempts,
|
||||
failure: ProbeFailure::Rejected { detail },
|
||||
};
|
||||
}
|
||||
ProbeAttempt::Retry(obs) => {
|
||||
debug!(
|
||||
probe = %probe.name(),
|
||||
attempts,
|
||||
observation = %obs,
|
||||
"probe retry",
|
||||
);
|
||||
obs
|
||||
}
|
||||
};
|
||||
|
||||
if start.elapsed() >= policy.timeout {
|
||||
warn!(
|
||||
probe = %probe.name(),
|
||||
attempts,
|
||||
budget_ms = policy.timeout.as_millis() as u64,
|
||||
last_observation = %last_observation,
|
||||
"probe timed out",
|
||||
);
|
||||
return ProbeOutcome::Fail {
|
||||
elapsed: start.elapsed(),
|
||||
attempts,
|
||||
failure: ProbeFailure::Timeout {
|
||||
budget: policy.timeout,
|
||||
last_observation,
|
||||
},
|
||||
};
|
||||
}
|
||||
|
||||
tokio::time::sleep(policy.interval).await;
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use harmony::topology::PreparationOutcome;
|
||||
use std::sync::Mutex;
|
||||
use std::sync::atomic::{AtomicU32, Ordering};
|
||||
|
||||
// -- ProbeName -----------------------------------------------------------
|
||||
|
||||
#[test]
|
||||
fn probe_name_accepts_normal_strings() {
|
||||
assert!(ProbeName::try_new("nats-reachable").is_ok());
|
||||
assert!(ProbeName::try_new("HTTP / health").is_ok());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn probe_name_rejects_empty() {
|
||||
assert_eq!(ProbeName::try_new(""), Err(InvalidProbeName::Empty));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn probe_name_rejects_control_chars() {
|
||||
// Dashboard renders these as garbage — fail at construction.
|
||||
assert_eq!(
|
||||
ProbeName::try_new("bad\nname"),
|
||||
Err(InvalidProbeName::ControlChar)
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn probe_name_rejects_overlong() {
|
||||
let long = "x".repeat(129);
|
||||
assert_eq!(
|
||||
ProbeName::try_new(long.clone()),
|
||||
Err(InvalidProbeName::TooLong(129))
|
||||
);
|
||||
}
|
||||
|
||||
// -- RetryPolicy ---------------------------------------------------------
|
||||
|
||||
#[test]
|
||||
fn retry_policy_once_is_single_attempt() {
|
||||
let p = RetryPolicy::once();
|
||||
assert_eq!(p.timeout(), Duration::ZERO);
|
||||
}
|
||||
|
||||
#[test]
|
||||
#[should_panic(expected = "non-zero interval")]
|
||||
fn retry_policy_polling_rejects_zero_interval() {
|
||||
// Zero interval + non-trivial check is a spin loop. The
|
||||
// type system can't catch this, so a runtime assert at
|
||||
// construction is the next best thing — fail loud, fail
|
||||
// early, not on the operator's deploy log.
|
||||
let _ = RetryPolicy::polling(Duration::ZERO, Duration::from_secs(5));
|
||||
}
|
||||
|
||||
// -- run_probe orchestration --------------------------------------------
|
||||
|
||||
/// Test probe that returns a scripted sequence of attempts, then
|
||||
/// loops forever on the last one. Lets us simulate
|
||||
/// "retry-then-ok" without real network I/O.
|
||||
#[derive(Debug)]
|
||||
struct ScriptedProbe {
|
||||
name: ProbeName,
|
||||
script: Mutex<Vec<ProbeAttempt>>,
|
||||
calls: AtomicU32,
|
||||
}
|
||||
|
||||
impl ScriptedProbe {
|
||||
fn new(name: &str, script: Vec<ProbeAttempt>) -> Self {
|
||||
Self {
|
||||
name: ProbeName::try_new(name).unwrap(),
|
||||
script: Mutex::new(script),
|
||||
calls: AtomicU32::new(0),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#[async_trait]
|
||||
impl<T: Topology> Probe<T> for ScriptedProbe {
|
||||
fn name(&self) -> &ProbeName {
|
||||
&self.name
|
||||
}
|
||||
async fn check(&self, _t: &T) -> ProbeAttempt {
|
||||
self.calls.fetch_add(1, Ordering::SeqCst);
|
||||
let mut s = self.script.lock().unwrap();
|
||||
if s.len() == 1 {
|
||||
s[0].clone()
|
||||
} else {
|
||||
s.remove(0)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#[derive(Debug)]
|
||||
struct NoopTopology;
|
||||
|
||||
#[async_trait]
|
||||
impl Topology for NoopTopology {
|
||||
fn name(&self) -> &str {
|
||||
"noop"
|
||||
}
|
||||
async fn ensure_ready(
|
||||
&self,
|
||||
) -> Result<PreparationOutcome, harmony::topology::PreparationError> {
|
||||
Ok(PreparationOutcome::Noop)
|
||||
}
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn ok_on_first_attempt_yields_pass() {
|
||||
let probe = ScriptedProbe::new("instant", vec![ProbeAttempt::Ok(Some("hi".to_string()))]);
|
||||
let outcome = run_probe(
|
||||
&probe,
|
||||
&NoopTopology,
|
||||
RetryPolicy::polling(Duration::from_millis(1), Duration::from_millis(50)),
|
||||
)
|
||||
.await;
|
||||
match outcome {
|
||||
ProbeOutcome::Pass {
|
||||
attempts, detail, ..
|
||||
} => {
|
||||
assert_eq!(attempts, 1);
|
||||
assert_eq!(detail.as_deref(), Some("hi"));
|
||||
}
|
||||
other => panic!("expected Pass, got {other:?}"),
|
||||
}
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn retries_until_ok_within_budget() {
|
||||
let probe = ScriptedProbe::new(
|
||||
"eventual",
|
||||
vec![
|
||||
ProbeAttempt::Retry("not yet (1)".to_string()),
|
||||
ProbeAttempt::Retry("not yet (2)".to_string()),
|
||||
ProbeAttempt::Ok(None),
|
||||
],
|
||||
);
|
||||
let outcome = run_probe(
|
||||
&probe,
|
||||
&NoopTopology,
|
||||
RetryPolicy::polling(Duration::from_millis(2), Duration::from_millis(500)),
|
||||
)
|
||||
.await;
|
||||
match outcome {
|
||||
ProbeOutcome::Pass { attempts, .. } => assert_eq!(attempts, 3),
|
||||
other => panic!("expected Pass after retries, got {other:?}"),
|
||||
}
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn fatal_short_circuits_without_consuming_full_budget() {
|
||||
let probe = ScriptedProbe::new(
|
||||
"definite-no",
|
||||
vec![ProbeAttempt::Fatal("wrong selector".to_string())],
|
||||
);
|
||||
let start = Instant::now();
|
||||
let outcome = run_probe(
|
||||
&probe,
|
||||
&NoopTopology,
|
||||
RetryPolicy::polling(Duration::from_millis(5), Duration::from_secs(10)),
|
||||
)
|
||||
.await;
|
||||
// The whole point of Fatal is that we don't burn the budget.
|
||||
// Generous upper bound — CI variance, not a real check.
|
||||
assert!(
|
||||
start.elapsed() < Duration::from_secs(1),
|
||||
"Fatal should return immediately, took {:?}",
|
||||
start.elapsed()
|
||||
);
|
||||
match outcome {
|
||||
ProbeOutcome::Fail {
|
||||
failure: ProbeFailure::Rejected { detail },
|
||||
..
|
||||
} => assert_eq!(detail, "wrong selector"),
|
||||
other => panic!("expected Rejected, got {other:?}"),
|
||||
}
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn always_retry_yields_timeout_with_last_observation() {
|
||||
let probe = ScriptedProbe::new(
|
||||
"never-ready",
|
||||
vec![ProbeAttempt::Retry("still pending".to_string())],
|
||||
);
|
||||
let outcome = run_probe(
|
||||
&probe,
|
||||
&NoopTopology,
|
||||
RetryPolicy::polling(Duration::from_millis(5), Duration::from_millis(40)),
|
||||
)
|
||||
.await;
|
||||
match outcome {
|
||||
ProbeOutcome::Fail {
|
||||
failure:
|
||||
ProbeFailure::Timeout {
|
||||
budget,
|
||||
last_observation,
|
||||
},
|
||||
attempts,
|
||||
..
|
||||
} => {
|
||||
assert_eq!(budget, Duration::from_millis(40));
|
||||
assert_eq!(last_observation, "still pending");
|
||||
assert!(
|
||||
attempts >= 2,
|
||||
"expected at least two attempts, got {attempts}"
|
||||
);
|
||||
}
|
||||
other => panic!("expected Timeout, got {other:?}"),
|
||||
}
|
||||
}
|
||||
}
|
||||
15
fleet/harmony-fleet-deploy/src/companion/smoke/probes/mod.rs
Normal file
15
fleet/harmony-fleet-deploy/src/companion/smoke/probes/mod.rs
Normal file
@@ -0,0 +1,15 @@
|
||||
//! First-party [`Probe`](super::Probe) implementations.
|
||||
//!
|
||||
//! Phase 0 ships exactly one — [`TcpReachable`] — to validate the
|
||||
//! trait shape against real network I/O. Phase 1 adds HTTP, K8s
|
||||
//! pod-ready, NATS KV. Each future probe is one self-contained
|
||||
//! file under this module.
|
||||
//!
|
||||
//! Probes that need their own crate dependencies (e.g. a future
|
||||
//! `HelmReleaseHealthy` that talks to the helm client) can be
|
||||
//! gated behind a Cargo feature here without touching the core
|
||||
//! smoke types.
|
||||
|
||||
pub mod tcp;
|
||||
|
||||
pub use tcp::TcpReachable;
|
||||
225
fleet/harmony-fleet-deploy/src/companion/smoke/probes/tcp.rs
Normal file
225
fleet/harmony-fleet-deploy/src/companion/smoke/probes/tcp.rs
Normal file
@@ -0,0 +1,225 @@
|
||||
//! `TcpReachable` — the smallest useful probe: open a TCP connection
|
||||
//! to `host:port` within a timeout.
|
||||
//!
|
||||
//! Used as the first stage of practically every fleet smoke suite:
|
||||
//! "before checking the operator's `/healthz`, prove the cluster IP
|
||||
//! even routes". A connect-refused is a `Retry` (we may have caught
|
||||
//! the service mid-startup); a DNS or address-syntax error is
|
||||
//! `Fatal` (no amount of polling will fix it).
|
||||
//!
|
||||
//! Implemented with `tokio::net::TcpStream`. No new crate
|
||||
//! dependencies — `tokio` is already pulled in with the `full`
|
||||
//! feature by this crate's `Cargo.toml`.
|
||||
|
||||
use std::io;
|
||||
use std::time::Duration;
|
||||
|
||||
use async_trait::async_trait;
|
||||
use tokio::net::TcpStream;
|
||||
|
||||
use harmony::topology::Topology;
|
||||
|
||||
use crate::companion::smoke::probe::{Probe, ProbeAttempt, ProbeName};
|
||||
|
||||
/// Probe that connects to `host:port` and immediately drops the
|
||||
/// socket. Pass on `connect succeeded`.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct TcpReachable {
|
||||
name: ProbeName,
|
||||
address: String,
|
||||
connect_timeout: Duration,
|
||||
}
|
||||
|
||||
impl TcpReachable {
|
||||
/// Default connect timeout is 1s — short enough that a polling
|
||||
/// retry sees several attempts; long enough that a slow host
|
||||
/// doesn't false-fail. Override with
|
||||
/// [`Self::with_connect_timeout`].
|
||||
pub const DEFAULT_CONNECT_TIMEOUT: Duration = Duration::from_secs(1);
|
||||
|
||||
pub fn new(name: ProbeName, address: impl Into<String>) -> Self {
|
||||
Self {
|
||||
name,
|
||||
address: address.into(),
|
||||
connect_timeout: Self::DEFAULT_CONNECT_TIMEOUT,
|
||||
}
|
||||
}
|
||||
|
||||
pub fn with_connect_timeout(mut self, t: Duration) -> Self {
|
||||
self.connect_timeout = t;
|
||||
self
|
||||
}
|
||||
|
||||
pub fn address(&self) -> &str {
|
||||
&self.address
|
||||
}
|
||||
}
|
||||
|
||||
#[async_trait]
|
||||
impl<T: Topology> Probe<T> for TcpReachable {
|
||||
fn name(&self) -> &ProbeName {
|
||||
&self.name
|
||||
}
|
||||
|
||||
async fn check(&self, _topology: &T) -> ProbeAttempt {
|
||||
let connect = TcpStream::connect(self.address.as_str());
|
||||
match tokio::time::timeout(self.connect_timeout, connect).await {
|
||||
Ok(Ok(_stream)) => ProbeAttempt::Ok(Some(format!("connected to {}", self.address))),
|
||||
Ok(Err(e)) => classify_connect_error(&self.address, e),
|
||||
Err(_elapsed) => ProbeAttempt::Retry(format!(
|
||||
"connect to {} timed out after {:?}",
|
||||
self.address, self.connect_timeout
|
||||
)),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Most connect errors look like "not ready yet" — we want to keep
|
||||
/// polling. A small set is definitive and should short-circuit the
|
||||
/// retry budget. `InvalidInput` and `Unsupported` both mean the
|
||||
/// address is unparseable / unroutable on this host — no number of
|
||||
/// retries will fix that.
|
||||
fn classify_connect_error(address: &str, e: io::Error) -> ProbeAttempt {
|
||||
match e.kind() {
|
||||
io::ErrorKind::InvalidInput => {
|
||||
ProbeAttempt::Fatal(format!("invalid address {address}: {e}"))
|
||||
}
|
||||
io::ErrorKind::Unsupported => {
|
||||
ProbeAttempt::Fatal(format!("unsupported address {address}: {e}"))
|
||||
}
|
||||
// ConnectionRefused / ConnectionReset / TimedOut /
|
||||
// HostUnreachable / NetworkUnreachable / NotFound (rare) —
|
||||
// all consistent with "service not up yet". Keep polling.
|
||||
_ => ProbeAttempt::Retry(format!("connect to {address}: {e}")),
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use crate::companion::smoke::probe::{ProbeFailure, ProbeOutcome, RetryPolicy, run_probe};
|
||||
use async_trait::async_trait;
|
||||
use harmony::topology::PreparationOutcome;
|
||||
use tokio::net::TcpListener;
|
||||
|
||||
#[derive(Debug)]
|
||||
struct NoopTopology;
|
||||
|
||||
#[async_trait]
|
||||
impl Topology for NoopTopology {
|
||||
fn name(&self) -> &str {
|
||||
"noop"
|
||||
}
|
||||
async fn ensure_ready(
|
||||
&self,
|
||||
) -> Result<PreparationOutcome, harmony::topology::PreparationError> {
|
||||
Ok(PreparationOutcome::Noop)
|
||||
}
|
||||
}
|
||||
|
||||
fn name(s: &str) -> ProbeName {
|
||||
ProbeName::try_new(s).unwrap()
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn passes_against_a_bound_listener() {
|
||||
// Bind to a kernel-assigned port and prove the probe sees
|
||||
// the listener. Using 127.0.0.1 keeps this hermetic — no
|
||||
// outbound network, no flakiness from name resolution.
|
||||
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
|
||||
let addr = listener.local_addr().unwrap();
|
||||
let probe = TcpReachable::new(name("local-listener"), addr.to_string());
|
||||
match run_probe(&probe, &NoopTopology, RetryPolicy::once()).await {
|
||||
ProbeOutcome::Pass { attempts, .. } => assert_eq!(attempts, 1),
|
||||
other => panic!("expected pass, got {other:?}"),
|
||||
}
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn retries_then_passes_when_listener_appears_late() {
|
||||
// Pick a free port, then race: probe runs immediately, then
|
||||
// we bind 30 ms later. The polling retry should catch it.
|
||||
let pre = TcpListener::bind("127.0.0.1:0").await.unwrap();
|
||||
let addr = pre.local_addr().unwrap();
|
||||
drop(pre); // free the port
|
||||
|
||||
let probe = TcpReachable::new(name("late-listener"), addr.to_string())
|
||||
.with_connect_timeout(Duration::from_millis(50));
|
||||
|
||||
let (probe_result, _) = tokio::join!(
|
||||
run_probe(
|
||||
&probe,
|
||||
&NoopTopology,
|
||||
RetryPolicy::polling(Duration::from_millis(10), Duration::from_secs(2)),
|
||||
),
|
||||
async {
|
||||
tokio::time::sleep(Duration::from_millis(40)).await;
|
||||
let listener = TcpListener::bind(addr).await.expect("rebind");
|
||||
// Hold the listener until the probe is satisfied.
|
||||
tokio::time::sleep(Duration::from_millis(500)).await;
|
||||
drop(listener);
|
||||
}
|
||||
);
|
||||
|
||||
match probe_result {
|
||||
ProbeOutcome::Pass { attempts, .. } => {
|
||||
assert!(
|
||||
attempts >= 2,
|
||||
"expected at least 2 attempts, got {attempts}"
|
||||
);
|
||||
}
|
||||
other => panic!("expected eventual pass, got {other:?}"),
|
||||
}
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn times_out_on_unused_port() {
|
||||
// Bind+drop to confirm the port is free in the kernel's
|
||||
// sense, then probe against it with a tight budget. With
|
||||
// nothing listening, the OS returns ECONNREFUSED, which we
|
||||
// classify as Retry → budget expiry → Timeout.
|
||||
let pre = TcpListener::bind("127.0.0.1:0").await.unwrap();
|
||||
let addr = pre.local_addr().unwrap();
|
||||
drop(pre);
|
||||
|
||||
let probe = TcpReachable::new(name("nothing-here"), addr.to_string())
|
||||
.with_connect_timeout(Duration::from_millis(50));
|
||||
let outcome = run_probe(
|
||||
&probe,
|
||||
&NoopTopology,
|
||||
RetryPolicy::polling(Duration::from_millis(20), Duration::from_millis(100)),
|
||||
)
|
||||
.await;
|
||||
match outcome {
|
||||
ProbeOutcome::Fail {
|
||||
failure: ProbeFailure::Timeout { .. },
|
||||
..
|
||||
} => {}
|
||||
other => panic!("expected Timeout, got {other:?}"),
|
||||
}
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn fatal_on_unparseable_address() {
|
||||
// Connect to a syntactically invalid address. No amount of
|
||||
// retrying will help; we should short-circuit to Fatal.
|
||||
let probe = TcpReachable::new(name("bad-addr"), "not-an-address");
|
||||
let outcome = run_probe(
|
||||
&probe,
|
||||
&NoopTopology,
|
||||
RetryPolicy::polling(Duration::from_millis(20), Duration::from_secs(5)),
|
||||
)
|
||||
.await;
|
||||
match outcome {
|
||||
ProbeOutcome::Fail {
|
||||
failure: ProbeFailure::Rejected { detail },
|
||||
attempts,
|
||||
..
|
||||
} => {
|
||||
assert!(detail.contains("not-an-address"));
|
||||
assert_eq!(attempts, 1, "Fatal must short-circuit on attempt 1");
|
||||
}
|
||||
other => panic!("expected Rejected, got {other:?}"),
|
||||
}
|
||||
}
|
||||
}
|
||||
287
fleet/harmony-fleet-deploy/src/companion/smoke/suite.rs
Normal file
287
fleet/harmony-fleet-deploy/src/companion/smoke/suite.rs
Normal file
@@ -0,0 +1,287 @@
|
||||
//! Smoke suite — an ordered sequence of [`Probe`]s with per-probe
|
||||
//! retry budgets, and the report that comes back from running it.
|
||||
//!
|
||||
//! Suites are sequential by design in Phase 0. The mental model is
|
||||
//! "first prove the basic connectivity, then prove the derived
|
||||
//! state" — which is inherently ordered. Parallel stages are a
|
||||
//! deferred extension; the public API never exposed sequentiality
|
||||
//! as a guarantee, so adding `parallel_stage` later is additive.
|
||||
//!
|
||||
//! A [`SmokeReport`] is a pure value — easy to log, easy to
|
||||
//! serialize for the dashboard, easy to assert against in tests. It
|
||||
//! deliberately mirrors the
|
||||
//! [`Outcome`](harmony::interpret::Outcome)/[`InterpretError`](harmony::interpret::InterpretError)
|
||||
//! split that the rest of the framework uses: one type per "what
|
||||
//! happened", not a Result soup.
|
||||
|
||||
use std::time::{Duration, Instant};
|
||||
|
||||
use harmony::topology::Topology;
|
||||
use tracing::{info, info_span};
|
||||
|
||||
use super::probe::{Probe, ProbeName, ProbeOutcome, RetryPolicy, run_probe};
|
||||
|
||||
/// One stage in a [`SmokeSuite`]: a probe plus the retry budget that
|
||||
/// applies to *this* invocation.
|
||||
struct SmokeStage<T: Topology> {
|
||||
probe: Box<dyn Probe<T>>,
|
||||
policy: RetryPolicy,
|
||||
}
|
||||
|
||||
/// An ordered, retry-aware composition of probes for one
|
||||
/// [`SmokeTest`](super::SmokeTest).
|
||||
///
|
||||
/// Built via [`SmokeSuite::new`] + [`SmokeSuite::stage`]. Run via
|
||||
/// [`SmokeSuite::run`]. The same probe type can appear multiple
|
||||
/// times with different policies — that's the point of supplying
|
||||
/// the policy at the stage level, not on the probe itself.
|
||||
pub struct SmokeSuite<T: Topology> {
|
||||
stages: Vec<SmokeStage<T>>,
|
||||
}
|
||||
|
||||
impl<T: Topology> Default for SmokeSuite<T> {
|
||||
fn default() -> Self {
|
||||
Self::new()
|
||||
}
|
||||
}
|
||||
|
||||
impl<T: Topology> std::fmt::Debug for SmokeSuite<T> {
|
||||
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
|
||||
f.debug_struct("SmokeSuite")
|
||||
.field(
|
||||
"stages",
|
||||
&self
|
||||
.stages
|
||||
.iter()
|
||||
.map(|s| s.probe.name().as_str())
|
||||
.collect::<Vec<_>>(),
|
||||
)
|
||||
.finish()
|
||||
}
|
||||
}
|
||||
|
||||
impl<T: Topology> SmokeSuite<T> {
|
||||
pub fn new() -> Self {
|
||||
Self { stages: Vec::new() }
|
||||
}
|
||||
|
||||
/// Append a stage. Builder-style so an `assemble` method reads
|
||||
/// like a checklist:
|
||||
///
|
||||
/// ```ignore
|
||||
/// SmokeSuite::new()
|
||||
/// .stage(TcpReachable::new(name, "nats:4222"), RetryPolicy::polling(...))
|
||||
/// .stage(HttpHealthy::new(name, "/healthz"), RetryPolicy::polling(...))
|
||||
/// ```
|
||||
pub fn stage<P>(mut self, probe: P, policy: RetryPolicy) -> Self
|
||||
where
|
||||
P: Probe<T> + 'static,
|
||||
{
|
||||
self.stages.push(SmokeStage {
|
||||
probe: Box::new(probe) as Box<dyn Probe<T>>,
|
||||
policy,
|
||||
});
|
||||
self
|
||||
}
|
||||
|
||||
pub fn len(&self) -> usize {
|
||||
self.stages.len()
|
||||
}
|
||||
|
||||
pub fn is_empty(&self) -> bool {
|
||||
self.stages.is_empty()
|
||||
}
|
||||
|
||||
/// Drive every stage in declared order. Stops on the first
|
||||
/// failure — a smoke test is a chain of preconditions, so
|
||||
/// continuing past a failure rarely produces useful information
|
||||
/// and risks flooding the dashboard with cascade failures.
|
||||
pub async fn run(self, topology: &T) -> SmokeReport {
|
||||
let span = info_span!("smoke_suite", probes = self.stages.len());
|
||||
let _guard = span.enter();
|
||||
|
||||
let start = Instant::now();
|
||||
let mut probes = Vec::with_capacity(self.stages.len());
|
||||
|
||||
for stage in self.stages {
|
||||
let name = stage.probe.name().clone();
|
||||
info!(probe = %name, "smoke stage started");
|
||||
let outcome = run_probe(&*stage.probe, topology, stage.policy).await;
|
||||
let passed = outcome.passed();
|
||||
info!(
|
||||
probe = %name,
|
||||
passed,
|
||||
elapsed_ms = elapsed_ms(&outcome),
|
||||
"smoke stage finished",
|
||||
);
|
||||
probes.push(ProbeReport { name, outcome });
|
||||
if !passed {
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
SmokeReport {
|
||||
probes,
|
||||
elapsed: start.elapsed(),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
fn elapsed_ms(outcome: &ProbeOutcome) -> u64 {
|
||||
let d = match outcome {
|
||||
ProbeOutcome::Pass { elapsed, .. } | ProbeOutcome::Fail { elapsed, .. } => *elapsed,
|
||||
};
|
||||
d.as_millis() as u64
|
||||
}
|
||||
|
||||
/// What one probe stage produced. The `name` is repeated here (also
|
||||
/// owned by the probe itself) so a `SmokeReport` is fully
|
||||
/// self-contained — the probe `Box` is dropped after the suite runs.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct ProbeReport {
|
||||
pub name: ProbeName,
|
||||
pub outcome: ProbeOutcome,
|
||||
}
|
||||
|
||||
impl ProbeReport {
|
||||
pub fn passed(&self) -> bool {
|
||||
self.outcome.passed()
|
||||
}
|
||||
}
|
||||
|
||||
/// What [`SmokeSuite::run`] returns. The dashboard reads probes top
|
||||
/// to bottom; the orchestrator reads `passed()` to gate the deploy.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct SmokeReport {
|
||||
pub probes: Vec<ProbeReport>,
|
||||
pub elapsed: Duration,
|
||||
}
|
||||
|
||||
impl SmokeReport {
|
||||
/// `true` iff every probe in the suite passed.
|
||||
///
|
||||
/// An empty suite passes — there's nothing to fail. Callers who
|
||||
/// want "a smoke test ran" semantics should check `!is_empty()`
|
||||
/// in addition.
|
||||
pub fn passed(&self) -> bool {
|
||||
self.probes.iter().all(ProbeReport::passed)
|
||||
}
|
||||
|
||||
pub fn is_empty(&self) -> bool {
|
||||
self.probes.is_empty()
|
||||
}
|
||||
|
||||
/// Iterator over only the failed probes — convenient for
|
||||
/// rendering "X failures of Y" without filtering at the call
|
||||
/// site.
|
||||
pub fn failed(&self) -> impl Iterator<Item = &ProbeReport> {
|
||||
self.probes.iter().filter(|p| !p.passed())
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use crate::companion::smoke::probe::{ProbeAttempt, ProbeFailure};
|
||||
use async_trait::async_trait;
|
||||
use harmony::topology::PreparationOutcome;
|
||||
|
||||
#[derive(Debug)]
|
||||
struct NoopTopology;
|
||||
|
||||
#[async_trait]
|
||||
impl Topology for NoopTopology {
|
||||
fn name(&self) -> &str {
|
||||
"noop"
|
||||
}
|
||||
async fn ensure_ready(
|
||||
&self,
|
||||
) -> Result<PreparationOutcome, harmony::topology::PreparationError> {
|
||||
Ok(PreparationOutcome::Noop)
|
||||
}
|
||||
}
|
||||
|
||||
#[derive(Debug)]
|
||||
struct Always {
|
||||
name: ProbeName,
|
||||
attempt: ProbeAttempt,
|
||||
}
|
||||
|
||||
impl Always {
|
||||
fn new(name: &str, attempt: ProbeAttempt) -> Self {
|
||||
Self {
|
||||
name: ProbeName::try_new(name).unwrap(),
|
||||
attempt,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#[async_trait]
|
||||
impl<T: Topology> Probe<T> for Always {
|
||||
fn name(&self) -> &ProbeName {
|
||||
&self.name
|
||||
}
|
||||
async fn check(&self, _t: &T) -> ProbeAttempt {
|
||||
self.attempt.clone()
|
||||
}
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn empty_suite_passes_vacuously() {
|
||||
let report = SmokeSuite::<NoopTopology>::new().run(&NoopTopology).await;
|
||||
assert!(report.passed());
|
||||
assert!(report.is_empty());
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn all_pass_stages_aggregates_pass() {
|
||||
let report = SmokeSuite::<NoopTopology>::new()
|
||||
.stage(
|
||||
Always::new("a", ProbeAttempt::Ok(None)),
|
||||
RetryPolicy::once(),
|
||||
)
|
||||
.stage(
|
||||
Always::new("b", ProbeAttempt::Ok(Some("ok".to_string()))),
|
||||
RetryPolicy::once(),
|
||||
)
|
||||
.run(&NoopTopology)
|
||||
.await;
|
||||
assert!(report.passed());
|
||||
assert_eq!(report.probes.len(), 2);
|
||||
assert!(report.failed().count() == 0);
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn first_failure_short_circuits_remaining_stages() {
|
||||
// The third stage must never run. We assert by name —
|
||||
// its presence in the report would prove a regression.
|
||||
let report = SmokeSuite::<NoopTopology>::new()
|
||||
.stage(
|
||||
Always::new("a", ProbeAttempt::Ok(None)),
|
||||
RetryPolicy::once(),
|
||||
)
|
||||
.stage(
|
||||
Always::new("b", ProbeAttempt::Fatal("nope".to_string())),
|
||||
RetryPolicy::once(),
|
||||
)
|
||||
.stage(
|
||||
Always::new("c", ProbeAttempt::Ok(None)),
|
||||
RetryPolicy::once(),
|
||||
)
|
||||
.run(&NoopTopology)
|
||||
.await;
|
||||
assert!(!report.passed());
|
||||
assert_eq!(report.probes.len(), 2);
|
||||
assert_eq!(report.probes[0].name.as_str(), "a");
|
||||
assert_eq!(report.probes[1].name.as_str(), "b");
|
||||
// Cascading failures are noise — confirm only the actual
|
||||
// culprit appears.
|
||||
match &report.probes[1].outcome {
|
||||
ProbeOutcome::Fail {
|
||||
failure: ProbeFailure::Rejected { detail },
|
||||
..
|
||||
} => assert_eq!(detail, "nope"),
|
||||
other => panic!("expected Rejected, got {other:?}"),
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -18,14 +18,21 @@
|
||||
//! VM/SSH target is `FleetDeviceSetupScore` in `harmony` core,
|
||||
//! slated to move here in a follow-up).
|
||||
//!
|
||||
//! - [`companion::smoke`] — the smoke-test contract. `Probe` trait,
|
||||
//! `SmokeSuite` composition, `SmokeTest` companion, and the
|
||||
//! `deploy`/`deploy_with_smoke` free functions. Implements ADR-023
|
||||
//! P4 ("deploy returns only after smoke-test success") through the
|
||||
//! companion seam from P7. First-party `TcpReachable` probe ships
|
||||
//! in this PR; HTTP/K8s/NATS probes and a real
|
||||
//! `FleetOperatorSmokeTest` follow in Phase 1.
|
||||
//!
|
||||
//! Out of scope for now:
|
||||
//!
|
||||
//! - **Callout** (mock-OIDC fixture + `NatsAuthCalloutScore` wiring) —
|
||||
//! PR 1.5.
|
||||
//! - **Smoke-test contract**. ADR-023 principle 4 says `deploy`
|
||||
//! returns only after a smoke test passes; the trait/struct shape
|
||||
//! for that contract is open. The e2e harness's `ping` test plays
|
||||
//! the role of the smoke test for now.
|
||||
//! - **Phase 1 smoke wiring**. Concrete `FleetOperatorSmokeTest`
|
||||
//! plus HTTP/K8s/NATS probes plus optional `HarmonyEvent` variants
|
||||
//! for the dashboard.
|
||||
|
||||
pub mod agent;
|
||||
pub mod companion;
|
||||
@@ -35,6 +42,7 @@ pub mod server;
|
||||
|
||||
pub use agent::{FleetAgentScore, PodTarget};
|
||||
pub use companion::AgentObservation;
|
||||
pub use companion::smoke;
|
||||
pub use nats::{FleetNatsScore, UserPassCredentials};
|
||||
pub use operator::{FleetOperatorScore, OperatorCredentials};
|
||||
pub use server::FleetServerScore;
|
||||
|
||||
Reference in New Issue
Block a user
Idea : associate an interpret type instead? The idea is that we have many scores that point to the same interpret. Of course locking to a score makes the code smaller and easier to understand, but will inevitably lead to boilerplate and a lot of repetition when similar scores exist. For example a smoke test on a HelmChartScore that valudates the helm chart is ready would not work with a NatsHelmChartScore as it is not the same type at the top level but would work with both if we use the Interpret type which is the same for both.