feat(fleet-deploy): smoke-test contract as a Score companion #292

Open
johnride wants to merge 3 commits from feat/smoke-test-contract into master
Owner

Phase 0 of the smoke-test contract from ADR-023 P4 ("deploy returns
only after smoke-test success"). Lands inside harmony-fleet-deploy
under companion/smoke/ so the shape can be validated against a real
consumer before promoting to a top-level crate.

Three small types + a wrapper, zero edits to the Score / Interpret /
Maestro public API:

  Probe               single-attempt unit; classifies its own observation
                      as Ok / Retry / Fatal via ProbeAttempt.
  SmokeSuite          ordered list of (probe, RetryPolicy) stages;
                      sequential by design.
  SmokeTest           companion trait paired with a Score by associated
                      type: `SM: SmokeTest<T, Score = S>` is the
                      compile-time lock (ADR-024 §2 / JG's
                      compile-time-feedback principle).
  deploy /            free functions binding Score::interpret to an
  deploy_with_smoke   optional SmokeTest. The smoke variant returns only
                      after the suite passes; a failed probe is
                      DeployError::SmokeFailed with the full report
                      attached.
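
As a rough illustration of the compile-time lock described above, here is a minimal synchronous sketch. The `Topology` and `Score` traits are simplified stand-ins (the real ones live in harmony and are presumably async), and the `run` method, the `InterpretFailed` variant, and the `LocalTopology` / `WebScore` / `WebSmoke` types are hypothetical names introduced only for this example.

```rust
// Simplified stand-ins for the real harmony traits (assumption: the real
// Topology / Score are richer and async).
pub trait Topology {}

pub trait Score<T: Topology> {
    fn interpret(&self, topology: &T) -> Result<(), String>;
}

/// Companion trait: each smoke test names the one Score type it verifies.
pub trait SmokeTest<T: Topology>: Send + Sync {
    type Score: Score<T>;
    /// Hypothetical method: Err carries a human-readable report.
    fn run(&self, score: &Self::Score, topology: &T) -> Result<(), String>;
}

#[derive(Debug, PartialEq)]
pub enum DeployError {
    InterpretFailed(String), // hypothetical variant, not named in the PR
    SmokeFailed(String),
}

/// Plain deploy: interpret the score, no smoke test.
pub fn deploy<T: Topology, S: Score<T>>(score: &S, topology: &T) -> Result<(), DeployError> {
    score.interpret(topology).map_err(DeployError::InterpretFailed)
}

/// Smoke variant: returns Ok only after the smoke test passes.
/// `Score = S` rejects, at compile time, a smoke test written for another Score.
pub fn deploy_with_smoke<T, S, SM>(score: &S, topology: &T, smoke: &SM) -> Result<(), DeployError>
where
    T: Topology,
    S: Score<T>,
    SM: SmokeTest<T, Score = S>,
{
    deploy(score, topology)?;
    smoke.run(score, topology).map_err(DeployError::SmokeFailed)
}

// Hypothetical concrete types used to exercise the sketch.
pub struct LocalTopology;
impl Topology for LocalTopology {}

pub struct WebScore;
impl Score<LocalTopology> for WebScore {
    fn interpret(&self, _topology: &LocalTopology) -> Result<(), String> {
        Ok(())
    }
}

pub struct WebSmoke {
    pub healthy: bool,
}
impl SmokeTest<LocalTopology> for WebSmoke {
    type Score = WebScore;
    fn run(&self, _score: &WebScore, _topology: &LocalTopology) -> Result<(), String> {
        if self.healthy {
            Ok(())
        } else {
            Err("port 8080 unreachable".to_string())
        }
    }
}
```

Passing a `WebSmoke` together with any Score other than `WebScore` would fail to type-check, which is the feedback the associated type is meant to provide.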

Cardinality choices follow JG's For the Love of Compilers talk:
ProbeName is a validated newtype (rejected: empty / control-chars / >128
bytes), ProbeAttempt is a three-arm sum because "not ready yet"
and "definitively no" drive different orchestration paths, ProbeFailure
splits Timeout from Rejected because the operator actions are different.
RetryPolicy::polling panics on a zero interval at construction rather
than spinning the executor in production.
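
The cardinality choices in this paragraph could look roughly like the sketch below. The type names come from the description, but every field, payload, error variant, and the `polling` signature are assumptions made for illustration; the actual definitions in the PR may differ.

```rust
use std::time::Duration;

/// Validated newtype: construction is the only place a bad name can be rejected.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ProbeName(String);

/// Hypothetical error type for the three rejection rules.
#[derive(Debug, PartialEq, Eq)]
pub enum ProbeNameError {
    Empty,
    ControlChar,
    TooLong,
}

impl ProbeName {
    pub const MAX_BYTES: usize = 128;

    pub fn new(raw: &str) -> Result<Self, ProbeNameError> {
        if raw.is_empty() {
            return Err(ProbeNameError::Empty);
        }
        if raw.chars().any(char::is_control) {
            return Err(ProbeNameError::ControlChar);
        }
        // `len()` counts bytes, matching the ">128 bytes" rule.
        if raw.len() > Self::MAX_BYTES {
            return Err(ProbeNameError::TooLong);
        }
        Ok(Self(raw.to_owned()))
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}

/// Three arms: "not ready yet" (Retry) and "definitively no" (Fatal)
/// drive different orchestration paths. Payloads are assumed.
#[derive(Debug, PartialEq, Eq)]
pub enum ProbeAttempt {
    Ok,
    Retry(String),
    Fatal(String),
}

/// Timeout and Rejected call for different operator actions. Fields are assumed.
#[derive(Debug, PartialEq, Eq)]
pub enum ProbeFailure {
    Timeout { attempts: u32 },
    Rejected { reason: String },
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RetryPolicy {
    pub interval: Duration,
    pub timeout: Duration,
}

impl RetryPolicy {
    /// Panics on a zero interval so a busy loop cannot reach production.
    pub fn polling(interval: Duration, timeout: Duration) -> Self {
        assert!(!interval.is_zero(), "RetryPolicy::polling: interval must be non-zero");
        Self { interval, timeout }
    }
}
```

Failing at construction keeps the invalid states out of every later code path, so the orchestrator never has to re-check a name or an interval.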

One concrete probe ships in this PR — TcpReachable — to validate the
trait shape against real network I/O. run_probe orchestrates retry +
timeout outside the probe itself so each probe stays single-attempt
and unit-testable.
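
A blocking, std-only sketch of that split between a single-attempt probe and an external retry loop is shown below. It is an approximation under stated assumptions: the real code is presumably async, and the `attempt` method name, the enum payloads, the `RetryPolicy` fields, the `TcpReachable` fields, and the blanket impl for closures are all hypothetical.

```rust
use std::net::{SocketAddr, TcpStream};
use std::thread::sleep;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Eq)]
pub enum ProbeAttempt {
    Ok,
    Retry(String),
    Fatal(String),
}

#[derive(Debug, PartialEq, Eq)]
pub enum ProbeFailure {
    Timeout { attempts: u32 },
    Rejected { reason: String },
}

#[derive(Debug, Clone, Copy)]
pub struct RetryPolicy {
    pub interval: Duration,
    pub timeout: Duration,
}

/// A probe makes exactly one attempt and classifies what it saw.
pub trait Probe {
    fn attempt(&self) -> ProbeAttempt;
}

/// Assumption for testability: any closure can act as a probe.
impl<F> Probe for F
where
    F: Fn() -> ProbeAttempt,
{
    fn attempt(&self) -> ProbeAttempt {
        self()
    }
}

/// Single TCP connect; a failed connect is treated as "not ready yet".
pub struct TcpReachable {
    pub addr: SocketAddr,
    pub connect_timeout: Duration,
}

impl Probe for TcpReachable {
    fn attempt(&self) -> ProbeAttempt {
        match TcpStream::connect_timeout(&self.addr, self.connect_timeout) {
            Ok(_) => ProbeAttempt::Ok,
            Err(e) => ProbeAttempt::Retry(e.to_string()),
        }
    }
}

/// Retry and timeout live here, outside the probe.
/// Returns the number of attempts it took to succeed.
pub fn run_probe<P: Probe>(probe: &P, policy: RetryPolicy) -> Result<u32, ProbeFailure> {
    let deadline = Instant::now() + policy.timeout;
    let mut attempts = 0u32;
    loop {
        attempts += 1;
        match probe.attempt() {
            ProbeAttempt::Ok => return Ok(attempts),
            ProbeAttempt::Fatal(reason) => return Err(ProbeFailure::Rejected { reason }),
            ProbeAttempt::Retry(_) => {
                // Give up if the next attempt would start past the deadline.
                if Instant::now() + policy.interval > deadline {
                    return Err(ProbeFailure::Timeout { attempts });
                }
                sleep(policy.interval);
            }
        }
    }
}
```

Because the loop only sees `ProbeAttempt` values, a closure that returns a scripted sequence is enough to unit-test every retry path without touching the network.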

Stage progress is emitted via tracing::info_span! / info! today.
Adding HarmonyEvent::SmokeStage* variants is deferred to Phase 1
(Score / Interpret stay untouched in Phase 0).

Phase 1 follow-ups: HttpHealthy, K8sPodReady, NatsKvKeyExists probes;
a real FleetOperatorSmokeTest composed of those; optional dashboard
event variants. None of those touch the contract — that's the point
of locking the shape here.

Tests: 21 new (4 in tcp::tests, 8 in probe::tests, 3 in suite::tests,
4 in deploy::tests). Full crate: 39 / 39 passing. No regressions.
build/check.sh equivalents: cargo check --all-targets --all-features
clean, cargo fmt --check clean, cargo clippy with zero findings on
the new module.

johnride added 1 commit 2026-05-23 22:07:54 +00:00
feat(fleet-deploy): smoke-test contract as a Score companion
All checks were successful
Run Check Script / check (pull_request) Successful in 2m33s
1e898a7328
Phase 0 of the smoke-test contract from ADR-023 P4 ("deploy returns
only after smoke-test success"). Lands inside harmony-fleet-deploy
under companion/smoke/ so the shape can be validated against a real
consumer before promoting to a top-level crate.

Three small types + a wrapper, zero edits to the Score / Interpret /
Maestro public API:

  Probe         single-attempt unit; classifies its own observation
                as Ok / Retry / Fatal via ProbeAttempt.
  SmokeSuite    ordered list of (probe, RetryPolicy) stages; sequential
                by design.
  SmokeTest     companion trait paired with a Score by associated type
                — `SM: SmokeTest<T, Score = S>` is the compile-time
                lock (ADR-024 §2 / JG's compile-time-feedback principle).
  deploy /      free functions binding Score::interpret to an optional
  deploy_with_  SmokeTest. The smoke variant returns only after the
  smoke         suite passes; a failed probe is DeployError::SmokeFailed
                with the full report attached.

Cardinality choices follow JG's *For the Love of Compilers* talk:
ProbeName is a validated newtype (rejected: empty / control-chars /
>128 bytes), ProbeAttempt is a three-arm sum because "not ready yet"
and "definitively no" drive different orchestration paths, ProbeFailure
splits Timeout from Rejected because the operator actions are different.
RetryPolicy::polling panics on a zero interval at construction rather
than spinning the executor in production.

One concrete probe ships in this PR — TcpReachable — to validate the
trait shape against real network I/O. run_probe orchestrates retry +
timeout outside the probe itself so each probe stays single-attempt
and unit-testable.

Stage progress is emitted via tracing::info_span! / info! today.
Adding HarmonyEvent::SmokeStage* variants is deferred to Phase 1
(Score / Interpret stay untouched in Phase 0).

Phase 1 follow-ups: HttpHealthy, K8sPodReady, NatsKvKeyExists probes;
a real FleetOperatorSmokeTest composed of those; optional dashboard
event variants. None of those touch the contract — that's the point
of locking the shape here.

Tests: 21 new (4 in tcp::tests, 8 in probe::tests, 3 in suite::tests,
4 in deploy::tests). Full crate: 39 / 39 passing. No regressions.
build/check.sh equivalents: cargo check --all-targets --all-features
clean, cargo fmt --check clean, cargo clippy with zero findings on
the new module.
johnride added 2 commits 2026-05-25 12:28:51 +00:00
docs(fleet): v0.3 last-mile roadmap
All checks were successful
Run Check Script / check (pull_request) Successful in 2m25s
9deebab1ff
Authoritative plan for the last mile before fleet ships to a real
customer. Picks up where v0_2_plan.md left the chapter structure.

Twelve chapters, organized in execution order:

  1. Dashboard role enforcement (security gap, do right now)
  2. Operator restart + aggregator recovery (more critical than smoke)
  3. Application log forwarding companion (dashboard utility)
  4. Agent self-upgrade, NATS-coordinated, systemd-resident
  5. Graceful deployment upgrade (roll-forward only — customer ask)
  6. Init containers in PodmanV0Score
  7. System upgrade, rollback deferred to v0.4
  8. Secrets via Zitadel + OpenBao (blocked on harmony_secret work)
  9. Agent time-drift verification
  10. Phase 1 smoke wiring
  11. CI yaml minimization (longer-term)
  12. NATS callout CI hardening (minimal)

Customer constraints baked in: deployments are roll-forward only
(no auto-rollback on Deployment failure); system rollback half of
the upgrade ADR is deferred to v0.4 (snapshot is created but not
used for revert in v0.3); secrets must go through Zitadel + OpenBao
(no plaintext shortcut).

Includes:
  - feature checklist as a status table (14 items),
  - sequencing table with ordering rationale,
  - per-chapter goal / current state with file:line citations /
    plan / open questions / "done when",
  - out-of-scope table with target version + reason,
  - cross-cutting open questions Q1–Q5.

Format follows the user's "tables over prose" preference: every
multi-item section is either a table or bold-led bullets with
nested supporting detail. Scannable at three depths (30-second
scroll for bold leads, 2-minute read for nested detail, deep read
with code where it matters).
Merge pull request 'docs(fleet): v0.3 last-mile roadmap' (#296) from docs/v0-3-roadmap into feat/smoke-test-contract
All checks were successful
Run Check Script / check (pull_request) Successful in 2m20s
269ab2fbed
Reviewed-on: #296
johnride reviewed 2026-05-25 12:39:32 +00:00
johnride left a comment
Author
Owner

This all looks correct, but I have a feeling we could find a simpler design that does not involve the whole smokesuite -> smoke -> probe -> probeoutcome -> probestatus boilerplate.

It's not too bad. I think the design is mostly sound: the genericity on T is sound and makes things safe to run.

Aside from the associated type on Score, I don't have any problem with this design. I'm just not sure it is the correct approach.

Let's take a step back and explore a few out-of-the-box ideas.

@@ -0,0 +52,4 @@
pub trait SmokeTest<T: Topology>: Send + Sync {
    /// The Score this smoke test verifies. The type lock means
    /// `SM::Score = S` is enforced at every call site.
    type Score: Score<T>;
Author
Owner

Idea: associate an Interpret type instead? We have many Scores that point to the same Interpret. Locking to a Score makes the code smaller and easier to understand, but it will inevitably lead to boilerplate and a lot of repetition when similar Scores exist. For example, a smoke test on a HelmChartScore that validates the Helm chart is ready would not work with a NatsHelmChartScore, because the two are not the same type at the top level; it would work with both if we locked on the Interpret type, which is the same for both.
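
To make the idea concrete, here is a speculative sketch of what locking on the Interpret could look like. It assumes `Score` exposes its interpret as an associated type, which nothing in this PR confirms; the `execute`, `create_interpret`, and `run` signatures and the `K8s`, `HelmChartInterpret`, and `HelmReadySmoke` types are hypothetical.

```rust
pub trait Topology {}

pub trait Interpret<T: Topology> {
    fn execute(&self, topology: &T) -> Result<(), String>;
}

/// Assumption: the Score names its Interpret as an associated type.
pub trait Score<T: Topology> {
    type Interpret: Interpret<T>;
    fn create_interpret(&self) -> Self::Interpret;
}

/// The smoke test is locked to an Interpret rather than to one Score.
pub trait SmokeTest<T: Topology>: Send + Sync {
    type Interpret: Interpret<T>;
    fn run(&self, topology: &T) -> Result<(), String>;
}

pub fn deploy_with_smoke<T, S, SM>(score: &S, topology: &T, smoke: &SM) -> Result<(), String>
where
    T: Topology,
    S: Score<T>,
    // Any Score whose Interpret matches the smoke test's Interpret is accepted.
    SM: SmokeTest<T, Interpret = <S as Score<T>>::Interpret>,
{
    score.create_interpret().execute(topology)?;
    smoke.run(topology)
}

// Hypothetical concrete types.
pub struct K8s;
impl Topology for K8s {}

pub struct HelmChartInterpret {
    pub release: String,
}
impl Interpret<K8s> for HelmChartInterpret {
    fn execute(&self, _topology: &K8s) -> Result<(), String> {
        Ok(())
    }
}

pub struct HelmChartScore {
    pub release: String,
}
impl Score<K8s> for HelmChartScore {
    type Interpret = HelmChartInterpret;
    fn create_interpret(&self) -> HelmChartInterpret {
        HelmChartInterpret { release: self.release.clone() }
    }
}

pub struct NatsHelmChartScore;
impl Score<K8s> for NatsHelmChartScore {
    type Interpret = HelmChartInterpret;
    fn create_interpret(&self) -> HelmChartInterpret {
        HelmChartInterpret { release: "nats".to_string() }
    }
}

/// One smoke test, reusable by every Score that shares HelmChartInterpret.
pub struct HelmReadySmoke;
impl SmokeTest<K8s> for HelmReadySmoke {
    type Interpret = HelmChartInterpret;
    fn run(&self, _topology: &K8s) -> Result<(), String> {
        Ok(())
    }
}
```

The trade-off in this sketch is that the smoke test no longer receives the concrete Score, so anything it needs to check (release name, namespace) would have to come from the Interpret or the topology instead.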

All checks were successful
Run Check Script / check (pull_request) Successful in 2m20s
This pull request has changes conflicting with the target branch.
  • ROADMAP/fleet_platform/v0_3_plan.md
Reference: NationTech/harmony#292