feat(fleet-deploy): log-tail contract as a Score companion #295

Closed

johnride wants to merge 2 commits from feat/v0-3-logs-companion into feat/smoke-test-contract


Author

SHA1

Message

Date

Jean-Gabriel Gill-Couture

0793e72a05

feat(fleet-deploy): log-tail contract as a Score companion

Run Check Script / check (pull_request) Successful in 2m30s

Third Score companion (after AgentObservation and SmokeTest), per
ADR-023 P7 — new framework capabilities attach as companions rather
than as additions to the Score / Interpret public API. Powers the
dashboard's "View logs" UX: customer clicks the button, gets the last
N lines of a deployment's container output from the device.

Trait + transport-side impl + unit tests ship now. The agent-side
Verb::Logs handler and the operator dashboard handler land in
follow-up PRs against the contract locked here — splitting keeps each
diff focused and reviewable.

Three small types + a NATS-backed impl, zero edits to Score /
Interpret / Maestro:

  LogChunk          pure value: source identifier, captured_at, lines
                    (oldest-first), truncated flag. No transport,
                    no async, no IO — the dashboard renders it,
                    the transport layer constructs it.
  LogQueryError     six arms, each mapped to a distinct operator
                    action (DeviceOffline vs Timeout vs Agent vs
                    BadReply vs Transport vs InvalidReply). Mirrors
                    the FleetCommandsClient::CommandError shape used
                    by Verb::Ping so callers see uniform error
                    surfaces across verbs.
  LogQuery<T>       companion trait paired with a Score by associated
                    type — Q: LogQuery<T, Score = S> is the same
                    compile-time lock SmokeTest uses. A future
                    K8sLogQuery follows the same shape, no
                    Box<dyn LogQuery> needed (topologies are
                    compile-time per ADR-023 P6).
  PodmanLogQuery    NATS request/reply impl targeting
                    device-commands.<id>.logs. Splits routing
                    (LogQueryRouting) from transport so unit tests
                    verify the exact wire bytes without a NATS
                    client. Saturates LogsRequest.lines at
                    LOGS_MAX_LINES on the operator side as
                    defense-in-depth (the agent will clamp again).
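
A minimal sketch of how these pieces could fit together, not the code in this PR. The type and variant names come from the list above; the field types, the variant payloads, the synchronous `tail` method, `PodmanTopology`, `StubLogQuery`, and the `tail_for` helper are assumptions made for illustration, and the real trait is presumably async.

```rust
use std::time::SystemTime;

/// Pure value the dashboard renders: no transport, no IO.
#[derive(Debug, Clone, PartialEq)]
pub struct LogChunk {
    pub source: String,
    pub captured_at: SystemTime,
    pub lines: Vec<String>, // oldest-first
    pub truncated: bool,
}

/// The six arms named above; the String payloads are assumptions.
#[derive(Debug, PartialEq)]
pub enum LogQueryError {
    DeviceOffline,
    Timeout,
    Agent(String),
    BadReply(String),
    Transport(String),
    InvalidReply(String),
}

/// Simplified stand-in for the framework's Score trait.
pub trait Score<T> {}

/// Companion trait: the associated type names the Score it pairs with.
pub trait LogQuery<T> {
    type Score: Score<T>;
    fn tail(&self, topology: &T, lines: u32) -> Result<LogChunk, LogQueryError>;
}

pub struct PodmanTopology; // hypothetical topology type
pub struct PodmanV0Score;
impl Score<PodmanTopology> for PodmanV0Score {}

/// In-memory stand-in for PodmanLogQuery, so the sketch needs no NATS client.
pub struct StubLogQuery;
impl LogQuery<PodmanTopology> for StubLogQuery {
    type Score = PodmanV0Score;
    fn tail(&self, _topology: &PodmanTopology, lines: u32) -> Result<LogChunk, LogQueryError> {
        let all = vec!["boot".to_string(), "listening".to_string(), "ready".to_string()];
        let keep = (lines as usize).min(all.len());
        Ok(LogChunk {
            source: "web".to_string(),
            captured_at: SystemTime::now(),
            lines: all[all.len() - keep..].to_vec(),
            // Assumption: "truncated" means older lines exist beyond this chunk.
            truncated: keep < all.len(),
        })
    }
}

/// Compiles only when Q is the companion of S for topology T.
pub fn tail_for<T, S, Q>(
    _score: &S,
    query: &Q,
    topology: &T,
    lines: u32,
) -> Result<LogChunk, LogQueryError>
where
    S: Score<T>,
    Q: LogQuery<T, Score = S>,
{
    query.tail(topology, lines)
}
```

Calling `tail_for(&PodmanV0Score, &StubLogQuery, &PodmanTopology, 2)` type-checks. Passing a Score that `StubLogQuery` is not paired with fails at compile time, which is the lock the `Q: LogQuery<T, Score = S>` bound describes.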

reconciler-contracts gains Verb::Logs, LogsRequest, LogsReply, and
the LOGS_MAX_LINES bound. The wire shape lives there (not in the
deploy crate) so the agent build — which must not depend on harmony
— can serialize the same bytes. Adding the verb required zero
permission template changes: the agent's existing
device-commands.<id>.> subscription already covers it, and the
verb stays the trailing subject token so Verb::as_subject_token
keeps its invariant.
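
To illustrate why no permission change is needed, here is a small sketch of the subject layout and of NATS-style wildcard matching (`*` matches exactly one token, `>` matches one or more trailing tokens). The function names `logs_subject` and `subject_matches` are hypothetical and are not part of reconciler-contracts.

```rust
/// Builds the request subject; the verb stays the trailing token.
pub fn logs_subject(device_id: &str) -> String {
    format!("device-commands.{}.logs", device_id)
}

/// Minimal NATS-style subject matcher, for illustration only.
pub fn subject_matches(pattern: &str, subject: &str) -> bool {
    let mut p = pattern.split('.');
    let mut s = subject.split('.');
    loop {
        match (p.next(), s.next()) {
            // `>` must be the last pattern token and needs at least one subject token.
            (Some(">"), Some(_)) => return p.next().is_none(),
            // `*` matches exactly one non-empty token.
            (Some("*"), Some(tok)) => {
                if tok.is_empty() {
                    return false;
                }
            }
            // Literal tokens must be equal.
            (Some(a), Some(b)) => {
                if a != b {
                    return false;
                }
            }
            // Both exhausted at the same time: full match.
            (None, None) => return true,
            // One side ran out before the other.
            _ => return false,
        }
    }
}
```

With this matcher, `device-commands.dev-42.>` covers `device-commands.dev-42.logs` as it already covers any other verb token, but not a different device id.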

Tests assert behavior, not shape: subject_matches_documented_format
locks the wire so a callout permission change can't silently break
routing, request_body_clamps_oversized_n proves a buggy dashboard
"show all" button can't request an unbounded line count,
decode_reply_rejects_invalid_source_name proves a malicious agent
can't smuggle control characters past ProbeName validation, and
paired_score_type_is_podman_v0_score is a compile-time check that
catches refactors changing the associated type without updating
callers. 77 unit tests total across both crates, all passing without
requiring a real podman socket or NATS server.
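
The two behaviors behind request_body_clamps_oversized_n and decode_reply_rejects_invalid_source_name can be sketched as follows. This is an assumption-laden illustration: the value of `LOGS_MAX_LINES`, the `deployment` field, and the helper names `build_request` and `validate_source_name` are hypothetical, and the real validation goes through ProbeName rather than a free function.

```rust
/// Hypothetical value; the real bound lives in reconciler-contracts.
pub const LOGS_MAX_LINES: u32 = 1_000;

#[derive(Debug, PartialEq)]
pub struct LogsRequest {
    pub deployment: String, // assumed field
    pub lines: u32,
}

/// Operator-side construction saturates the line count at the bound.
pub fn build_request(deployment: &str, requested: u32) -> LogsRequest {
    LogsRequest {
        deployment: deployment.to_string(),
        lines: requested.min(LOGS_MAX_LINES),
    }
}

/// Rejects empty names and names carrying control characters
/// (for example terminal escape sequences) before they reach the dashboard.
pub fn validate_source_name(name: &str) -> Result<&str, String> {
    if name.is_empty() {
        return Err("source name is empty".to_string());
    }
    if name.chars().any(char::is_control) {
        return Err("source name contains control characters".to_string());
    }
    Ok(name)
}
```

Clamping on the operator side does not replace the agent-side clamp; it only ensures an oversized request never leaves the operator process.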

Deferred (still in scope for v0.3, separate PRs):
  - Agent-side Verb::Logs handler in command_server.rs (parses
    LogsRequest, resolves deployment->container with stricter
    [a-zA-Z0-9_.-]{1,128} validation, runs podman logs --tail,
    serializes LogsReply).
  - Operator dashboard handler at
    /deployments/<name>/devices/<id>/logs.
  - End-to-end integration test through a real podman container.
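
For the deferred agent-side handler, the `[a-zA-Z0-9_.-]{1,128}` rule can be checked without a regex dependency. The sketch below is hypothetical (the function names are not from the codebase) and only builds the argument list for `podman logs --tail`; it does not spawn a process.

```rust
/// True when `name` matches [a-zA-Z0-9_.-]{1,128}.
pub fn is_valid_container_name(name: &str) -> bool {
    // All accepted characters are ASCII, so byte length equals character count.
    let len_ok = (1..=128).contains(&name.len());
    let chars_ok = name
        .bytes()
        .all(|b| b.is_ascii_alphanumeric() || matches!(b, b'_' | b'.' | b'-'));
    len_ok && chars_ok
}

/// Builds the podman argument list, or None when the name fails validation.
pub fn podman_logs_args(container: &str, lines: u32) -> Option<Vec<String>> {
    if !is_valid_container_name(container) {
        return None;
    }
    Some(vec![
        "logs".to_string(),
        "--tail".to_string(),
        lines.to_string(),
        container.to_string(),
    ])
}
```

Passing the arguments as a list rather than through a shell string means that even a name that slipped past validation would not be interpreted by a shell.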

2026-05-26 07:06:58 -04:00

Jean-Gabriel Gill-Couture

1d6fb40236

refactor(fleet-deploy): collapse smoke companion to one trait, one method

Run Check Script / check (pull_request) Successful in 2m24s

Replaces the seven-file `companion/smoke/` directory (Probe trait +
ProbeAttempt + ProbeOutcome + ProbeFailure + ProbeName + RetryPolicy +
run_probe + SmokeSuite + SmokeStage + SmokeReport + SmokeTest +
SmokeAssemblyError, ~1500 LOC) with a single `companion/smoke.rs`.

Net delta: +638 / -1578.

The earlier draft was cardinality matching gone overboard. For what
is, in the end, "after deploy, run an async function and surface a
pipeline report", we now ship:

* `SmokeTest<T>` trait — one async method `verify(&T) -> SmokeReport`.
  Implementers write a regular async fn; no probe abstraction to
  learn, no suite builder, no policy value type.
* `SmokeReport { checks: Vec<CheckReport> }` — the pipeline data the
  dashboard renders top-to-bottom. A report passes iff it has at
  least one check AND every check passed (a smoke impl that forgets
  to push checks fails loudly, not silently).
* `CheckReport { name, passed, detail }` + `pass / pass_with / fail`
  constructors.
* `poll_until` and `tcp_reachable` — free functions, not traits.
  Implementers call them inside `verify` when useful. A future
  `http_healthy` or `k8s_pod_ready` lives in this file as another
  `async fn -> CheckReport`, not as another trait impl.
* `deploy` / `deploy_with_smoke` — free functions, returning
  `DeployOutcome { interpret, smoke }`. `deploy_with_smoke` blocks
  the deploy on smoke success (ADR-023 P4).
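
The report types and helpers listed above could look roughly like this. It is a blocking sketch using only the standard library: the real `poll_until` and `tcp_reachable` are async, and every signature here (argument order, `Option<String>` for `detail`, the `budget` and `interval` parameters) is an assumption.

```rust
use std::net::{SocketAddr, TcpStream};
use std::thread::sleep;
use std::time::{Duration, Instant};

#[derive(Debug, Clone, PartialEq)]
pub struct CheckReport {
    pub name: String,
    pub passed: bool,
    pub detail: Option<String>,
}

impl CheckReport {
    pub fn pass(name: &str) -> Self {
        Self { name: name.to_string(), passed: true, detail: None }
    }
    pub fn pass_with(name: &str, detail: &str) -> Self {
        Self { name: name.to_string(), passed: true, detail: Some(detail.to_string()) }
    }
    pub fn fail(name: &str, detail: &str) -> Self {
        Self { name: name.to_string(), passed: false, detail: Some(detail.to_string()) }
    }
}

#[derive(Debug, Default)]
pub struct SmokeReport {
    pub checks: Vec<CheckReport>,
}

impl SmokeReport {
    /// Passes only with at least one check and no failed check.
    pub fn passed(&self) -> bool {
        !self.checks.is_empty() && self.checks.iter().all(|c| c.passed)
    }
}

/// Retries `probe` until it returns true or the time budget is spent.
pub fn poll_until(
    name: &str,
    budget: Duration,
    interval: Duration,
    mut probe: impl FnMut() -> bool,
) -> CheckReport {
    let start = Instant::now();
    loop {
        if probe() {
            return CheckReport::pass(name);
        }
        if start.elapsed() >= budget {
            return CheckReport::fail(name, "budget exhausted");
        }
        sleep(interval);
    }
}

/// One TCP connect attempt with a timeout, reported as a check.
pub fn tcp_reachable(name: &str, addr: SocketAddr, timeout: Duration) -> CheckReport {
    match TcpStream::connect_timeout(&addr, timeout) {
        Ok(_) => CheckReport::pass_with(name, &format!("connected to {}", addr)),
        Err(e) => CheckReport::fail(name, &e.to_string()),
    }
}
```

A `verify` implementation would push the results of such calls into `SmokeReport::checks`; a body that returns without pushing anything yields a report for which `passed()` is false.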

Per the PR #292 review: the SmokeTest's associated type is now
`type Interpret: Interpret<T>` (was `type Score: Score<T>`). One
smoke impl can cover every Score that shares an Interpret —
HelmChartScore + NatsHelmChartScore both verified by one
HelmChartSmokeTest. Pairing is convention-only in v0.3 because
`Score::create_interpret` still returns `Box<dyn Interpret<T>>`;
closing the loop is an additive `type Interpret` on `Score`, deferred.
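
A simplified sketch of the pairing described above, under stated assumptions: the trait shapes are reduced to the minimum (`verify` returns `bool` here instead of `SmokeReport`, and everything is synchronous), and `K8sTopology`, `HelmChartInterpret`, and the `name` method are hypothetical.

```rust
pub trait Interpret<T> {
    fn name(&self) -> &'static str;
}

pub trait Score<T> {
    /// Still type-erased, which is why pairing is convention-only for now.
    fn create_interpret(&self) -> Box<dyn Interpret<T>>;
}

pub trait SmokeTest<T> {
    /// Pairs the smoke test with an Interpret rather than with a single Score.
    type Interpret: Interpret<T>;
    fn verify(&self, topology: &T) -> bool;
}

pub struct K8sTopology; // hypothetical topology type

pub struct HelmChartInterpret; // hypothetical shared interpret
impl Interpret<K8sTopology> for HelmChartInterpret {
    fn name(&self) -> &'static str {
        "helm-chart"
    }
}

pub struct HelmChartScore;
pub struct NatsHelmChartScore;

impl Score<K8sTopology> for HelmChartScore {
    fn create_interpret(&self) -> Box<dyn Interpret<K8sTopology>> {
        Box::new(HelmChartInterpret)
    }
}

impl Score<K8sTopology> for NatsHelmChartScore {
    fn create_interpret(&self) -> Box<dyn Interpret<K8sTopology>> {
        Box::new(HelmChartInterpret)
    }
}

/// One smoke test covers both Scores because they share an Interpret.
pub struct HelmChartSmokeTest;
impl SmokeTest<K8sTopology> for HelmChartSmokeTest {
    type Interpret = HelmChartInterpret;
    fn verify(&self, _topology: &K8sTopology) -> bool {
        true // a real impl would run checks against the cluster
    }
}
```

Because `create_interpret` returns a boxed trait object, the compiler cannot check that a given Score actually produces `HelmChartSmokeTest::Interpret`; an associated `type Interpret` on `Score` would make that link checkable.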

Tests (12 in this module):
- deploy runs interpret and returns the underlying Outcome
- deploy_with_smoke runs smoke only after interpret succeeds
- deploy_with_smoke returns SmokeFailed (with interpret preserved)
  when any check fails
- deploy_with_smoke rejects an empty report — no silent pass-through
- deploy_with_smoke skips smoke entirely when interpret fails
- SmokeReport::passed semantics (nonempty + all pass)
- poll_until pass-on-success, fail-on-budget
- tcp_reachable pass against a real loopback listener
- tcp_reachable fail with timeout against TEST-NET-1
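
The gating behavior these tests describe can be sketched with closures standing in for the interpret and the smoke test. This is an illustration only: the real functions are async and take framework types, `DeployError` and `InterpretFailed` are hypothetical names, and `interpret` is reduced to a `String`.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct CheckReport {
    pub name: String,
    pub passed: bool,
}

#[derive(Debug, Clone, PartialEq)]
pub struct SmokeReport {
    pub checks: Vec<CheckReport>,
}

impl SmokeReport {
    /// Nonempty and all checks passed.
    pub fn passed(&self) -> bool {
        !self.checks.is_empty() && self.checks.iter().all(|c| c.passed)
    }
}

#[derive(Debug, PartialEq)]
pub struct DeployOutcome {
    pub interpret: String,
    pub smoke: SmokeReport,
}

#[derive(Debug, PartialEq)]
pub enum DeployError {
    InterpretFailed(String),
    /// Keeps the interpret result so the caller can still report it.
    SmokeFailed { interpret: String, smoke: SmokeReport },
}

pub fn deploy_with_smoke<I, S>(interpret: I, smoke: S) -> Result<DeployOutcome, DeployError>
where
    I: FnOnce() -> Result<String, String>,
    S: FnOnce() -> SmokeReport,
{
    // Smoke never runs when interpret fails: `?` returns early.
    let interpret = interpret().map_err(DeployError::InterpretFailed)?;
    let smoke = smoke();
    if smoke.passed() {
        Ok(DeployOutcome { interpret, smoke })
    } else {
        // Covers both a failed check and an empty report.
        Err(DeployError::SmokeFailed { interpret, smoke })
    }
}
```

Taking `FnOnce` closures makes the ordering explicit: the smoke closure is consumed only after the interpret closure has returned `Ok`.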

2026-05-26 06:56:16 -04:00
