Jean-Gabriel Gill-Couture 020ebcb1f9 refactor(fleet): deploy-architecture cleanup per ADR-023 — Scores everywhere, deploy crate, principles in CLAUDE.md

The previous e2e harness handrolled k8s manifests in `stack.rs`,
bypassing the Score-Topology-Interpret machinery harmony exists to
provide. This commit:

1. **ADR-023** codifies the rules: deploy with Scores (not
   manifests), e2e uses the same Scores as production, one Score
   per component, deploy blocks on smoke-test success, deploy logic
   lives in `*-deploy` crates, topologies are compile-time,
   thiserror over anyhow. CLAUDE.md mirrors the principles.

2. **New `fleet/harmony-fleet-deploy` crate** is the canonical home
   for fleet-component Scores:
   - `FleetOperatorScore` + helm-chart generator + `install_crds`
     moved out of `harmony::modules::fleet::operator` (they should
     never have lived in `harmony` core). `FleetServerScore`
     (composite of NATS + operator + Zitadel + callout) moved too.
   - New `FleetNatsScore` (preset over `NatsHelmChartScore` with
     fleet's required values; v1 supports `UserPass` auth, callout
     mode reserved on the public API for PR 1.5).
   - New `FleetAgentScore` with `FleetAgentTarget::Pod`; `Vm`
     target is a future variant that absorbs `FleetDeviceSetupScore`.
   - `harmony-fleet-deploy` binary built on the existing
     `harmony_cli` crate — no new CLI scaffolding.

3. **Operator runtime binary trimmed**: `Install` and `Chart`
   subcommands removed; both jobs now belong to
   `harmony-fleet-deploy`. The runtime binary becomes leaner.

4. **E2E harness rewritten** as a thin Score composer:
   `harmony-fleet-e2e/src/stack.rs` deploys the stack via
   `FleetNatsScore` + `FleetAgentScore`. The inline NATS manifest
   factory and the bespoke agent Pod renderer are gone.
   - Bring-up runs once per test binary via `shared_stack` +
     `tokio::sync::OnceCell` (matches the `fleet_e2e_demo` pattern).
   - Stale `e2e-*` namespaces from prior runs get pruned at
     startup so the leaks the OnceCell creates don't compound.

5. **`thiserror` for the agent's `CommandServer`** — replaces the
   anyhow-based surface with typed `CommandError` /
   `CommandServerError`.

6. **Memory** captures eight load-bearing principles (saved to
   `~/.claude/projects/.../memory/`) so future sessions don't drift
   back into manifest-handrolling.

Verified: `cargo test -p harmony-fleet-e2e --test ping` green
end-to-end against k3d in 25s warm.

2026-05-18 22:54:50 -04:00

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Build & Test Commands

# Full CI check (check + fmt + clippy + test)
./build/check.sh

# Individual commands
cargo check --all-targets --all-features --keep-going
cargo fmt --check          # Check formatting
cargo clippy               # Lint
cargo test                 # Run all tests

# Run a single test
cargo test -p <crate_name> <test_name>

# Run a specific example
cargo run -p <example_crate_name>

# Build the mdbook documentation
mdbook build

What Harmony Is

Harmony is the orchestration framework powering NationTech's vision of decentralized micro datacenters — small computing clusters deployed in homes, offices, and community spaces instead of hyperscaler facilities. The goal: make computing cleaner, more resilient, locally beneficial, and resistant to centralized points of failure (including geopolitical threats).

Harmony exists because existing IaC tools (Terraform, Ansible, Helm) are trapped in a YAML mud pit: static configuration files validated only at runtime, fragmented across tools, with errors surfacing at 3 AM instead of at compile time. Harmony replaces this entire class of tools with a single Rust codebase where the compiler catches infrastructure misconfigurations before anything is deployed.

This is not a wrapper around existing tools. It is a paradigm shift: infrastructure-as-real-code with compile-time safety guarantees that no YAML/HCL/DSL-based tool can provide.

The Score-Topology-Interpret Pattern

This is the core design pattern. Understand it before touching the codebase.

Score — declarative desired state. A Rust struct generic over T: Topology that describes what you want (e.g., "a PostgreSQL cluster", "DNS records for these hosts"). Scores are serializable, cloneable, idempotent.

Topology — infrastructure capabilities. Represents where things run and what the environment can do. Exposes capabilities as traits (DnsServer, K8sclient, HelmCommand, LoadBalancer, Firewall, etc.). Examples: K8sAnywhereTopology (local K3D or any K8s cluster), HAClusterTopology (bare-metal HA with redundant firewalls/switches).

Interpret — execution glue. Translates a Score into concrete operations against a Topology's capabilities. Returns an Outcome (SUCCESS, NOOP, FAILURE, RUNNING, QUEUED, BLOCKED).

The key insight — compile-time safety through trait bounds:

impl<T: Topology + DnsServer + DhcpServer> Score<T> for DnsScore { ... }

The compiler rejects any attempt to use DnsScore with a Topology that doesn't implement DnsServer and DhcpServer. Invalid infrastructure configurations become compilation errors, not runtime surprises.
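A slightly fuller sketch of the same idea, using simplified stand-ins for the traits named above. The method names (`apply`, `register_record`, `reserve_lease`) and both topology structs are assumptions made for illustration; the real signatures in the `harmony` crate may differ.

```rust
// Simplified, hypothetical stand-ins; not Harmony's real trait definitions.
trait Topology {
    fn name(&self) -> &'static str;
}
trait DnsServer {
    fn register_record(&self, host: &str) -> String;
}
trait DhcpServer {
    fn reserve_lease(&self, host: &str) -> String;
}

// A Score is generic over the Topology it can run against.
trait Score<T: Topology> {
    fn apply(&self, topology: &T) -> Vec<String>;
}

struct DnsScore {
    hosts: Vec<String>,
}

// DnsScore is only a Score for topologies that offer both capabilities.
impl<T: Topology + DnsServer + DhcpServer> Score<T> for DnsScore {
    fn apply(&self, topology: &T) -> Vec<String> {
        let mut operations = Vec::new();
        for host in &self.hosts {
            operations.push(topology.reserve_lease(host));
            operations.push(topology.register_record(host));
        }
        operations
    }
}

// Offers both capabilities, so DnsScore accepts it.
struct FirewallTopology;
impl Topology for FirewallTopology {
    fn name(&self) -> &'static str {
        "firewall"
    }
}
impl DnsServer for FirewallTopology {
    fn register_record(&self, host: &str) -> String {
        format!("dns:{}", host)
    }
}
impl DhcpServer for FirewallTopology {
    fn reserve_lease(&self, host: &str) -> String {
        format!("dhcp:{}", host)
    }
}

// Offers neither capability.
struct BareTopology;
impl Topology for BareTopology {
    fn name(&self) -> &'static str {
        "bare"
    }
}

// Given `let score = DnsScore { hosts: vec![] };`
//   score.apply(&FirewallTopology)  compiles
//   score.apply(&BareTopology)      is rejected at compile time
```

In this sketch the mistake of pairing `DnsScore` with `BareTopology` never reaches a running system; the build fails first.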

Higher-order topologies compose transparently:

FailoverTopology<T> — primary/replica orchestration
DecentralizedTopology<T> — multi-site coordination

If T: PostgreSQL, then FailoverTopology<T>: PostgreSQL automatically via blanket impls. Zero boilerplate.
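A minimal sketch of how such a generic impl might look. The `deploy_cluster` method, the `Site` type, and the field names are hypothetical; only the shape (a capability forwarded through a wrapper topology) is taken from the text above.

```rust
// Hypothetical capability trait; the real `PostgreSQL` trait may differ.
trait PostgreSQL {
    fn deploy_cluster(&self, name: &str) -> String;
}

// Higher-order topology wrapping two instances of an inner topology.
struct FailoverTopology<T> {
    primary: T,
    replica: T,
}

// One generic impl: any T that can run PostgreSQL makes
// FailoverTopology<T> able to run it too, with no per-backend code.
impl<T: PostgreSQL> PostgreSQL for FailoverTopology<T> {
    fn deploy_cluster(&self, name: &str) -> String {
        let primary = self.primary.deploy_cluster(&format!("{}-primary", name));
        let replica = self.replica.deploy_cluster(&format!("{}-replica", name));
        format!("{} | {}", primary, replica)
    }
}

// Illustrative leaf topology.
struct Site(&'static str);

impl PostgreSQL for Site {
    fn deploy_cluster(&self, name: &str) -> String {
        format!("{}:{}", self.0, name)
    }
}
```

Any Score bounded on `T: PostgreSQL` then accepts `FailoverTopology<Site>` without knowing it is a wrapper.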

Architecture (Hexagonal)

harmony/src/
├── domain/           # Core domain — the heart of the framework
│   ├── score.rs      # Score trait (desired state)
│   ├── topology/     # Topology trait + implementations
│   ├── interpret/    # Interpret trait + InterpretName enum (25+ variants)
│   ├── inventory/    # Physical infrastructure metadata (hosts, switches, mgmt interfaces)
│   ├── executors/    # Executor trait definitions
│   └── maestro/      # Orchestration engine (registers scores, manages topology state, executes)
├── infra/            # Infrastructure adapters (driven ports)
│   ├── opnsense/     # OPNsense firewall adapter
│   ├── brocade.rs    # Brocade switch adapter
│   ├── kube.rs       # Kubernetes executor
│   └── sqlx.rs       # Database executor
└── modules/          # Concrete deployment modules (23+)
    ├── k8s/          # Kubernetes (namespaces, deployments, ingress)
    ├── postgresql/   # CloudNativePG clusters + multi-site failover
    ├── okd/          # OpenShift bare-metal from scratch
    ├── helm/         # Helm chart inflation → vanilla K8s YAML
    ├── opnsense/     # OPNsense (DHCP, DNS, etc.)
    ├── monitoring/   # Prometheus, Alertmanager, Grafana
    ├── kvm/          # KVM virtual machine management
    ├── network/      # Network services (iPXE, TFTP, bonds)
    └── ...

Domain types to know: Inventory (read-only physical infra context), Maestro<T> (orchestrator — calls topology.ensure_ready() then executes scores), Outcome / InterpretError (execution results).
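The ordering described for `Maestro<T>` (ready the topology first, then execute the registered scores) can be sketched as follows. Everything here is a simplified assumption: the `register` and `run_all` names, the two-variant `Outcome`, and the `String` error type are illustrative and do not reproduce the real domain types.

```rust
#[derive(Debug, PartialEq)]
enum Outcome {
    Success,
    Failure,
}

// Hypothetical, reduced versions of the domain traits.
trait Topology {
    fn ensure_ready(&mut self) -> Result<(), String>;
}

trait Score<T: Topology> {
    fn name(&self) -> String;
    fn interpret(&self, topology: &T) -> Outcome;
}

struct Maestro<T: Topology> {
    topology: T,
    scores: Vec<Box<dyn Score<T>>>,
}

impl<T: Topology> Maestro<T> {
    fn new(topology: T) -> Self {
        Self { topology, scores: Vec::new() }
    }

    fn register(&mut self, score: Box<dyn Score<T>>) {
        self.scores.push(score);
    }

    // The topology is readied exactly once, before any score runs.
    fn run_all(&mut self) -> Result<Vec<(String, Outcome)>, String> {
        self.topology.ensure_ready()?;
        Ok(self
            .scores
            .iter()
            .map(|score| (score.name(), score.interpret(&self.topology)))
            .collect())
    }
}

// Illustrative topology that counts how often it was readied.
struct CountingTopology {
    ready_calls: u32,
}

impl Topology for CountingTopology {
    fn ensure_ready(&mut self) -> Result<(), String> {
        self.ready_calls += 1;
        Ok(())
    }
}

// Illustrative score that succeeds only on a readied topology.
struct ProbeScore(&'static str);

impl Score<CountingTopology> for ProbeScore {
    fn name(&self) -> String {
        self.0.to_string()
    }
    fn interpret(&self, topology: &CountingTopology) -> Outcome {
        if topology.ready_calls > 0 {
            Outcome::Success
        } else {
            Outcome::Failure
        }
    }
}
```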

Key Crates

| Crate | Purpose |
|---|---|
| `harmony` | Core framework: domain, infra adapters, deployment modules |
| `harmony_cli` | CLI + optional TUI (`--features tui`) |
| `harmony_config` | Unified config+secret management (env → SQLite → OpenBao → interactive prompt) |
| `harmony_secret` / `harmony_secret_derive` | Secret backends (LocalFile, OpenBao, Infisical) |
| `harmony_execution` | Execution engine |
| `harmony_agent` / `harmony_inventory_agent` | Persistent agent framework (NATS JetStream mesh), hardware discovery |
| `harmony_assets` | Asset management (URLs, local cache, S3) |
| `harmony_composer` | Infrastructure composition tool |
| `harmony-k8s` | Kubernetes utilities |
| `k3d` | Local K3D cluster management |
| `brocade` | Brocade network switch integration |
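The resolution chain listed for `harmony_config` (env → SQLite → OpenBao → interactive prompt) is a first-match-wins lookup over ordered sources. A rough sketch of that lookup, with in-memory maps standing in for the real backends; the `ConfigSource` trait, `MapSource`, and `resolve` are hypothetical names, not the crate's API.

```rust
use std::collections::HashMap;

// Hypothetical source abstraction; `harmony_config`'s real API may differ.
trait ConfigSource {
    fn name(&self) -> &'static str;
    fn get(&self, key: &str) -> Option<String>;
}

// In-memory stand-in for env, a local store, a vault, and so on.
struct MapSource {
    name: &'static str,
    values: HashMap<String, String>,
}

impl MapSource {
    fn new(name: &'static str, entries: &[(&str, &str)]) -> Self {
        let values = entries
            .iter()
            .map(|(key, value)| (key.to_string(), value.to_string()))
            .collect();
        Self { name, values }
    }
}

impl ConfigSource for MapSource {
    fn name(&self) -> &'static str {
        self.name
    }
    fn get(&self, key: &str) -> Option<String> {
        self.values.get(key).cloned()
    }
}

// Walk the chain in priority order; the first source holding the key wins.
// Returns the winning source's name alongside the value.
fn resolve(chain: &[&dyn ConfigSource], key: &str) -> Option<(&'static str, String)> {
    chain
        .iter()
        .find_map(|source| source.get(key).map(|value| (source.name(), value)))
}
```

Returning the source name next to the value is a design choice of this sketch: it makes it easy to report where a setting came from when a deploy misbehaves.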

OPNsense Crates

The opnsense-codegen and opnsense-api crates exist because OPNsense's automation ecosystem is poor — no typed API client exists. These are support crates, not the core of Harmony.

opnsense-codegen: XML model files → IR → Rust structs with serde helpers for OPNsense wire format quirks (opn_bool for "0"/"1" strings, opn_u16/opn_u32 for string-encoded numbers). Vendor sources are git submodules under opnsense-codegen/vendor/.
opnsense-api: Hand-written OpnsenseClient + generated model types in src/generated/.
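To make the wire-format quirks concrete, here is a std-only sketch of the conversions such helpers would have to perform. The function names are hypothetical and the strictness (rejecting anything other than "0" or "1") is an assumption; the generated code uses serde helpers whose exact behavior is not shown here.

```rust
// Booleans travel as the strings "0" and "1".
fn parse_opn_bool(raw: &str) -> Result<bool, String> {
    match raw {
        "1" => Ok(true),
        "0" => Ok(false),
        other => Err(format!("expected \"0\" or \"1\", got {:?}", other)),
    }
}

fn format_opn_bool(value: bool) -> &'static str {
    if value {
        "1"
    } else {
        "0"
    }
}

// Numbers travel as decimal strings; out-of-range values are errors.
fn parse_opn_u16(raw: &str) -> Result<u16, String> {
    raw.parse::<u16>()
        .map_err(|err| format!("invalid u16 {:?}: {}", raw, err))
}
```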

Key Design Decisions (ADRs in docs/adr/)

ADR-001: Rust chosen for type system, refactoring safety, and performance
ADR-002: Hexagonal architecture — domain isolated from adapters
ADR-003: Infrastructure abstractions at domain level, not provider level (no vendor lock-in)
ADR-005: Custom Rust DSL over YAML/Score-spec — real language, Cargo deps, composable
ADR-007: K3D as default runtime (K8s-certified, lightweight, cross-platform)
ADR-009: Helm charts inflated to vanilla K8s YAML, then deployed via existing code paths
ADR-015: Higher-order topologies via blanket trait impls (zero-cost composition)
ADR-016: Agent-based architecture with NATS JetStream for real-time failover and distributed consensus
ADR-020: Unified config+secret management — Rust struct is the schema, resolution chain: env → store → prompt
ADR-023: Deploy architecture — Scores everywhere (incl. tests), per-component *-deploy crates, deploy blocks on smoke-test, topologies are compile-time

Capability and Score Design Rules

Capabilities are industry concepts, not tools. A capability trait represents a standard infrastructure need (e.g., DnsServer, LoadBalancer, Router, CertificateManagement) that can be fulfilled by different products. OPNsense provides DnsServer today; CoreDNS or Route53 could provide it tomorrow. Scores must not break when the backend changes.

Exception: When the developer fundamentally needs to know the implementation. PostgreSQL is a capability (not Database) because the developer writes PostgreSQL-specific SQL and replication configs. Swapping to MariaDB would break the application, not just the infrastructure.

Test: If you could swap the underlying tool without rewriting any Score that uses the capability, the boundary is correct.

Don't name capabilities after tools. SecretVault not OpenbaoStore. IdentityProvider not ZitadelAuth. Think: what is the core developer need that leads to using this tool?
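The swap test above can be sketched with a capability trait and two interchangeable backends. `SecretVault` is the name used in the text; the `read_secret` method, both fake backends, and `database_password` are hypothetical and exist only to illustrate the boundary.

```rust
use std::collections::HashMap;

// Capability named after the need, not the product.
trait SecretVault {
    fn read_secret(&self, path: &str) -> Option<String>;
}

// Two in-memory stand-ins for real backends. Neither talks to a real service.
struct FakeOpenBao {
    kv: HashMap<String, String>,
}

struct FakeLocalFile {
    lines: Vec<(String, String)>,
}

impl SecretVault for FakeOpenBao {
    fn read_secret(&self, path: &str) -> Option<String> {
        self.kv.get(path).cloned()
    }
}

impl SecretVault for FakeLocalFile {
    fn read_secret(&self, path: &str) -> Option<String> {
        self.lines
            .iter()
            .find(|(key, _)| key.as_str() == path)
            .map(|(_, value)| value.clone())
    }
}

// Consumer code depends only on the capability, so swapping the
// backend requires no change here.
fn database_password<V: SecretVault>(vault: &V) -> Result<String, String> {
    vault
        .read_secret("db/password")
        .ok_or_else(|| "secret db/password not found".to_string())
}
```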

Scores encapsulate operational complexity. Move procedural knowledge (init sequences, retry logic, distribution-specific config) into Scores. A high-level example should be ~15 lines, not ~400 lines of imperative orchestration.

Scores must be idempotent. Running a Score twice must leave the system in the same state as running it once. Use create-or-update semantics and handle "already exists" gracefully.
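A minimal sketch of create-or-update against an in-memory stand-in for a cluster. `FakeCluster`, `ensure_namespace`, and the two-variant `Outcome` are illustrative assumptions; the point is only that the second run observes the desired state and reports a no-op instead of failing on "already exists".

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Outcome {
    Success,
    Noop,
}

// Hypothetical cluster state: namespace name -> owner label.
struct FakeCluster {
    namespaces: BTreeMap<String, String>,
}

// Converge on the desired state instead of issuing a blind "create".
fn ensure_namespace(cluster: &mut FakeCluster, name: &str, owner: &str) -> Outcome {
    match cluster.namespaces.get(name) {
        // Already in the desired state: nothing to do.
        Some(existing) if existing.as_str() == owner => Outcome::Noop,
        // Missing or drifted: create or update in one step.
        _ => {
            cluster.namespaces.insert(name.to_string(), owner.to_string());
            Outcome::Success
        }
    }
}
```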

Scores must not depend on execution order. Declare capability requirements via trait bounds; don't assume another Score ran first. If Score B needs what Score A provides, Score B should declare that capability as a trait bound.

See docs/guides/writing-a-score.md for the full guide.

Deploy Architecture (ADR-023)

The Score-Topology-Interpret pattern above tells you how to describe a deployment. The rules below tell you how to ship one. These are non-negotiable.

Deploy with Scores, not handrolled manifests. No k8s_openapi::api::* structs outside of Score::interpret bodies. CLIs, examples, and test harnesses all compose *Score types — they never reimplement deploys. If you find yourself building Deployment / Service / ConfigMap structs in a test harness, stop: that's the YAML-mud-pit anti-pattern in Rust clothing. Reach for the existing Score, or write a missing Score in the right deploy crate.

E2E uses the same Scores as production. Only the Topology instance changes (local k3d, remote OKD, bare-metal HA). A test harness is a Score-composer running against a test Topology. If e2e needs something prod doesn't, add the knob to the Score — don't fork the manifest in the harness.

One Score per deployable component. Composition is the user-facing primitive: MyAppScore pulls in PostgresScore, HttpServerScore, etc. Don't build monolithic "deploy everything" Scores; build small testable ones and compose upward.
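Composition might look like the sketch below. The Score names come from the paragraph above, but the reduced `Score` trait, the `components` method, and the struct fields are hypothetical; real Scores are generic over a Topology, which is omitted here to keep the example short.

```rust
// Reduced, hypothetical Score trait for illustration only.
trait Score {
    fn name(&self) -> String;

    // Ordered list of components this Score brings up.
    // Leaf Scores list only themselves.
    fn components(&self) -> Vec<String> {
        vec![self.name()]
    }
}

struct PostgresScore {
    database: String,
}

struct HttpServerScore {
    port: u16,
}

// The composite owns small Scores and delegates to them.
struct MyAppScore {
    postgres: PostgresScore,
    http: HttpServerScore,
}

impl Score for PostgresScore {
    fn name(&self) -> String {
        format!("postgres/{}", self.database)
    }
}

impl Score for HttpServerScore {
    fn name(&self) -> String {
        format!("http/{}", self.port)
    }
}

impl Score for MyAppScore {
    fn name(&self) -> String {
        "my-app".to_string()
    }

    fn components(&self) -> Vec<String> {
        let mut all = self.postgres.components();
        all.extend(self.http.components());
        all.push(self.name());
        all
    }
}
```

Each leaf Score stays testable on its own, and the composite adds no deployment logic beyond ordering its parts.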

Deploy returns only after smoke-test success. Every Score owns a readiness + smoke-test contract that the framework runs and blocks on. helm install && hope is the anti-pattern harmony exists to fix. Convergence errors must be actionable in the style of rustc's error messages, not "exit code 1 from helm". (The implementation shape of the smoke-test contract is open; the principle is locked in.)
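Since the contract's shape is still open, the following is only one possible reading of "block until the smoke test passes", not the framework's design. `deploy_blocking`, `DeployError`, and the closure-based interface are all hypothetical, and a real implementation would wait or back off between attempts.

```rust
use std::fmt;

#[derive(Debug, PartialEq)]
enum DeployError {
    ApplyFailed(String),
    SmokeTestFailed { attempts: u32, last_reason: String },
}

impl fmt::Display for DeployError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DeployError::ApplyFailed(reason) => write!(f, "apply failed: {}", reason),
            DeployError::SmokeTestFailed { attempts, last_reason } => write!(
                f,
                "smoke test still failing after {} attempt(s): {}",
                attempts, last_reason
            ),
        }
    }
}

// Apply once, then poll the smoke test up to `max_attempts` times.
// Returns the attempt number on which the smoke test first passed.
fn deploy_blocking<A, S>(apply: A, mut smoke_test: S, max_attempts: u32) -> Result<u32, DeployError>
where
    A: FnOnce() -> Result<(), String>,
    S: FnMut() -> Result<(), String>,
{
    apply().map_err(DeployError::ApplyFailed)?;

    let mut last_reason = String::from("smoke test never ran");
    for attempt in 1..=max_attempts {
        match smoke_test() {
            Ok(()) => return Ok(attempt),
            Err(reason) => last_reason = reason,
        }
    }
    Err(DeployError::SmokeTestFailed { attempts: max_attempts, last_reason })
}
```

The caller never sees success until the smoke test has passed, and a failure carries the last observed reason rather than a bare exit code.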

Deploy logic lives in a *-deploy crate that depends on both harmony and the runtime crate. Runtime binaries (the thing that ships to constrained devices and to in-cluster pods) stay free of the harmony dep. Pattern: harmony_agent/deploy, fleet/harmony-fleet-deploy. Each app area gets one deploy crate that holds every component's Score plus a main.rs driven by harmony_cli that selects which component to deploy.

Topologies are compiled in, selected at runtime. A deploy binary statically lists its supported topologies; the user picks one at deploy time. Adding a new topology backend requires a rebuild, which is an acceptable cost because dynamic-discovery topologies like K8sAnywhere already cover "any physical place that runs k8s". No Box<dyn Topology> plugin loaders.
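One way to get a closed topology set with a runtime choice is an enum plus a `match` that dispatches into generic, statically typed code. The sketch below assumes hypothetical names throughout (`SupportedTopology`, `LocalK3d`, `RemoteOkd`, the `"k3d"` / `"okd"` selectors); it is not taken from an actual deploy binary.

```rust
use std::str::FromStr;

// Reduced, hypothetical Topology trait.
trait Topology {
    fn name(&self) -> &'static str;
}

struct LocalK3d;
struct RemoteOkd;

impl Topology for LocalK3d {
    fn name(&self) -> &'static str {
        "k3d"
    }
}

impl Topology for RemoteOkd {
    fn name(&self) -> &'static str {
        "okd"
    }
}

// The closed set of topologies this binary was built with.
#[derive(Debug, PartialEq, Clone, Copy)]
enum SupportedTopology {
    LocalK3d,
    RemoteOkd,
}

impl FromStr for SupportedTopology {
    type Err = String;

    fn from_str(raw: &str) -> Result<Self, Self::Err> {
        match raw {
            "k3d" => Ok(Self::LocalK3d),
            "okd" => Ok(Self::RemoteOkd),
            other => Err(format!(
                "unsupported topology '{}'; expected one of: k3d, okd",
                other
            )),
        }
    }
}

// Generic deploy path: monomorphized once per supported topology.
fn deploy<T: Topology>(topology: &T) -> String {
    format!("deployed on {}", topology.name())
}

// The runtime choice is a match over compile-time-known types.
fn deploy_selected(choice: SupportedTopology) -> String {
    match choice {
        SupportedTopology::LocalK3d => deploy(&LocalK3d),
        SupportedTopology::RemoteOkd => deploy(&RemoteOkd),
    }
}
```

Adding a variant forces every `match` to be revisited at compile time, which is the trade the paragraph above accepts in exchange for avoiding trait-object plugin loading.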

Extend Scores with companions, not API changes. New capabilities the framework wants to attach to Scores (planning, dry-run, observability, eventually smoke-test) default to a companion type or trait that wraps a Score rather than a new method on Score/Interpret. The base public API stays simple.
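A companion can be as small as a wrapper struct that borrows a Score. In this sketch the reduced `Score` trait, its `operations` method, `NamespaceScore`, and the `DryRun` wrapper are all hypothetical; the point is that dry-run reporting is added without touching the trait.

```rust
// Reduced, hypothetical Score trait. It knows nothing about dry runs.
trait Score {
    fn name(&self) -> String;
    fn operations(&self) -> Vec<String>;
}

struct NamespaceScore {
    namespace: String,
}

impl Score for NamespaceScore {
    fn name(&self) -> String {
        format!("namespace/{}", self.namespace)
    }
    fn operations(&self) -> Vec<String> {
        vec![format!("create-or-update namespace {}", self.namespace)]
    }
}

// Companion type: wraps any Score and layers a new capability on top.
struct DryRun<'a, S: Score> {
    score: &'a S,
}

impl<'a, S: Score> DryRun<'a, S> {
    fn new(score: &'a S) -> Self {
        Self { score }
    }

    // Describe what would happen without executing anything.
    fn report(&self) -> Vec<String> {
        self.score
            .operations()
            .into_iter()
            .map(|operation| format!("[dry-run] {}: {}", self.score.name(), operation))
            .collect()
    }
}
```

Existing Score implementations compile unchanged when a companion like this is introduced, which is the property the rule is protecting.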

CLI: hybrid, staged. Today (B): first-party tools ship as separate harmony-* binaries built on the existing harmony_cli crate. Tomorrow (C): a top-level harmony binary discovers harmony-* plugin binaries on $PATH (kubectl-style). The plugin protocol is not in scope for any current PR — dedicated future effort.

Use thiserror almost everywhere; anyhow only at binary glue. Library code, public crate boundaries, anything callers might want to match on — typed errors via thiserror. anyhow is reserved for main.rs-level glue where the error is just printed.
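To keep the example dependency-free, the sketch below writes out by hand roughly what a `thiserror` derive would generate (a `Display` impl plus `std::error::Error`), and uses `Box<dyn Error>` as a stand-in for `anyhow` at the glue layer. `ConfigError`, `parse_port`, and `run` are hypothetical names.

```rust
use std::error::Error;
use std::fmt;

// Library-side typed error: callers can match on the variant.
#[derive(Debug, PartialEq)]
enum ConfigError {
    MissingKey(String),
    InvalidPort { raw: String },
}

// With thiserror these messages would be #[error("...")] attributes.
impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigError::MissingKey(key) => write!(f, "missing config key '{}'", key),
            ConfigError::InvalidPort { raw } => write!(f, "invalid port '{}'", raw),
        }
    }
}

impl Error for ConfigError {}

// Library code returns the typed error.
fn parse_port(raw: Option<&str>) -> Result<u16, ConfigError> {
    let raw = raw.ok_or_else(|| ConfigError::MissingKey("port".to_string()))?;
    raw.parse::<u16>()
        .map_err(|_| ConfigError::InvalidPort { raw: raw.to_string() })
}

// Binary-level glue: the error is only going to be printed, so it is
// erased into an opaque boxed error (the role anyhow plays in main.rs).
fn run(raw: Option<&str>) -> Result<String, Box<dyn Error>> {
    let port = parse_port(raw)?;
    Ok(format!("listening on {}", port))
}
```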

See docs/adr/023-deploy-architecture.md for the full rationale, including what's explicitly deferred (Score derive macro, Score registry, plugin CLI discovery, inventory redesign, smoke-test contract shape).

Conventions

Rust edition 2024, resolver v2
Conventional commits: feat:, fix:, chore:, docs:, refactor:
Small PRs: max ~200 lines (excluding generated code), single-purpose
License: GNU AGPL v3
Quality bar: This framework demands high-quality engineering. The type system is a feature, not a burden. Leverage it. Prefer compile-time guarantees over runtime checks. Abstractions should be domain-level, not provider-specific.
