diff --git a/AGENTS.md b/AGENTS.md index 43847e1f..bf5525be 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -1,171 +1,94 @@ -# CLAUDE.md +# AGENTS.md -This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. +The engineering contract for this repo. Read it before any change. -## Build & Test Commands +## The bar: minimal, DRY, no bloat — non-negotiable -```bash -# Full CI check (check + fmt + clippy + test) -./build/check.sh +Match the implementation to the **real** complexity of the problem, and stop there. The shortest correct solution wins. If 200 lines do a 20-line job, it's wrong — rewrite it. A senior calling your code "overcomplicated" is a failed review. -# Individual commands -cargo check --all-targets --all-features --keep-going -cargo fmt --check # Check formatting -cargo clippy # Lint -cargo test # Run all tests +- **DRY in the purest sense.** Every piece of knowledge has exactly one authoritative representation. The same condition written twice — even in unrelated parts of the system — is a defect. *The discriminator:* the same check at two trust boundaries (untrusted frontend + trusted backend) is **not** duplication — those are distinct axes of the problem. The same check inside one trust domain (app layer + DB layer, both trusted) **is**. Always ask: does this axis exist in the problem, or only in my solution? +- **No speculative abstraction.** Solve the case in front of you, not a hypothetical future. Introduce an abstraction at the *second* real instance, not the first (Rule of Three). A struct/trait/enum/`AppSpec` at n=1 that is longer than what it replaces is a defect. YAGNI. +- **No ceremony.** No struct where a field works; no trait where a function works; no error enum where `anyhow` at the binary edge works; no test that exists only to exercise a helper you added to enable that test; no config knob nothing sets; no job/feature wired for an axis that doesn't exist yet. +- **Core logic stays in one cohesive home.** Never leak or scatter a concern across helpers, wrappers, and `util` modules. Don't reimplement in a test harness, CLI, or example what a Score (or any owning module) already does — compose it. +- **Comments: WHY only.** A non-obvious constraint, an invariant, a workaround. Never narrate WHAT the code does. If deleting a comment loses nothing, it was bloat. No multi-paragraph docstrings. +- **This governs everything an agent emits** — code, tests, docs, commit messages, and chat replies alike. Concise, information-dense, unbloated. Here brevity is correctness, not style. -# Run a single test -cargo test -p +Formal frame: minimize *accidental* complexity (Brooks); the floor is the problem's intrinsic (Kolmogorov) complexity. Over-duplication and over-abstraction are the **same** error — a description longer than the problem warrants. -# Run a specific example -cargo run -p +## What Harmony is -# Build the mdbook documentation -mdbook build -``` +Orchestration for **decentralized micro datacenters** — small clusters in homes, offices, and community spaces instead of hyperscalers. It replaces the Terraform/Ansible/Helm "YAML mud pit" (config validated at runtime, fragmented across tools, failing at 3 AM) with one Rust codebase where the compiler catches infrastructure misconfigurations before anything deploys. Infrastructure-as-real-code, not a wrapper around existing tools. -## What Harmony Is +## Score-Topology-Interpret — the core pattern -Harmony is the orchestration framework powering NationTech's vision of **decentralized micro datacenters** — small computing clusters deployed in homes, offices, and community spaces instead of hyperscaler facilities. The goal: make computing cleaner, more resilient, locally beneficial, and resistant to centralized points of failure (including geopolitical threats). +- **Score** — declarative desired state; a struct generic over `T: Topology`. Serializable, cloneable, idempotent. +- **Topology** — what an environment can do; exposes capabilities as traits (`DnsServer`, `K8sclient`, `HelmCommand`, `LoadBalancer`, …). E.g. `K8sAnywhereTopology` (local k3d or any cluster), `HAClusterTopology` (bare-metal HA). +- **Interpret** — translates a Score into operations against a Topology's capabilities; returns an `Outcome` (SUCCESS, NOOP, FAILURE, RUNNING, QUEUED, BLOCKED). -Harmony exists because existing IaC tools (Terraform, Ansible, Helm) are trapped in a **YAML mud pit**: static configuration files validated only at runtime, fragmented across tools, with errors surfacing at 3 AM instead of at compile time. Harmony replaces this entire class of tools with a single Rust codebase where **the compiler catches infrastructure misconfigurations before anything is deployed**. - -This is not a wrapper around existing tools. It is a paradigm shift: infrastructure-as-real-code with compile-time safety guarantees that no YAML/HCL/DSL-based tool can provide. - -## The Score-Topology-Interpret Pattern - -This is the core design pattern. Understand it before touching the codebase. - -**Score** — declarative desired state. A Rust struct generic over `T: Topology` that describes *what* you want (e.g., "a PostgreSQL cluster", "DNS records for these hosts"). Scores are serializable, cloneable, idempotent. - -**Topology** — infrastructure capabilities. Represents *where* things run and *what the environment can do*. Exposes capabilities as traits (`DnsServer`, `K8sclient`, `HelmCommand`, `LoadBalancer`, `Firewall`, etc.). Examples: `K8sAnywhereTopology` (local K3D or any K8s cluster), `HAClusterTopology` (bare-metal HA with redundant firewalls/switches). - -**Interpret** — execution glue. Translates a Score into concrete operations against a Topology's capabilities. Returns an `Outcome` (SUCCESS, NOOP, FAILURE, RUNNING, QUEUED, BLOCKED). - -**The key insight — compile-time safety through trait bounds:** +Compile-time safety through trait bounds: ```rust impl Score for DnsScore { ... } ``` -The compiler rejects any attempt to use `DnsScore` with a Topology that doesn't implement `DnsServer` and `DhcpServer`. Invalid infrastructure configurations become compilation errors, not runtime surprises. +A Topology missing a required capability is a compile error, not a runtime surprise. Higher-order topologies compose via blanket impls: if `T: PostgreSQL` then `FailoverTopology: PostgreSQL` automatically — zero boilerplate. -**Higher-order topologies** compose transparently: -- `FailoverTopology` — primary/replica orchestration -- `DecentralizedTopology` — multi-site coordination +## Capability & Score rules (non-negotiable) -If `T: PostgreSQL`, then `FailoverTopology: PostgreSQL` automatically via blanket impls. Zero boilerplate. +- **Capabilities are industry concepts, not tools.** `DnsServer`, not `OpnsenseDns`; `SecretVault`, not `OpenbaoStore`. Test: if you can swap the backend without rewriting any Score, the boundary is right. *Exception:* when the developer codes to the implementation — `PostgreSQL`, not `Database`, because you write PG-specific SQL. +- **Scores encapsulate operational complexity.** Init sequences, retries, distro quirks live inside the Score. A high-level example is ~15 lines, not ~400 of imperative orchestration. +- **Idempotent and order-independent.** Twice = once. Declare needs via trait bounds; never assume another Score ran first. -## Architecture (Hexagonal) +See `docs/guides/writing-a-score.md`. -``` -harmony/src/ -├── domain/ # Core domain — the heart of the framework -│ ├── score.rs # Score trait (desired state) -│ ├── topology/ # Topology trait + implementations -│ ├── interpret/ # Interpret trait + InterpretName enum (25+ variants) -│ ├── inventory/ # Physical infrastructure metadata (hosts, switches, mgmt interfaces) -│ ├── executors/ # Executor trait definitions -│ └── maestro/ # Orchestration engine (registers scores, manages topology state, executes) -├── infra/ # Infrastructure adapters (driven ports) -│ ├── opnsense/ # OPNsense firewall adapter -│ ├── brocade.rs # Brocade switch adapter -│ ├── kube.rs # Kubernetes executor -│ └── sqlx.rs # Database executor -└── modules/ # Concrete deployment modules (23+) - ├── k8s/ # Kubernetes (namespaces, deployments, ingress) - ├── postgresql/ # CloudNativePG clusters + multi-site failover - ├── okd/ # OpenShift bare-metal from scratch - ├── helm/ # Helm chart inflation → vanilla K8s YAML - ├── opnsense/ # OPNsense (DHCP, DNS, etc.) - ├── monitoring/ # Prometheus, Alertmanager, Grafana - ├── kvm/ # KVM virtual machine management - ├── network/ # Network services (iPXE, TFTP, bonds) - └── ... -``` +## Deploy architecture (ADR-023, non-negotiable) -Domain types to know: `Inventory` (read-only physical infra context), `Maestro` (orchestrator — calls `topology.ensure_ready()` then executes scores), `Outcome` / `InterpretError` (execution results). +- **Deploy with Scores, not handrolled manifests.** No `k8s_openapi::api::*` structs outside `Score::interpret`. CLIs, examples, **and test harnesses** compose `*Score` types. Building a `Deployment`/`Service`/`ConfigMap` in a harness is the mud pit in Rust clothing — reach for the Score, or write the missing one. +- **E2E uses the same Scores as prod.** Only the `Topology` changes. If e2e needs something prod doesn't, add a knob to the Score — don't fork the manifest. +- **One Score per deployable component; compose upward** (`MyAppScore` pulls in `PostgresScore`, …). No monolithic "deploy everything" Scores. +- **Deploy returns only after smoke-test success.** `helm install && hope` is the anti-pattern Harmony exists to kill. Convergence errors read like `rustc`, not "exit 1 from helm". (Contract shape open; principle locked.) +- **Deploy logic lives in a `*-deploy` crate** depending on both `harmony` and the runtime crate. Runtime binaries stay free of the `harmony` dep. One deploy crate per app area. +- **Topologies are compile-time, selected at runtime.** A deploy binary lists its topologies; a new backend is a rebuild. No `Box` loaders. +- **Extend Scores with companions, not API changes.** New cross-cutting capabilities (dry-run, observability, smoke-test) wrap a Score; the base `Score`/`Interpret` API stays small. +- **`thiserror` in libraries; `anyhow` only at binary glue** (`main.rs`-level, where the error is just printed). -## Key Crates +See `docs/adr/023-deploy-architecture.md`. + +## Architecture (hexagonal, ADR-002) + +`harmony/src/` — `domain/` (`score.rs`, `topology/`, `interpret/`, `inventory/`, `maestro/`) is the framework core; `infra/` (opnsense, brocade, kube, sqlx) holds driven adapters; `modules/` (k8s, postgresql, okd, helm, monitoring, kvm, network, …) holds concrete deployment modules. The domain stays isolated from adapters. + +## Key crates | Crate | Purpose | |---|---| | `harmony` | Core framework: domain, infra adapters, deployment modules | | `harmony_cli` | CLI + optional TUI (`--features tui`) | -| `harmony_config` | Unified config+secret management (env → SQLite → OpenBao → interactive prompt) | -| `harmony_secret` / `harmony_secret_derive` | Secret backends (LocalFile, OpenBao, Infisical) | +| `harmony_config` | Unified config+secret management (env → SQLite → OpenBao → prompt) | +| `harmony_secret` / `_derive` | Secret backends (LocalFile, OpenBao, Infisical) | | `harmony_execution` | Execution engine | | `harmony_agent` / `harmony_inventory_agent` | Persistent agent framework (NATS JetStream mesh), hardware discovery | | `harmony_assets` | Asset management (URLs, local cache, S3) | | `harmony_composer` | Infrastructure composition tool | | `harmony-k8s` | Kubernetes utilities | | `k3d` | Local K3D cluster management | -| `brocade` | Brocade network switch integration | +| `brocade` | Brocade switch integration | +| `opnsense-codegen` / `opnsense-api` | Typed OPNsense client (XML models → Rust); exist because OPNsense ships no typed API. Support crates, not core. | -## OPNsense Crates +## Key ADRs (`docs/adr/`) -The `opnsense-codegen` and `opnsense-api` crates exist because OPNsense's automation ecosystem is poor — no typed API client exists. These are support crates, not the core of Harmony. +001 Rust · 002 hexagonal · 003 capabilities at domain level (no vendor lock-in) · 005 Rust DSL over YAML · 007 k3d default runtime · 009 Helm inflated to vanilla k8s YAML · 015 higher-order topologies via blanket impls · 016 agent mesh on NATS JetStream · 020 unified config+secret · 023 deploy architecture. -- `opnsense-codegen`: XML model files → IR → Rust structs with serde helpers for OPNsense wire format quirks (`opn_bool` for "0"/"1" strings, `opn_u16`/`opn_u32` for string-encoded numbers). Vendor sources are git submodules under `opnsense-codegen/vendor/`. -- `opnsense-api`: Hand-written `OpnsenseClient` + generated model types in `src/generated/`. +## Build & test -## Key Design Decisions (ADRs in docs/adr/) - -- **ADR-001**: Rust chosen for type system, refactoring safety, and performance -- **ADR-002**: Hexagonal architecture — domain isolated from adapters -- **ADR-003**: Infrastructure abstractions at domain level, not provider level (no vendor lock-in) -- **ADR-005**: Custom Rust DSL over YAML/Score-spec — real language, Cargo deps, composable -- **ADR-007**: K3D as default runtime (K8s-certified, lightweight, cross-platform) -- **ADR-009**: Helm charts inflated to vanilla K8s YAML, then deployed via existing code paths -- **ADR-015**: Higher-order topologies via blanket trait impls (zero-cost composition) -- **ADR-016**: Agent-based architecture with NATS JetStream for real-time failover and distributed consensus -- **ADR-020**: Unified config+secret management — Rust struct is the schema, resolution chain: env → store → prompt -- **ADR-023**: Deploy architecture — Scores everywhere (incl. tests), per-component `*-deploy` crates, deploy blocks on smoke-test, topologies are compile-time - -## Capability and Score Design Rules - -**Capabilities are industry concepts, not tools.** A capability trait represents a standard infrastructure need (e.g., `DnsServer`, `LoadBalancer`, `Router`, `CertificateManagement`) that can be fulfilled by different products. OPNsense provides `DnsServer` today; CoreDNS or Route53 could provide it tomorrow. Scores must not break when the backend changes. - -**Exception:** When the developer fundamentally needs to know the implementation. `PostgreSQL` is a capability (not `Database`) because the developer writes PostgreSQL-specific SQL and replication configs. Swapping to MariaDB would break the application, not just the infrastructure. - -**Test:** If you could swap the underlying tool without rewriting any Score that uses the capability, the boundary is correct. - -**Don't name capabilities after tools.** `SecretVault` not `OpenbaoStore`. `IdentityProvider` not `ZitadelAuth`. Think: what is the core developer need that leads to using this tool? - -**Scores encapsulate operational complexity.** Move procedural knowledge (init sequences, retry logic, distribution-specific config) into Scores. A high-level example should be ~15 lines, not ~400 lines of imperative orchestration. - -**Scores must be idempotent.** Running twice = same result as once. Use create-or-update, handle "already exists" gracefully. - -**Scores must not depend on execution order.** Declare capability requirements via trait bounds, don't assume another Score ran first. If Score B needs what Score A provides, Score B should declare that capability as a trait bound. - -See `docs/guides/writing-a-score.md` for the full guide. - -## Deploy Architecture (ADR-023) - -The Score-Topology-Interpret pattern above tells you how to **describe** a deployment. The rules below tell you how to **ship** one. These are non-negotiable. - -**Deploy with Scores, not handrolled manifests.** No `k8s_openapi::api::*` structs outside of `Score::interpret` bodies. CLIs, examples, and **test harnesses** all compose `*Score` types — they never reimplement deploys. If you find yourself building `Deployment` / `Service` / `ConfigMap` structs in a test harness, stop: that's the YAML-mud-pit anti-pattern in Rust clothing. Reach for the existing Score, or write a missing Score in the right deploy crate. - -**E2E uses the same Scores as production.** Only the `Topology` instance changes (local k3d, remote OKD, bare-metal HA). A test harness is a Score-composer running against a test Topology. If e2e needs something prod doesn't, add the knob to the Score — don't fork the manifest in the harness. - -**One Score per deployable component.** Composition is the user-facing primitive: `MyAppScore` pulls in `PostgresScore`, `HttpServerScore`, etc. Don't build monolithic "deploy everything" Scores; build small testable ones and compose upward. - -**Deploy returns only after smoke-test success.** Every Score owns a readiness + smoke-test contract that the framework runs and blocks on. `helm install && hope` is the anti-pattern harmony exists to fix. Convergence errors must be actionable in the style of `rustc`'s error messages, not "exit code 1 from helm". (The implementation shape of the smoke-test contract is open; the principle is locked in.) - -**Deploy logic lives in a `*-deploy` crate** that depends on both `harmony` and the runtime crate. Runtime binaries (the thing that ships to constrained devices and to in-cluster pods) stay free of the `harmony` dep. Pattern: `harmony_agent/deploy`, `fleet/harmony-fleet-deploy`. *Each app area gets one deploy crate that holds every component's Score plus a `main.rs` driven by `harmony_cli` that selects which component to deploy.* - -**Topologies are compile-time, selected at runtime.** A deploy binary statically lists its supported topologies; the user picks one at deploy time. Adding a new topology backend is a rebuild — that's an acceptable cost because dynamic-discovery topologies like `K8sAnywhere` already cover "any physical place that runs k8s". No `Box` plugin loaders. - -**Extend Scores with companions, not API changes.** New capabilities the framework wants to attach to Scores (planning, dry-run, observability, eventually smoke-test) default to a *companion* type or trait that wraps a Score rather than a new method on `Score`/`Interpret`. The base public API stays simple. - -**CLI: hybrid, staged.** Today (B): first-party tools ship as separate `harmony-*` binaries built on the existing `harmony_cli` crate. Tomorrow (C): a top-level `harmony` binary discovers `harmony-*` plugin binaries on `$PATH` (`kubectl`-style). The plugin protocol is **not** in scope for any current PR — dedicated future effort. - -**Use `thiserror` almost everywhere; `anyhow` only at binary glue.** Library code, public crate boundaries, anything callers might want to match on — typed errors via `thiserror`. `anyhow` is reserved for `main.rs`-level glue where the error is just printed. - -See `docs/adr/023-deploy-architecture.md` for the full rationale, including what's explicitly deferred (Score derive macro, Score registry, plugin CLI discovery, inventory redesign, smoke-test contract shape). +```bash +./build/check.sh # full CI: check + fmt + clippy + test +cargo check --all-targets --all-features --keep-going +cargo fmt --check +cargo clippy +cargo test # or: cargo test -p +cargo run -p +mdbook build +``` ## Conventions -- **Rust edition 2024**, resolver v2 -- **Conventional commits**: `feat:`, `fix:`, `chore:`, `docs:`, `refactor:` -- **Small PRs**: max ~200 lines (excluding generated code), single-purpose -- **License**: GNU AGPL v3 -- **Quality bar**: This framework demands high-quality engineering. The type system is a feature, not a burden. Leverage it. Prefer compile-time guarantees over runtime checks. Abstractions should be domain-level, not provider-specific. +Rust edition 2024, resolver v2. Conventional commits (`feat:`/`fix:`/`chore:`/`docs:`/`refactor:`). Small, single-purpose PRs (~200 lines, excluding generated code). License AGPL v3. Lean on the type system — compile-time guarantees over runtime checks; abstractions domain-level, not provider-specific.