Compare commits
8 Commits
worktree-b
...
chore/road
| Author | SHA1 | Date | |
|---|---|---|---|
| 6ca8663422 | |||
| f6ce0c6d4f | |||
| 8a1eca21f7 | |||
| 9d2308eca6 | |||
| ccc26e07eb | |||
| 8798110bf3 | |||
| 1508d431c0 | |||
| caf6f0c67b |
923
Cargo.lock
generated
923
Cargo.lock
generated
File diff suppressed because it is too large
Load Diff
@@ -16,6 +16,8 @@ members = [
|
||||
"harmony_inventory_agent",
|
||||
"harmony_secret_derive",
|
||||
"harmony_secret",
|
||||
"examples/kvm_okd_ha_cluster",
|
||||
"examples/example_linux_vm",
|
||||
"harmony_config_derive",
|
||||
"harmony_config",
|
||||
"brocade",
|
||||
@@ -23,6 +25,7 @@ members = [
|
||||
"harmony_agent/deploy",
|
||||
"harmony_node_readiness",
|
||||
"harmony-k8s",
|
||||
"harmony_assets",
|
||||
]
|
||||
|
||||
[workspace.package]
|
||||
@@ -37,6 +40,7 @@ derive-new = "0.7"
|
||||
async-trait = "0.1"
|
||||
tokio = { version = "1.40", features = [
|
||||
"io-std",
|
||||
"io-util",
|
||||
"fs",
|
||||
"macros",
|
||||
"rt-multi-thread",
|
||||
@@ -73,6 +77,7 @@ base64 = "0.22.1"
|
||||
tar = "0.4.44"
|
||||
lazy_static = "1.5.0"
|
||||
directories = "6.0.0"
|
||||
futures-util = "0.3"
|
||||
thiserror = "2.0.14"
|
||||
serde = { version = "1.0.209", features = ["derive", "rc"] }
|
||||
serde_json = "1.0.127"
|
||||
@@ -86,3 +91,4 @@ reqwest = { version = "0.12", features = [
|
||||
"json",
|
||||
], default-features = false }
|
||||
assertor = "0.0.4"
|
||||
tokio-test = "0.4"
|
||||
|
||||
29
ROADMAP.md
Normal file
29
ROADMAP.md
Normal file
@@ -0,0 +1,29 @@
|
||||
# Harmony Roadmap
|
||||
|
||||
Six phases to take Harmony from a working prototype to a production-ready open-source project.
|
||||
|
||||
| # | Phase | Status | Depends On | Detail |
|
||||
|---|-------|--------|------------|--------|
|
||||
| 1 | [Harden `harmony_config`](ROADMAP/01-config-crate.md) | Not started | — | Test every source, add SQLite backend, wire Zitadel + OpenBao, validate zero-setup UX |
|
||||
| 2 | [Migrate to `harmony_config`](ROADMAP/02-refactor-harmony-config.md) | Not started | 1 | Replace all 19 `SecretManager` call sites, deprecate direct `harmony_secret` usage |
|
||||
| 3 | [Complete `harmony_assets`](ROADMAP/03-assets-crate.md) | Not started | 1, 2 | Test, refactor k3d and OKD to use it, implement `Url::Url`, remove LFS |
|
||||
| 4 | [Publish to GitHub](ROADMAP/04-publish-github.md) | Not started | 3 | Clean history, set up GitHub as community hub, CI on self-hosted runners |
|
||||
| 5 | [E2E tests: PostgreSQL & RustFS](ROADMAP/05-e2e-tests-simple.md) | Not started | 1 | k3d-based test harness, two passing E2E tests, CI job |
|
||||
| 6 | [E2E tests: OKD HA on KVM](ROADMAP/06-e2e-tests-kvm.md) | Not started | 5 | KVM test infrastructure, full OKD installation test, nightly CI |
|
||||
|
||||
## Current State (as of branch `feature/kvm-module`)
|
||||
|
||||
- `harmony_config` crate exists with `EnvSource`, `LocalFileSource`, `PromptSource`, `StoreSource`. 12 unit tests. **Zero consumers** in workspace — everything still uses `harmony_secret::SecretManager` directly (19 call sites).
|
||||
- `harmony_assets` crate exists with `Asset`, `LocalCache`, `LocalStore`, `S3Store`. **No tests. Zero consumers.** The `k3d` crate has its own `DownloadableAsset` with identical functionality and full test coverage.
|
||||
- `harmony_secret` has `LocalFileSecretStore`, `OpenbaoSecretStore` (token/userpass only), `InfisicalSecretStore`. Works but no Zitadel OIDC integration.
|
||||
- KVM module exists on this branch with `KvmExecutor`, VM lifecycle, ISO download, two examples (`example_linux_vm`, `kvm_okd_ha_cluster`).
|
||||
- RustFS module exists on `feat/rustfs` branch (2 commits ahead of master).
|
||||
- 39 example crates, **zero E2E tests**. Unit tests pass across workspace (~240 tests).
|
||||
- CI runs `cargo check`, `fmt`, `clippy`, `test` on Gitea. No E2E job.
|
||||
|
||||
## Guiding Principles
|
||||
|
||||
- **Zero-setup first**: A new user clones, runs `cargo run`, gets prompted for config, values persist to local SQLite. No env vars, no external services required.
|
||||
- **Progressive disclosure**: Local SQLite → OpenBao → Zitadel SSO. Each layer is opt-in.
|
||||
- **Test what ships**: Every example that works should have an E2E test proving it works.
|
||||
- **Community over infrastructure**: GitHub for engagement, self-hosted runners for CI.
|
||||
177
ROADMAP/01-config-crate.md
Normal file
177
ROADMAP/01-config-crate.md
Normal file
@@ -0,0 +1,177 @@
|
||||
# Phase 1: Harden `harmony_config`, Validate UX, Zero-Setup Starting Point
|
||||
|
||||
## Goal
|
||||
|
||||
Make `harmony_config` production-ready with a seamless first-run experience: clone, run, get prompted, values persist locally. Then progressively add team-scale backends (OpenBao, Zitadel SSO) without changing any calling code.
|
||||
|
||||
## Current State
|
||||
|
||||
`harmony_config` exists with:
|
||||
|
||||
- `Config` trait + `#[derive(Config)]` macro
|
||||
- `ConfigManager` with ordered source chain
|
||||
- Four `ConfigSource` implementations:
|
||||
- `EnvSource` — reads `HARMONY_CONFIG_{KEY}` env vars
|
||||
- `LocalFileSource` — reads/writes `{key}.json` files from a directory
|
||||
- `PromptSource` — **stub** (returns `None` / no-ops on set)
|
||||
- `StoreSource<S: SecretStore>` — wraps any `harmony_secret::SecretStore` backend
|
||||
- 12 unit tests (mock source, env, local file)
|
||||
- Global `CONFIG_MANAGER` static with `init()`, `get()`, `get_or_prompt()`, `set()`
|
||||
- **Zero workspace consumers** — nothing calls `harmony_config` yet
|
||||
|
||||
## Tasks
|
||||
|
||||
### 1.1 Add `SqliteSource` as the default zero-setup backend
|
||||
|
||||
Replace `LocalFileSource` (JSON files scattered in a directory) with a single SQLite database as the default local backend. `sqlx` with SQLite is already a workspace dependency.
|
||||
|
||||
```rust
|
||||
// harmony_config/src/source/sqlite.rs
|
||||
pub struct SqliteSource {
|
||||
pool: SqlitePool,
|
||||
}
|
||||
|
||||
impl SqliteSource {
|
||||
/// Opens or creates the database at the given path.
|
||||
/// Creates the `config` table if it doesn't exist.
|
||||
pub async fn open(path: PathBuf) -> Result<Self, ConfigError>
|
||||
|
||||
/// Uses the default Harmony data directory:
|
||||
/// ~/.local/share/harmony/config.db (Linux)
|
||||
pub async fn default() -> Result<Self, ConfigError>
|
||||
}
|
||||
|
||||
#[async_trait]
|
||||
impl ConfigSource for SqliteSource {
|
||||
async fn get(&self, key: &str) -> Result<Option<serde_json::Value>, ConfigError>
|
||||
async fn set(&self, key: &str, value: &serde_json::Value) -> Result<(), ConfigError>
|
||||
}
|
||||
```
|
||||
|
||||
Schema:
|
||||
|
||||
```sql
|
||||
CREATE TABLE IF NOT EXISTS config (
|
||||
key TEXT PRIMARY KEY,
|
||||
value TEXT NOT NULL,
|
||||
updated_at TEXT NOT NULL DEFAULT (datetime('now'))
|
||||
);
|
||||
```
|
||||
|
||||
**Tests**:
|
||||
- `test_sqlite_set_and_get` — round-trip a `TestConfig` struct
|
||||
- `test_sqlite_get_returns_none_when_missing` — key not in DB
|
||||
- `test_sqlite_overwrites_on_set` — set twice, get returns latest
|
||||
- `test_sqlite_concurrent_access` — two tasks writing different keys simultaneously
|
||||
- All tests use `tempfile::NamedTempFile` for the DB path
|
||||
|
||||
### 1.1.1 Add Config example to show exact DX and confirm functionality
|
||||
|
||||
Create `harmony_config/examples` that show how to use config crate with various backends.
|
||||
|
||||
Show how to use the derive macros, how to store secrets in a local backend or a Zitadel + OpenBao backend, how to fetch them from environment variables, etc. Explicitly outline each example's external dependencies in a comment at the top of that example. Explain how to configure Zitadel + OpenBao for that backend. The local backend should have zero dependencies and zero setup, storing its config/secrets with sane defaults.
|
||||
|
||||
Also show that a Config with default values will not prompt for values with defaults.
|
||||
|
||||
### 1.2 Make `PromptSource` functional
|
||||
|
||||
Currently `PromptSource::get()` returns `None` and `set()` is a no-op. Wire it to `interactive_parse::InteractiveParseObj`:
|
||||
|
||||
```rust
|
||||
#[async_trait]
|
||||
impl ConfigSource for PromptSource {
|
||||
async fn get(&self, _key: &str) -> Result<Option<serde_json::Value>, ConfigError> {
|
||||
// PromptSource never "has" a value — it's always a fallback.
|
||||
// The actual prompting happens in ConfigManager::get_or_prompt().
|
||||
Ok(None)
|
||||
}
|
||||
|
||||
async fn set(&self, _key: &str, _value: &serde_json::Value) -> Result<(), ConfigError> {
|
||||
// Prompt source doesn't persist. Other sources in the chain do.
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
The prompting logic is already in `ConfigManager::get_or_prompt()` via `T::parse_to_obj()`. The `PromptSource` struct exists mainly to hold the `PROMPT_MUTEX` and potentially a custom writer for TUI integration later.
|
||||
|
||||
**Key fix**: Ensure `get_or_prompt()` persists the prompted value to the **first writable source** (SQLite), not to all sources. Current code tries all sources — this is wrong for prompt-then-persist because you don't want to write prompted values to env vars.
|
||||
|
||||
```rust
|
||||
pub async fn get_or_prompt<T: Config>(&self) -> Result<T, ConfigError> {
|
||||
match self.get::<T>().await {
|
||||
Ok(config) => Ok(config),
|
||||
Err(ConfigError::NotFound { .. }) => {
|
||||
let config = T::parse_to_obj()
|
||||
.map_err(|e| ConfigError::PromptError(e.to_string()))?;
|
||||
let value = serde_json::to_value(&config)
|
||||
.map_err(|e| ConfigError::Serialization { key: T::KEY.to_string(), source: e })?;
|
||||
|
||||
// Persist to the first source that accepts writes (skip EnvSource)
|
||||
for source in &self.sources {
|
||||
if source.set(T::KEY, &value).await.is_ok() {
|
||||
break;
|
||||
}
|
||||
}
|
||||
Ok(config)
|
||||
}
|
||||
Err(e) => Err(e),
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Tests**:
|
||||
- `test_get_or_prompt_persists_to_first_writable_source` — mock source chain where first source is read-only, second is writable. Verify prompted value lands in second source.
|
||||
|
||||
### 1.3 Integration test: full resolution chain
|
||||
|
||||
Test the complete priority chain: env > sqlite > prompt.
|
||||
|
||||
```rust
|
||||
#[tokio::test]
|
||||
async fn test_full_resolution_chain() {
|
||||
// 1. No env var, no SQLite entry → prompting would happen
|
||||
// (test with mock/pre-seeded source instead of real stdin)
|
||||
// 2. Set in SQLite → get() returns SQLite value
|
||||
// 3. Set env var → get() returns env value (overrides SQLite)
|
||||
// 4. Remove env var → get() falls back to SQLite
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_branch_switching_scenario() {
|
||||
// Simulate: struct shape changes between branches.
|
||||
// Old value in SQLite doesn't match new struct.
|
||||
// get() should return Deserialization error.
|
||||
// get_or_prompt() should re-prompt and overwrite.
|
||||
}
|
||||
```
|
||||
|
||||
### 1.4 Validate Zitadel + OpenBao integration path
|
||||
|
||||
This is not about building the full OIDC flow yet. It's about validating that the architecture supports it by adding `StoreSource<OpenbaoSecretStore>` to the source chain.
|
||||
|
||||
**Validate**:
|
||||
- `ConfigManager::new(vec![EnvSource, SqliteSource, StoreSource<Openbao>])` compiles and works
|
||||
- When OpenBao is unreachable, the chain falls through to SQLite gracefully (no panic)
|
||||
- When OpenBao has the value, it's returned and SQLite is not queried
|
||||
|
||||
**Document** the target Zitadel OIDC flow as an ADR (RFC 8628 device authorization grant), but don't implement it yet. The `StoreSource` wrapping OpenBao with JWT auth is the integration point — Zitadel provides the JWT, OpenBao validates it.
|
||||
|
||||
### 1.5 UX validation checklist
|
||||
|
||||
Before this phase is done, manually verify:
|
||||
|
||||
- [ ] `cargo run --example postgresql` with no env vars → prompts for nothing (postgresql doesn't use secrets yet, but the config system initializes cleanly)
|
||||
- [ ] An example that uses `SecretManager` today (e.g., `brocade_snmp_server`) → when migrated to `harmony_config`, first run prompts, second run reads from SQLite
|
||||
- [ ] Setting `HARMONY_CONFIG_BrocadeSwitchAuth='{"host":"...","user":"...","password":"..."}'` → skips prompt, uses env value
|
||||
- [ ] Deleting `~/.local/share/harmony/config.db` → re-prompts on next run
|
||||
|
||||
## Deliverables
|
||||
|
||||
- [ ] `SqliteSource` implementation with tests
|
||||
- [ ] Functional `PromptSource` (or validated that current `get_or_prompt` flow is correct)
|
||||
- [ ] Fix `get_or_prompt` to persist to first writable source, not all sources
|
||||
- [ ] Integration tests for full resolution chain
|
||||
- [ ] Branch-switching deserialization failure test
|
||||
- [ ] `StoreSource<OpenbaoSecretStore>` integration validated (compiles, graceful fallback)
|
||||
- [ ] ADR for Zitadel OIDC target architecture
|
||||
112
ROADMAP/02-refactor-harmony-config.md
Normal file
112
ROADMAP/02-refactor-harmony-config.md
Normal file
@@ -0,0 +1,112 @@
|
||||
# Phase 2: Migrate Workspace to `harmony_config`
|
||||
|
||||
## Goal
|
||||
|
||||
Replace every direct `harmony_secret::SecretManager` call with `harmony_config` equivalents. After this phase, modules and examples depend only on `harmony_config`. `harmony_secret` becomes an internal implementation detail behind `StoreSource`.
|
||||
|
||||
## Current State
|
||||
|
||||
19 call sites use `SecretManager::get_or_prompt::<T>()` across:
|
||||
|
||||
| Location | Secret Types | Call Sites |
|
||||
|----------|-------------|------------|
|
||||
| `harmony/src/modules/brocade/brocade_snmp.rs` | `BrocadeSnmpAuth`, `BrocadeSwitchAuth` | 2 |
|
||||
| `harmony/src/modules/nats/score_nats_k8s.rs` | `NatsAdmin` | 1 |
|
||||
| `harmony/src/modules/okd/bootstrap_02_bootstrap.rs` | `RedhatSecret`, `SshKeyPair` | 2 |
|
||||
| `harmony/src/modules/application/features/monitoring.rs` | `NtfyAuth` | 1 |
|
||||
| `brocade/examples/main.rs` | `BrocadeSwitchAuth` | 1 |
|
||||
| `examples/okd_installation/src/main.rs` + `topology.rs` | `SshKeyPair`, `BrocadeSwitchAuth`, `OPNSenseFirewallConfig` | 3 |
|
||||
| `examples/okd_pxe/src/main.rs` + `topology.rs` | `SshKeyPair`, `BrocadeSwitchAuth`, `OPNSenseFirewallCredentials` | 3 |
|
||||
| `examples/opnsense/src/main.rs` | `OPNSenseFirewallCredentials` | 1 |
|
||||
| `examples/sttest/src/main.rs` + `topology.rs` | `SshKeyPair`, `OPNSenseFirewallConfig` | 2 |
|
||||
| `examples/opnsense_node_exporter/` | (has dep but unclear usage) | ~1 |
|
||||
| `examples/okd_cluster_alerts/` | (has dep but unclear usage) | ~1 |
|
||||
| `examples/brocade_snmp_server/` | (has dep but unclear usage) | ~1 |
|
||||
|
||||
## Tasks
|
||||
|
||||
### 2.1 Bootstrap `harmony_config` in CLI and TUI entry points
|
||||
|
||||
Add `harmony_config::init()` as the first thing that happens in `harmony_cli::run()` and `harmony_tui::run()`.
|
||||
|
||||
```rust
|
||||
// harmony_cli/src/lib.rs — inside run()
|
||||
pub async fn run<T: Topology + Send + Sync + 'static>(
|
||||
inventory: Inventory,
|
||||
topology: T,
|
||||
scores: Vec<Box<dyn Score<T>>>,
|
||||
args_struct: Option<Args>,
|
||||
) -> Result<(), Box<dyn std::error::Error>> {
|
||||
// Initialize config system with default source chain
|
||||
let sqlite = Arc::new(SqliteSource::default().await?);
|
||||
let env = Arc::new(EnvSource);
|
||||
harmony_config::init(vec![env, sqlite]).await;
|
||||
|
||||
// ... rest of run()
|
||||
}
|
||||
```
|
||||
|
||||
This replaces the implicit `SecretManager` lazy initialization that currently happens on first `get_or_prompt` call.
|
||||
|
||||
### 2.2 Migrate each secret type from `Secret` to `Config`
|
||||
|
||||
For each secret struct, change:
|
||||
|
||||
```rust
|
||||
// Before
|
||||
use harmony_secret::Secret;
|
||||
#[derive(Debug, Clone, Serialize, Deserialize, JsonSchema, InteractiveParse, Secret)]
|
||||
struct BrocadeSwitchAuth { ... }
|
||||
|
||||
// After
|
||||
use harmony_config::Config;
|
||||
#[derive(Debug, Clone, Serialize, Deserialize, JsonSchema, InteractiveParse, Config)]
|
||||
struct BrocadeSwitchAuth { ... }
|
||||
```
|
||||
|
||||
At each call site, change:
|
||||
|
||||
```rust
|
||||
// Before
|
||||
let config = SecretManager::get_or_prompt::<BrocadeSwitchAuth>().await.unwrap();
|
||||
|
||||
// After
|
||||
let config = harmony_config::get_or_prompt::<BrocadeSwitchAuth>().await.unwrap();
|
||||
```
|
||||
|
||||
### 2.3 Migration order (low risk to high risk)
|
||||
|
||||
1. **`brocade/examples/main.rs`** — 1 call site, isolated example, easy to test manually
|
||||
2. **`examples/opnsense/src/main.rs`** — 1 call site, isolated
|
||||
3. **`harmony/src/modules/brocade/brocade_snmp.rs`** — 2 call sites, core module but straightforward
|
||||
4. **`harmony/src/modules/nats/score_nats_k8s.rs`** — 1 call site
|
||||
5. **`harmony/src/modules/application/features/monitoring.rs`** — 1 call site
|
||||
6. **`examples/sttest/`** — 2 call sites, has both main.rs and topology.rs patterns
|
||||
7. **`examples/okd_installation/`** — 3 call sites, complex topology setup
|
||||
8. **`examples/okd_pxe/`** — 3 call sites, similar to okd_installation
|
||||
9. **`harmony/src/modules/okd/bootstrap_02_bootstrap.rs`** — 2 call sites, critical OKD bootstrap path
|
||||
|
||||
### 2.4 Remove `harmony_secret` from direct dependencies
|
||||
|
||||
After all call sites are migrated:
|
||||
|
||||
1. Remove `harmony_secret` from `Cargo.toml` of: `harmony`, `brocade`, and all examples that had it
|
||||
2. `harmony_config` keeps `harmony_secret` as a dependency (for `StoreSource`)
|
||||
3. The `Secret` trait and `SecretManager` remain in `harmony_secret` but are not used directly anymore
|
||||
|
||||
### 2.5 Backward compatibility for existing local secrets
|
||||
|
||||
Users who already have secrets stored via `LocalFileSecretStore` (JSON files in `~/.local/share/harmony/secrets/`) need a migration path:
|
||||
|
||||
- On first run after upgrade, if SQLite has no entry for a key but the old JSON file exists, read from JSON and write to SQLite
|
||||
- Or: add `LocalFileSource` as a fallback source at the end of the chain (read-only) for one release cycle
|
||||
- Log a deprecation warning when reading from old JSON files
|
||||
|
||||
## Deliverables
|
||||
|
||||
- [ ] `harmony_config::init()` called in `harmony_cli::run()` and `harmony_tui::run()`
|
||||
- [ ] All 19 call sites migrated from `SecretManager` to `harmony_config`
|
||||
- [ ] `harmony_secret` removed from direct dependencies of `harmony`, `brocade`, and all examples
|
||||
- [ ] Backward compatibility for existing local JSON secrets
|
||||
- [ ] All existing unit tests still pass
|
||||
- [ ] Manual verification: one migrated example works end-to-end (prompt → persist → read)
|
||||
141
ROADMAP/03-assets-crate.md
Normal file
141
ROADMAP/03-assets-crate.md
Normal file
@@ -0,0 +1,141 @@
|
||||
# Phase 3: Complete `harmony_assets`, Refactor Consumers
|
||||
|
||||
## Goal
|
||||
|
||||
Make `harmony_assets` the single way to manage downloadable binaries and images across Harmony. Eliminate `k3d::DownloadableAsset` duplication, implement `Url::Url` in OPNsense infra, remove LFS-tracked files from git.
|
||||
|
||||
## Current State
|
||||
|
||||
- `harmony_assets` exists with `Asset`, `LocalCache`, `LocalStore`, `S3Store` (behind feature flag). CLI with `upload`, `download`, `checksum`, `verify` commands. **No tests. Zero consumers.**
|
||||
- `k3d/src/downloadable_asset.rs` has the same functionality with full test coverage (httptest mock server, checksum verification, cache hit, 404 handling, checksum failure).
|
||||
- `Url::Url` variant in `harmony_types/src/net.rs` exists but is `todo!()` in OPNsense TFTP and HTTP infra layers.
|
||||
- OKD modules hardcode `./data/...` paths (`bootstrap_02_bootstrap.rs:84-88`, `ipxe.rs:73`).
|
||||
- `data/` directory contains ~3GB of LFS-tracked files (OKD binaries, PXE images, SCOS images).
|
||||
|
||||
## Tasks
|
||||
|
||||
### 3.1 Port k3d tests to `harmony_assets`
|
||||
|
||||
The k3d crate has 5 well-written tests in `downloadable_asset.rs`. Port them to test `harmony_assets::LocalStore`:
|
||||
|
||||
```rust
|
||||
// harmony_assets/tests/local_store.rs (or in src/ as unit tests)
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_fetch_downloads_and_verifies_checksum() {
|
||||
// Start httptest server serving a known file
|
||||
// Create Asset with URL pointing to mock server
|
||||
// Fetch via LocalStore
|
||||
// Assert file exists at expected cache path
|
||||
// Assert checksum matches
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_fetch_returns_cached_file_when_present() {
|
||||
// Pre-populate cache with correct file
|
||||
// Fetch — assert no HTTP request made (mock server not hit)
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_fetch_fails_on_404() { ... }
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_fetch_fails_on_checksum_mismatch() { ... }
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_fetch_with_progress_callback() {
|
||||
// Assert progress callback is called with (bytes_received, total_size)
|
||||
}
|
||||
```
|
||||
|
||||
Add `httptest` to `[dev-dependencies]` of `harmony_assets`.
|
||||
|
||||
### 3.2 Refactor `k3d` to use `harmony_assets`
|
||||
|
||||
Replace `k3d/src/downloadable_asset.rs` with calls to `harmony_assets`:
|
||||
|
||||
```rust
|
||||
// k3d/src/lib.rs — in download_latest_release()
|
||||
use harmony_assets::{Asset, LocalCache, LocalStore, ChecksumAlgo};
|
||||
|
||||
let asset = Asset::new(
|
||||
binary_url,
|
||||
checksum,
|
||||
ChecksumAlgo::SHA256,
|
||||
K3D_BIN_FILE_NAME.to_string(),
|
||||
);
|
||||
let cache = LocalCache::new(self.base_dir.clone());
|
||||
let store = LocalStore::new();
|
||||
let path = store.fetch(&asset, &cache, None).await
|
||||
.map_err(|e| format!("Failed to download k3d: {}", e))?;
|
||||
```
|
||||
|
||||
Delete `k3d/src/downloadable_asset.rs`. Update k3d's `Cargo.toml` to depend on `harmony_assets`.
|
||||
|
||||
### 3.3 Define asset metadata as config structs
|
||||
|
||||
Following `plan.md` Phase 2, create typed config for OKD assets using `harmony_config`:
|
||||
|
||||
```rust
|
||||
// harmony/src/modules/okd/config.rs
|
||||
#[derive(Config, Serialize, Deserialize, JsonSchema, InteractiveParse)]
|
||||
struct OkdInstallerConfig {
|
||||
pub openshift_install_url: String,
|
||||
pub openshift_install_sha256: String,
|
||||
pub scos_kernel_url: String,
|
||||
pub scos_kernel_sha256: String,
|
||||
pub scos_initramfs_url: String,
|
||||
pub scos_initramfs_sha256: String,
|
||||
pub scos_rootfs_url: String,
|
||||
pub scos_rootfs_sha256: String,
|
||||
}
|
||||
```
|
||||
|
||||
First run prompts for URLs/checksums (or uses compiled-in defaults). Values persist to SQLite. Can be overridden via env vars or OpenBao.
|
||||
|
||||
### 3.4 Implement `Url::Url` in OPNsense infra layer
|
||||
|
||||
In `harmony/src/infra/opnsense/http.rs` and `tftp.rs`, implement the `Url::Url(url)` match arm:
|
||||
|
||||
```rust
|
||||
// Instead of SCP-ing files to OPNsense:
|
||||
// SSH into OPNsense, run: fetch -o /usr/local/http/{path} {url}
|
||||
// (FreeBSD-native HTTP client, no extra deps on OPNsense)
|
||||
```
|
||||
|
||||
This eliminates the manual `scp` workaround and the `inquire::Confirm` prompts in `ipxe.rs:126` and `bootstrap_02_bootstrap.rs:230`.
|
||||
|
||||
### 3.5 Refactor OKD modules to use assets + config
|
||||
|
||||
In `bootstrap_02_bootstrap.rs`:
|
||||
- `openshift-install`: Resolve `OkdInstallerConfig` from `harmony_config`, download via `harmony_assets`, invoke from cache.
|
||||
- SCOS images: Pass `Url::Url(scos_kernel_url)` etc. to `StaticFilesHttpScore`. OPNsense fetches from S3 directly.
|
||||
- Remove `oc` and `kubectl` from `data/okd/bin/` (never used by code).
|
||||
|
||||
In `ipxe.rs`:
|
||||
- Replace the folder-to-serve SCP workaround with individual `Url::Url` entries.
|
||||
- Remove the `inquire::Confirm` SCP prompts.
|
||||
|
||||
### 3.6 Upload assets to S3
|
||||
|
||||
- Upload all current `data/` binaries to Ceph S3 bucket with path scheme: `harmony-assets/okd/v{version}/openshift-install`, `harmony-assets/pxe/centos-stream-9/install.img`, etc.
|
||||
- Set public-read ACL or configure presigned URL generation.
|
||||
- Record S3 URLs and SHA256 checksums as defaults in the config structs.
|
||||
|
||||
### 3.7 Remove LFS, clean git
|
||||
|
||||
- Remove all LFS-tracked files from the repo.
|
||||
- Update `.gitattributes` to remove LFS filters.
|
||||
- Keep `data/` in `.gitignore` (it becomes a local cache directory).
|
||||
- Optionally use `git filter-repo` or BFG to strip LFS objects from history (required before Phase 4 GitHub publish).
|
||||
|
||||
## Deliverables
|
||||
|
||||
- [ ] `harmony_assets` has tests ported from k3d pattern (5+ tests with httptest)
|
||||
- [ ] `k3d::DownloadableAsset` replaced by `harmony_assets` usage
|
||||
- [ ] `OkdInstallerConfig` struct using `harmony_config`
|
||||
- [ ] `Url::Url` implemented in OPNsense HTTP and TFTP infra
|
||||
- [ ] OKD bootstrap refactored to use lazy-download pattern
|
||||
- [ ] Assets uploaded to S3 with documented URLs/checksums
|
||||
- [ ] LFS removed, git history cleaned
|
||||
- [ ] Repo size small enough for GitHub (~code + templates only)
|
||||
110
ROADMAP/04-publish-github.md
Normal file
110
ROADMAP/04-publish-github.md
Normal file
@@ -0,0 +1,110 @@
|
||||
# Phase 4: Publish to GitHub
|
||||
|
||||
## Goal
|
||||
|
||||
Make Harmony publicly available on GitHub as the primary community hub for issues, pull requests, and discussions. CI runs on self-hosted runners.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- Phase 3 complete: LFS removed, git history cleaned, repo is small
|
||||
- README polished with quick-start, architecture overview, examples
|
||||
- All existing tests pass
|
||||
|
||||
## Tasks
|
||||
|
||||
### 4.1 Clean git history
|
||||
|
||||
```bash
|
||||
# Option A: git filter-repo (preferred)
|
||||
git filter-repo --strip-blobs-bigger-than 10M
|
||||
|
||||
# Option B: BFG Repo Cleaner
|
||||
bfg --strip-blobs-bigger-than 10M
|
||||
git reflog expire --expire=now --all
|
||||
git gc --prune=now --aggressive
|
||||
```
|
||||
|
||||
Verify final repo size is reasonable (target: <50MB including all code, docs, templates).
|
||||
|
||||
### 4.2 Create GitHub repository
|
||||
|
||||
- Create `NationTech/harmony` (or chosen org/name) on GitHub
|
||||
- Push cleaned repo as initial commit
|
||||
- Set default branch to `main` (rename from `master` if desired)
|
||||
|
||||
### 4.3 Set up CI on self-hosted runners
|
||||
|
||||
GitHub is the community hub, but CI runs on your own infrastructure. Options:
|
||||
|
||||
**Option A: GitHub Actions with self-hosted runners**
|
||||
- Register your Gitea runner machines as GitHub Actions self-hosted runners
|
||||
- Port `.gitea/workflows/check.yml` to `.github/workflows/check.yml`
|
||||
- Same Docker image (`hub.nationtech.io/harmony/harmony_composer:latest`), same commands
|
||||
- Pro: native GitHub PR checks, no external service needed
|
||||
- Con: runners need outbound access to GitHub API
|
||||
|
||||
**Option B: External CI (Woodpecker, Drone, Jenkins)**
|
||||
- Use any CI that supports webhooks from GitHub
|
||||
- Report status back to GitHub via commit status API / checks API
|
||||
- Pro: fully self-hosted, no GitHub dependency for builds
|
||||
- Con: extra integration work
|
||||
|
||||
**Option C: Keep Gitea CI, mirror from GitHub**
|
||||
- GitHub repo has a webhook that triggers Gitea CI on push
|
||||
- Gitea reports back to GitHub via commit status API
|
||||
- Pro: no migration of CI config
|
||||
- Con: fragile webhook chain
|
||||
|
||||
**Recommendation**: Option A. GitHub Actions self-hosted runners are straightforward and give the best contributor UX (native PR checks). The workflow files are nearly identical to Gitea workflows.
|
||||
|
||||
```yaml
|
||||
# .github/workflows/check.yml
|
||||
name: Check
|
||||
on: [push, pull_request]
|
||||
jobs:
|
||||
check:
|
||||
runs-on: self-hosted
|
||||
container:
|
||||
image: hub.nationtech.io/harmony/harmony_composer:latest
|
||||
steps:
|
||||
- uses: actions/checkout@v4
|
||||
- run: bash build/check.sh
|
||||
```
|
||||
|
||||
### 4.4 Polish documentation
|
||||
|
||||
- **README.md**: Quick-start (clone → run → get prompted → see result), architecture diagram (Score → Interpret → Topology), link to docs and examples
|
||||
- **CONTRIBUTING.md**: Already exists. Review for GitHub-specific guidance (fork workflow, PR template)
|
||||
- **docs/**: Already comprehensive. Verify links work on GitHub rendering
|
||||
- **Examples**: Ensure each example has a one-line description in its `Cargo.toml` and a comment block in `main.rs`
|
||||
|
||||
### 4.5 License and legal
|
||||
|
||||
- Verify workspace `license` field in root `Cargo.toml` is set correctly
|
||||
- Add `LICENSE` file at repo root if not present
|
||||
- Scan for any proprietary dependencies or hardcoded internal URLs
|
||||
|
||||
### 4.6 GitHub repository configuration
|
||||
|
||||
- Branch protection on `main`: require PR review, require CI to pass
|
||||
- Issue templates: bug report, feature request
|
||||
- PR template: checklist (tests pass, docs updated, etc.)
|
||||
- Topics/tags: `rust`, `infrastructure-as-code`, `kubernetes`, `orchestration`, `bare-metal`
|
||||
- Repository description: "Infrastructure orchestration framework. Declare what you want (Score), describe your infrastructure (Topology), let Harmony figure out how."
|
||||
|
||||
### 4.7 Gitea as internal mirror
|
||||
|
||||
- Set up Gitea to mirror from GitHub (pull mirror)
|
||||
- Internal CI can continue running on Gitea for private/experimental branches
|
||||
- Public contributions flow through GitHub
|
||||
|
||||
## Deliverables
|
||||
|
||||
- [ ] Git history cleaned, repo size <50MB
|
||||
- [ ] Public GitHub repository created
|
||||
- [ ] CI running on self-hosted runners with GitHub Actions
|
||||
- [ ] Branch protection enabled
|
||||
- [ ] README polished with quick-start guide
|
||||
- [ ] Issue and PR templates created
|
||||
- [ ] LICENSE file present
|
||||
- [ ] Gitea configured as mirror
|
||||
255
ROADMAP/05-e2e-tests-simple.md
Normal file
255
ROADMAP/05-e2e-tests-simple.md
Normal file
@@ -0,0 +1,255 @@
|
||||
# Phase 5: E2E Tests for PostgreSQL & RustFS
|
||||
|
||||
## Goal
|
||||
|
||||
Establish an automated E2E test pipeline that proves working examples actually work. Start with the two simplest k8s-based examples: PostgreSQL and RustFS.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- Phase 1 complete (config crate works, bootstrap is clean)
|
||||
- `feat/rustfs` branch merged
|
||||
|
||||
## Architecture
|
||||
|
||||
### Test harness: `tests/e2e/`
|
||||
|
||||
A dedicated workspace member crate at `tests/e2e/` that contains:
|
||||
|
||||
1. **Shared k3d utilities** — create/destroy clusters, wait for readiness
|
||||
2. **Per-example test modules** — each example gets a `#[tokio::test]` function
|
||||
3. **Assertion helpers** — wait for pods, check CRDs exist, verify services
|
||||
|
||||
```
|
||||
tests/
|
||||
e2e/
|
||||
Cargo.toml
|
||||
src/
|
||||
lib.rs # Shared test utilities
|
||||
k3d.rs # k3d cluster lifecycle
|
||||
k8s_assert.rs # K8s assertion helpers
|
||||
tests/
|
||||
postgresql.rs # PostgreSQL E2E test
|
||||
rustfs.rs # RustFS E2E test
|
||||
```
|
||||
|
||||
### k3d cluster lifecycle
|
||||
|
||||
```rust
|
||||
// tests/e2e/src/k3d.rs
|
||||
use k3d_rs::K3d;
|
||||
|
||||
pub struct TestCluster {
|
||||
pub name: String,
|
||||
pub k3d: K3d,
|
||||
pub client: kube::Client,
|
||||
reuse: bool,
|
||||
}
|
||||
|
||||
impl TestCluster {
|
||||
/// Creates a k3d cluster for testing.
|
||||
/// If HARMONY_E2E_REUSE_CLUSTER=1, reuses existing cluster.
|
||||
pub async fn ensure(name: &str) -> Result<Self, String> {
|
||||
let reuse = std::env::var("HARMONY_E2E_REUSE_CLUSTER")
|
||||
.map(|v| v == "1")
|
||||
.unwrap_or(false);
|
||||
|
||||
let base_dir = PathBuf::from("/tmp/harmony-e2e");
|
||||
let k3d = K3d::new(base_dir, Some(name.to_string()));
|
||||
|
||||
let client = k3d.ensure_installed().await?;
|
||||
|
||||
Ok(Self { name: name.to_string(), k3d, client, reuse })
|
||||
}
|
||||
|
||||
/// Returns the kubeconfig path for this cluster.
|
||||
pub fn kubeconfig_path(&self) -> String { ... }
|
||||
}
|
||||
|
||||
impl Drop for TestCluster {
|
||||
fn drop(&mut self) {
|
||||
if !self.reuse {
|
||||
// Best-effort cleanup
|
||||
let _ = self.k3d.run_k3d_command(["cluster", "delete", &self.name]);
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### K8s assertion helpers
|
||||
|
||||
```rust
|
||||
// tests/e2e/src/k8s_assert.rs
|
||||
|
||||
/// Wait until a pod matching the label selector is Running in the namespace.
|
||||
/// Times out after `timeout` duration.
|
||||
pub async fn wait_for_pod_running(
|
||||
client: &kube::Client,
|
||||
namespace: &str,
|
||||
label_selector: &str,
|
||||
timeout: Duration,
|
||||
) -> Result<(), String>
|
||||
|
||||
/// Assert a CRD instance exists.
|
||||
pub async fn assert_resource_exists<K: kube::Resource>(
|
||||
client: &kube::Client,
|
||||
name: &str,
|
||||
namespace: Option<&str>,
|
||||
) -> Result<(), String>
|
||||
|
||||
/// Install a Helm chart. Returns when all pods in the release are running.
|
||||
pub async fn helm_install(
|
||||
release_name: &str,
|
||||
chart: &str,
|
||||
namespace: &str,
|
||||
repo_url: Option<&str>,
|
||||
timeout: Duration,
|
||||
) -> Result<(), String>
|
||||
```
|
||||
|
||||
## Tasks
|
||||
|
||||
### 5.1 Create the `tests/e2e/` crate
|
||||
|
||||
Add to workspace `Cargo.toml`:
|
||||
|
||||
```toml
|
||||
[workspace]
|
||||
members = [
|
||||
# ... existing members
|
||||
"tests/e2e",
|
||||
]
|
||||
```
|
||||
|
||||
`tests/e2e/Cargo.toml`:
|
||||
|
||||
```toml
|
||||
[package]
|
||||
name = "harmony-e2e-tests"
|
||||
edition = "2024"
|
||||
publish = false
|
||||
|
||||
[dependencies]
|
||||
harmony = { path = "../../harmony" }
|
||||
harmony_cli = { path = "../../harmony_cli" }
|
||||
harmony_types = { path = "../../harmony_types" }
|
||||
k3d_rs = { path = "../../k3d", package = "k3d_rs" }
|
||||
kube = { workspace = true }
|
||||
k8s-openapi = { workspace = true }
|
||||
tokio = { workspace = true }
|
||||
log = { workspace = true }
|
||||
env_logger = { workspace = true }
|
||||
|
||||
[dev-dependencies]
|
||||
pretty_assertions = { workspace = true }
|
||||
```
|
||||
|
||||
### 5.2 PostgreSQL E2E test
|
||||
|
||||
```rust
|
||||
// tests/e2e/tests/postgresql.rs
|
||||
use harmony::modules::postgresql::{PostgreSQLScore, capability::PostgreSQLConfig};
|
||||
use harmony::topology::K8sAnywhereTopology;
|
||||
use harmony::inventory::Inventory;
|
||||
use harmony::maestro::Maestro;
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_postgresql_deploys_on_k3d() {
|
||||
let cluster = TestCluster::ensure("harmony-e2e-pg").await.unwrap();
|
||||
|
||||
// Install CNPG operator via Helm
|
||||
// (K8sAnywhereTopology::ensure_ready() now handles this since
|
||||
// commit e1183ef "K8s postgresql score now ensures cnpg is installed")
|
||||
// But we may need the Helm chart for non-OKD:
|
||||
helm_install(
|
||||
"cnpg",
|
||||
"cloudnative-pg",
|
||||
"cnpg-system",
|
||||
Some("https://cloudnative-pg.github.io/charts"),
|
||||
Duration::from_secs(120),
|
||||
).await.unwrap();
|
||||
|
||||
// Configure topology pointing to test cluster
|
||||
let config = K8sAnywhereConfig {
|
||||
kubeconfig: Some(cluster.kubeconfig_path()),
|
||||
use_local_k3d: false,
|
||||
autoinstall: false,
|
||||
use_system_kubeconfig: false,
|
||||
harmony_profile: "dev".to_string(),
|
||||
k8s_context: None,
|
||||
};
|
||||
let topology = K8sAnywhereTopology::with_config(config);
|
||||
|
||||
// Create and run the score
|
||||
let score = PostgreSQLScore {
|
||||
config: PostgreSQLConfig {
|
||||
cluster_name: "e2e-test-pg".to_string(),
|
||||
namespace: "e2e-pg-test".to_string(),
|
||||
..Default::default()
|
||||
},
|
||||
};
|
||||
|
||||
let mut maestro = Maestro::initialize(Inventory::autoload(), topology).await.unwrap();
|
||||
maestro.register_all(vec![Box::new(score)]);
|
||||
|
||||
let scores = maestro.scores().read().unwrap().first().unwrap().clone_box();
|
||||
let result = maestro.interpret(scores).await;
|
||||
assert!(result.is_ok(), "PostgreSQL score failed: {:?}", result.err());
|
||||
|
||||
// Assert: CNPG Cluster resource exists
|
||||
// (the Cluster CRD is applied — pod readiness may take longer)
|
||||
let client = cluster.client.clone();
|
||||
// ... assert Cluster CRD exists in e2e-pg-test namespace
|
||||
}
|
||||
```
|
||||
|
||||
### 5.3 RustFS E2E test
|
||||
|
||||
Similar structure. Details depend on what the RustFS score deploys (likely a Helm chart or k8s resources for MinIO/RustFS).
|
||||
|
||||
```rust
|
||||
#[tokio::test]
|
||||
async fn test_rustfs_deploys_on_k3d() {
|
||||
let cluster = TestCluster::ensure("harmony-e2e-rustfs").await.unwrap();
|
||||
// ... similar pattern: configure topology, create score, interpret, assert
|
||||
}
|
||||
```
|
||||
|
||||
### 5.4 CI job for E2E tests
|
||||
|
||||
New workflow file (Gitea or GitHub Actions):
|
||||
|
||||
```yaml
|
||||
# .gitea/workflows/e2e.yml (or .github/workflows/e2e.yml)
|
||||
name: E2E Tests
|
||||
on:
|
||||
push:
|
||||
branches: [master, main]
|
||||
# Don't run on every PR — too slow. Run on label or manual trigger.
|
||||
workflow_dispatch:
|
||||
|
||||
jobs:
|
||||
e2e:
|
||||
runs-on: self-hosted # Must have Docker available for k3d
|
||||
timeout-minutes: 15
|
||||
steps:
|
||||
- uses: actions/checkout@v4
|
||||
|
||||
- name: Install k3d
|
||||
run: curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
|
||||
|
||||
- name: Run E2E tests
|
||||
run: cargo test -p harmony-e2e-tests -- --test-threads=1
|
||||
env:
|
||||
RUST_LOG: info
|
||||
```
|
||||
|
||||
Note `--test-threads=1`: E2E tests create k3d clusters and should not run in parallel (port conflicts, resource contention).
|
||||
|
||||
## Deliverables
|
||||
|
||||
- [ ] `tests/e2e/` crate added to workspace
|
||||
- [ ] Shared test utilities: `TestCluster`, `wait_for_pod_running`, `helm_install`
|
||||
- [ ] PostgreSQL E2E test passing
|
||||
- [ ] RustFS E2E test passing (after `feat/rustfs` merge)
|
||||
- [ ] CI job running E2E tests on push to main
|
||||
- [ ] `HARMONY_E2E_REUSE_CLUSTER=1` for fast local iteration
|
||||
214
ROADMAP/06-e2e-tests-kvm.md
Normal file
214
ROADMAP/06-e2e-tests-kvm.md
Normal file
@@ -0,0 +1,214 @@
|
||||
# Phase 6: E2E Tests for OKD HA Cluster on KVM
|
||||
|
||||
## Goal
|
||||
|
||||
Prove the full OKD bare-metal installation flow works end-to-end using KVM virtual machines. This is the ultimate validation of Harmony's core value proposition: declare an OKD cluster, point it at infrastructure, watch it materialize.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- Phase 5 complete (test harness exists, k3d tests passing)
|
||||
- `feature/kvm-module` merged to main
|
||||
- A CI runner with libvirt/KVM access and nested virtualization support
|
||||
|
||||
## Architecture
|
||||
|
||||
The KVM branch already has a `kvm_okd_ha_cluster` example that creates:
|
||||
|
||||
```
|
||||
Host bridge (WAN)
|
||||
|
|
||||
+--------------------+
|
||||
| OPNsense | 192.168.100.1
|
||||
| gateway + PXE |
|
||||
+--------+-----------+
|
||||
|
|
||||
harmonylan (192.168.100.0/24)
|
||||
+---------+---------+---------+---------+
|
||||
| | | | |
|
||||
+----+---+ +---+---+ +---+---+ +---+---+ +--+----+
|
||||
| cp0 | | cp1 | | cp2 | |worker0| |worker1|
|
||||
| .10 | | .11 | | .12 | | .20 | | .21 |
|
||||
+--------+ +-------+ +-------+ +-------+ +---+---+
|
||||
|
|
||||
+-----+----+
|
||||
| worker2 |
|
||||
| .22 |
|
||||
+----------+
|
||||
```
|
||||
|
||||
The test needs to orchestrate this entire setup, wait for OKD to converge, and assert the cluster is healthy.
|
||||
|
||||
## Tasks
|
||||
|
||||
### 6.1 Start with `example_linux_vm` — the simplest KVM test
|
||||
|
||||
Before tackling the full OKD stack, validate the KVM module itself with the simplest possible test:
|
||||
|
||||
```rust
|
||||
// tests/e2e/tests/kvm_linux_vm.rs
|
||||
|
||||
#[tokio::test]
|
||||
#[ignore] // Requires libvirt access — run with: cargo test -- --ignored
|
||||
async fn test_linux_vm_boots_from_iso() {
|
||||
let executor = KvmExecutor::from_env().unwrap();
|
||||
|
||||
// Create isolated network
|
||||
let network = NetworkConfig {
|
||||
name: "e2e-test-net".to_string(),
|
||||
bridge: "virbr200".to_string(),
|
||||
// ...
|
||||
};
|
||||
executor.ensure_network(&network).await.unwrap();
|
||||
|
||||
// Define and start VM
|
||||
let vm_config = VmConfig::builder("e2e-linux-test")
|
||||
.vcpus(1)
|
||||
.memory_gb(1)
|
||||
.disk(5)
|
||||
.network(NetworkRef::named("e2e-test-net"))
|
||||
.cdrom("https://releases.ubuntu.com/24.04/ubuntu-24.04-live-server-amd64.iso")
|
||||
.boot_order([BootDevice::Cdrom, BootDevice::Disk])
|
||||
.build();
|
||||
|
||||
executor.ensure_vm(&vm_config).await.unwrap();
|
||||
executor.start_vm("e2e-linux-test").await.unwrap();
|
||||
|
||||
// Assert VM is running
|
||||
let status = executor.vm_status("e2e-linux-test").await.unwrap();
|
||||
assert_eq!(status, VmStatus::Running);
|
||||
|
||||
// Cleanup
|
||||
executor.destroy_vm("e2e-linux-test").await.unwrap();
|
||||
executor.undefine_vm("e2e-linux-test").await.unwrap();
|
||||
executor.delete_network("e2e-test-net").await.unwrap();
|
||||
}
|
||||
```
|
||||
|
||||
This test validates:
|
||||
- ISO download works (via `harmony_assets` if refactored, or built-in KVM module download)
|
||||
- libvirt XML generation is correct
|
||||
- VM lifecycle (define → start → status → destroy → undefine)
|
||||
- Network creation/deletion
|
||||
|
||||
### 6.2 OKD HA Cluster E2E test
|
||||
|
||||
The full integration test. This is long-running (30-60 minutes) and should only run nightly or on-demand.
|
||||
|
||||
```rust
|
||||
// tests/e2e/tests/kvm_okd_ha.rs
|
||||
|
||||
#[tokio::test]
|
||||
#[ignore] // Requires KVM + significant resources. Run nightly.
|
||||
async fn test_okd_ha_cluster_on_kvm() {
|
||||
// 1. Create virtual infrastructure
|
||||
// - OPNsense gateway VM
|
||||
// - 3 control plane VMs
|
||||
// - 3 worker VMs
|
||||
// - Virtual network (harmonylan)
|
||||
|
||||
// 2. Run OKD installation scores
|
||||
// (the kvm_okd_ha_cluster example, but as a test)
|
||||
|
||||
// 3. Wait for OKD API server to become reachable
|
||||
// - Poll https://api.okd.harmonylan:6443 until it responds
|
||||
// - Timeout: 30 minutes
|
||||
|
||||
// 4. Assert cluster health
|
||||
// - All nodes in Ready state
|
||||
// - ClusterVersion reports Available=True
|
||||
// - Sample workload (nginx) deploys and pod reaches Running
|
||||
|
||||
// 5. Cleanup
|
||||
// - Destroy all VMs
|
||||
// - Delete virtual networks
|
||||
// - Clean up disk images
|
||||
}
|
||||
```
|
||||
|
||||
### 6.3 CI runner requirements
|
||||
|
||||
The KVM E2E test needs a runner with:
|
||||
|
||||
- **Hardware**: 32GB+ RAM, 8+ CPU cores, 100GB+ disk
|
||||
- **Software**: libvirt, QEMU/KVM, `virsh`, nested virtualization enabled
|
||||
- **Network**: Outbound internet access (to download ISOs, OKD images)
|
||||
- **Permissions**: User in `libvirt` group, or root access
|
||||
|
||||
Options:
|
||||
- **Dedicated bare-metal machine** registered as a self-hosted GitHub Actions runner
|
||||
- **Cloud VM with nested virt** (e.g., GCP n2-standard-8 with `--enable-nested-virtualization`)
|
||||
- **Manual trigger only** — developer runs locally, CI just tracks pass/fail
|
||||
|
||||
### 6.4 Nightly CI job
|
||||
|
||||
```yaml
|
||||
# .github/workflows/e2e-kvm.yml
|
||||
name: E2E KVM Tests
|
||||
on:
|
||||
schedule:
|
||||
- cron: '0 2 * * *' # 2 AM daily
|
||||
workflow_dispatch: # Manual trigger
|
||||
|
||||
jobs:
|
||||
kvm-tests:
|
||||
runs-on: [self-hosted, kvm] # Label for KVM-capable runners
|
||||
timeout-minutes: 90
|
||||
steps:
|
||||
- uses: actions/checkout@v4
|
||||
|
||||
- name: Run KVM E2E tests
|
||||
run: cargo test -p harmony-e2e-tests -- --ignored --test-threads=1
|
||||
env:
|
||||
RUST_LOG: info
|
||||
HARMONY_KVM_URI: qemu:///system
|
||||
|
||||
- name: Cleanup VMs on failure
|
||||
if: failure()
|
||||
run: |
|
||||
virsh list --all --name | grep e2e | xargs -I {} virsh destroy {} || true
|
||||
virsh list --all --name | grep e2e | xargs -I {} virsh undefine {} --remove-all-storage || true
|
||||
```
|
||||
|
||||
### 6.5 Test resource management
|
||||
|
||||
KVM tests create real resources that must be cleaned up even on failure. Implement a test fixture pattern:
|
||||
|
||||
```rust
|
||||
struct KvmTestFixture {
|
||||
executor: KvmExecutor,
|
||||
vms: Vec<String>,
|
||||
networks: Vec<String>,
|
||||
}
|
||||
|
||||
impl KvmTestFixture {
|
||||
fn track_vm(&mut self, name: &str) { self.vms.push(name.to_string()); }
|
||||
fn track_network(&mut self, name: &str) { self.networks.push(name.to_string()); }
|
||||
}
|
||||
|
||||
impl Drop for KvmTestFixture {
|
||||
fn drop(&mut self) {
|
||||
// Best-effort cleanup of all tracked resources
|
||||
for vm in &self.vms {
|
||||
let _ = std::process::Command::new("virsh")
|
||||
.args(["destroy", vm]).output();
|
||||
let _ = std::process::Command::new("virsh")
|
||||
.args(["undefine", vm, "--remove-all-storage"]).output();
|
||||
}
|
||||
for net in &self.networks {
|
||||
let _ = std::process::Command::new("virsh")
|
||||
.args(["net-destroy", net]).output();
|
||||
let _ = std::process::Command::new("virsh")
|
||||
.args(["net-undefine", net]).output();
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Deliverables
|
||||
|
||||
- [ ] `test_linux_vm_boots_from_iso` — passing KVM smoke test
|
||||
- [ ] `test_okd_ha_cluster_on_kvm` — full OKD installation test
|
||||
- [ ] `KvmTestFixture` with resource cleanup on test failure
|
||||
- [ ] Nightly CI job on KVM-capable runner
|
||||
- [ ] Force-cleanup script for leaked VMs/networks
|
||||
- [ ] Documentation: how to set up a KVM runner for E2E tests
|
||||
238
docs/adr/017-2-reviewed-staleness-detection-algorithm.md
Normal file
238
docs/adr/017-2-reviewed-staleness-detection-algorithm.md
Normal file
@@ -0,0 +1,238 @@
|
||||
Here are some rough notes on the previous design :
|
||||
|
||||
- We found an issue where there could be primary flapping when network latency is larger than the primary self fencing timeout.
|
||||
- e.g. network latency to get nats ack is 30 seconds (extreme but can happen), and self-fencing happens after 50 seconds. Then at second 50 self-fencing would occur, and then at second 60 ack comes in. At this point we reject the ack as already failed because of timeout. Self-fencing happens. But then network latency comes back down to 5 seconds and lets one successful heartbeat through. This means the primary comes back to healthy, and the same thing repeats, so the primary flaps.
|
||||
- At least this does not cause split brain since the replica never times out and wins the leadership write since we validate strict write ordering and we force consensus on writes.
|
||||
|
||||
Also, we were seeing that the implementation became more complex. There are a lot of timers to handle, and that becomes hard to reason about for edge cases.
|
||||
|
||||
So, we came up with a slightly different approach, inspired by k8s liveness probes.
|
||||
|
||||
We now want to use a failure and success threshold counter. However, on the replica side, all we can do is use a timer. The timer we can use is the time since the last primary heartbeat's JetStream metadata timestamp. We could also try and mitigate clock skew by measuring time between internal clock and jetstream metadata timestamp when writing our own heartbeat (not for now, but worth thinking about, though I feel like it is useless).
|
||||
|
||||
So the current working design is this :
|
||||
|
||||
configure :
|
||||
- number of consecutive successes to mark the node as UP
|
||||
- number of consecutive failures to mark the node as DOWN
|
||||
- note that success/failure must be consecutive. One success in a row of failures is enough to keep the service up. This allows for various configuration profiles, from very strict availability to very lenient depending on the number of failures tolerated and successes required to keep the service up.
|
||||
- failure_threshold at 100 will let a service fail (or time out) 99 times out of 100 and stay up
|
||||
- success_threshold at 100 will not bring back up a service until it has succeeded 100 heartbeats in a row
|
||||
- failure threshold at 1 will fail the service at the slightest network latency spike/packet loss
|
||||
- success threshold at 1 will bring the service up very quickly and may cause flapping in unstable network conditions
|
||||
|
||||
|
||||
```
|
||||
# heartbeat session log
|
||||
# failure threshold : 3
|
||||
# success threshold : 2
|
||||
|
||||
STATUS UP :
|
||||
t=1 probe : fail f=1 s=0
|
||||
t=2 probe : fail f=2 s=0
|
||||
t=3 probe : ok f=0 s=1
|
||||
t=4 probe : fail f=1 s=0
|
||||
```
|
||||
|
||||
Scenario :
|
||||
|
||||
failure threshold = 2
|
||||
heartbeat timeout = 1s
|
||||
total before fencing = 2 * 1 = 2s
|
||||
|
||||
staleness detection timer = 2*total before fencing
|
||||
|
||||
Can we use this simple multiplication, so that the staleness detection timer (the time the replica waits since the last primary heartbeat before promoting itself) is double the time the primary will take before starting the fencing process?
|
||||
|
||||
---
|
||||
|
||||
### Context
|
||||
We are designing a **Staleness-Based Failover Algorithm** for the Harmony Agent. The goal is to manage High Availability (HA) for stateful workloads (like PostgreSQL) across decentralized, variable-quality networks ("Micro Data Centers").
|
||||
|
||||
We are moving away from complex, synchronized clocks in favor of a **Counter-Based Liveness** approach (inspired by Kubernetes probes) for the Primary, and a **Time-Based Watchdog** for the Replica.
|
||||
|
||||
### 1. The Algorithm
|
||||
|
||||
#### The Primary (Self-Health & Fencing)
|
||||
The Primary validates its own "License to Operate" via a heartbeat loop.
|
||||
* **Loop:** Every `heartbeat_interval` (e.g., 1s), it attempts to write a heartbeat to NATS and check the local DB.
|
||||
* **Counters:** It maintains `consecutive_failures` and `consecutive_successes`.
|
||||
* **State Transition:**
|
||||
* **To UNHEALTHY:** If `consecutive_failures >= failure_threshold`, the Primary **Fences Self** (stops DB, releases locks).
|
||||
* **To HEALTHY:** If `consecutive_successes >= success_threshold`, the Primary **Un-fences** (starts DB, acquires locks).
|
||||
* **Reset Logic:** A single success resets the failure counter to 0, and vice versa.
|
||||
|
||||
#### The Replica (Staleness Detection)
|
||||
The Replica acts as a passive watchdog observing the NATS stream.
|
||||
* **Calculation:** It calculates a `MaxStaleness` timeout.
|
||||
$$ \text{MaxStaleness} = (\text{failure\_threshold} \times \text{heartbeat\_interval}) \times \text{SafetyMultiplier} $$
|
||||
*(We use a SafetyMultiplier of 2 to ensure the Primary has definitely fenced itself before we take over).*
|
||||
* **Action:** If `Time.now() - LastPrimaryHeartbeat > MaxStaleness`, the Replica assumes the Primary is dead and **Promotes Self**.
|
||||
|
||||
---
|
||||
|
||||
### 2. Configuration Trade-offs
|
||||
|
||||
The separation of `success` and `failure` thresholds allows us to tune the "personality" of the cluster.
|
||||
|
||||
#### Scenario A: The "Nervous" Cluster (High Sensitivity)
|
||||
* **Config:** `failure_threshold: 1`, `success_threshold: 1`
|
||||
* **Behavior:** Fails over immediately upon a single missed packet or slow disk write.
|
||||
* **Pros:** Maximum availability for perfect networks.
|
||||
* **Cons:** **High Flapping Risk.** In a residential network, a microwave turning on might cause a failover.
|
||||
|
||||
#### Scenario B: The "Tank" Cluster (High Stability)
|
||||
* **Config:** `failure_threshold: 10`, `success_threshold: 1`
|
||||
* **Behavior:** The node must be consistently broken for 10 seconds (assuming 1s interval) to give up.
|
||||
* **Pros:** Extremely stable on bad networks (e.g., Starlink, 4G). Ignores transient spikes.
|
||||
* **Cons:** **Slow Failover.** Users experience 10+ seconds of downtime before the Replica even *thinks* about taking over.
|
||||
|
||||
#### Scenario C: The "Sticky" Cluster (Hysteresis)
|
||||
* **Config:** `failure_threshold: 5`, `success_threshold: 5`
|
||||
* **Behavior:** Hard to kill, hard to bring back.
|
||||
* **Pros:** Prevents "Yo-Yo" effects. If a node fails, it must prove it is *really* stable (5 clean checks in a row) before re-joining the cluster.
|
||||
|
||||
---
|
||||
|
||||
### 3. Failure Modes & Behavior Analysis
|
||||
|
||||
Here is how the algorithm handles specific edge cases:
|
||||
|
||||
#### Case 1: Immediate Outage (Power Cut / Kernel Panic)
|
||||
* **Event:** Primary vanishes instantly. No more writes to NATS.
|
||||
* **Primary:** Does nothing (it's dead).
|
||||
* **Replica:** Sees the `LastPrimaryHeartbeat` timestamp age. Once it crosses `MaxStaleness`, it promotes itself.
|
||||
* **Outcome:** Clean failover after the timeout duration.
|
||||
|
||||
#### Case 2: Network Instability (Packet Loss / Jitter)
|
||||
* **Event:** The Primary fails to write to NATS for 2 cycles due to Wi-Fi interference, then succeeds on the 3rd.
|
||||
* **Config:** `failure_threshold: 5`.
|
||||
* **Primary:**
|
||||
* $t=1$: Fail (Counter=1)
|
||||
* $t=2$: Fail (Counter=2)
|
||||
* $t=3$: Success (Counter resets to 0). **State remains HEALTHY.**
|
||||
* **Replica:** Sees a gap in heartbeats but the timestamp never exceeds `MaxStaleness`.
|
||||
* **Outcome:** No downtime, no failover. The system correctly identified this as noise, not failure.
|
||||
|
||||
#### Case 3: High Latency (The "Slow Death")
|
||||
* **Event:** Primary is under heavy load; heartbeats take 1.5s to complete (interval is 1s).
|
||||
* **Primary:** The `timeout` on the heartbeat logic triggers. `consecutive_failures` rises. Eventually, it hits `failure_threshold` and fences itself to prevent data corruption.
|
||||
* **Replica:** Sees the heartbeats stop (or arrive too late). The timestamp ages out.
|
||||
* **Outcome:** Primary fences self -> Replica waits for safety buffer -> Replica promotes. **Split-brain is avoided** because the Primary killed itself *before* the Replica acted (due to the SafetyMultiplier).
|
||||
|
||||
#### Case 4: Replica Network Partition
|
||||
* **Event:** Replica loses internet connection; Primary is fine.
|
||||
* **Replica:** Sees `LastPrimaryHeartbeat` age out (because it can't reach NATS). It *wants* to promote itself.
|
||||
* **Constraint:** To promote, the Replica must write to NATS. Since it is partitioned, the NATS write fails.
|
||||
* **Outcome:** The Replica remains in Standby (or fails to promote). The Primary continues serving traffic. **Cluster integrity is preserved.**
|
||||
|
||||
|
||||
----
|
||||
|
||||
|
||||
### Context & Use Case
|
||||
We are implementing a High Availability (HA) Failover Strategy for decentralized "Micro Data Centers." The core challenge is managing stateful workloads (PostgreSQL) over unreliable networks.
|
||||
|
||||
We solve this using a **Local Fencing First** approach, backed by **NATS JetStream Strict Ordering** for the final promotion authority.
|
||||
|
||||
In CAP theorem terms, we are developing a CP system, intentionally sacrificing availability. In practical terms, we expect an average of two primary outages per year, with a failover delay of around 2 minutes. This translates to an uptime of over five nines. To be precise, 2 outages * 2 minutes = 4 minutes per year = 99.99924% uptime.
|
||||
|
||||
### The Algorithm: Local Fencing & Remote Promotion
|
||||
|
||||
The safety (data consistency) of the system relies on the time gap between the **Primary giving up (Fencing)** and the **Replica taking over (Promotion)**.
|
||||
|
||||
To avoid clock skew issues between agents and datastore (nats), all timestamp comparisons will be done using jetstream metadata. I.e., a harmony agent will never use `Instant::now()` to get a timestamp, it will use `my_last_heartbeat.metadata.timestamp` (conceptually).
|
||||
|
||||
#### 1. Configuration
|
||||
* `heartbeat_timeout` (e.g., 1s): Max time allowed for a NATS write/DB check.
|
||||
* `failure_threshold` (e.g., 2): Consecutive failures before self-fencing.
|
||||
* `failover_timeout` (e.g., 5s): Time since last NATS update of Primary heartbeat before Replica promotes.
|
||||
* This timeout must be carefully configured to allow enough time for the primary to fence itself (after `heartbeat_timeout * failure_threshold`) BEFORE the replica gets promoted to avoid a split brain with two primaries.
|
||||
* Implementing this will rely on the actual deployment configuration. For example, a CNPG based PostgreSQL cluster might require a longer gap (such as 30s) than other technologies.
|
||||
* Expires when `replica_heartbeat.metadata.timestamp - primary_heartbeat.metadata.timestamp > failover_timeout`
|
||||
|
||||
#### 2. The Primary (Self-Preservation)
|
||||
|
||||
The Primary is aggressive about killing itself.
|
||||
|
||||
* It attempts a heartbeat.
|
||||
* If the network latency > `heartbeat_timeout`, the attempt is **cancelled locally** because the heartbeat did not make it back in time.
|
||||
* This counts as a failure and increments the `consecutive_failures` counter.
|
||||
* If `consecutive_failures` hit the threshold, **FENCING occurs immediately**. The database is stopped.
|
||||
|
||||
This means that the Primary will fence itself after `heartbeat_timeout * failure_threshold`.
|
||||
|
||||
#### 3. The Replica (The Watchdog)
|
||||
|
||||
The Replica is patient.
|
||||
|
||||
* It watches the NATS stream to measure if `replica_heartbeat.metadata.timestamp - primary_heartbeat.metadata.timestamp > failover_timeout`
|
||||
* It only attempts promotion if the `failover_timeout` (5s) has passed.
|
||||
* **Crucial:** Careful configuration of the failover_timeout is required. This is the only way to avoid a split brain in case of a network partition where the Primary cannot write its heartbeats in time anymore.
|
||||
* In short, `failover_timeout` should be tuned to be `heartbeat_timeout * failure_threshold + safety_margin`. This `safety_margin` will vary by use case. For example, a CNPG cluster may need 30 seconds to demote a Primary to Replica when fencing is triggered, so `safety_margin` should be at least 30s in that setup.
|
||||
|
||||
Since we forcibly fail timeouts after `heartbeat_timeout`, we are guaranteed that the primary will have **started** the fencing process after `heartbeat_timeout * failure_threshold`.
|
||||
|
||||
But, in a network split scenario where the failed primary is still accessible by clients but cannot write its heartbeat successfully, there is no way to know if the demotion has actually **completed**.
|
||||
|
||||
For example, in a CNPG cluster, the failed Primary agent will attempt to change the CNPG cluster state to read-only. But if anything fails after that attempt (permission error, k8s api failure, CNPG bug, etc) it is possible that the PostgreSQL instance keeps accepting writes.
|
||||
|
||||
While this is not a theoretical failure of the agent's algorithm, this is a practical failure where data corruption occurs.
|
||||
|
||||
This can be fixed by detecting the demotion failure and escalating the fencing procedure aggressiveness. Harmony being an infrastructure orchestrator, it can easily exert radical measures if given the proper credentials, such as forcibly powering off a server, disconnecting its network in the switch configuration, forcibly kill a pod/container/process, etc.
|
||||
|
||||
However, these details are out of scope of this algorithm, as they simply fall under the "fencing procedure".
|
||||
|
||||
The implementation of the fencing procedure itself is not relevant. This algorithm's responsibility stops at calling the fencing procedure in the appropriate situation.
|
||||
|
||||
#### 4. The Demotion Handshake (Return to Normalcy)
|
||||
|
||||
When the original Primary recovers:
|
||||
|
||||
1. It becomes healthy locally but sees `current_primary = Replica`. It waits.
|
||||
2. The Replica (current leader) detects the Original Primary is back (via NATS heartbeats).
|
||||
3. Replica performs a **Clean Demotion**:
|
||||
* Stops DB.
|
||||
* Writes `current_primary = None` to NATS.
|
||||
4. Original Primary sees `current_primary = None` and can launch the promotion procedure.
|
||||
|
||||
Depending on the implementation, the promotion procedure may require a transition phase. Typically, for a PostgreSQL use case the promoting primary will make sure it has caught up on WAL replication before starting to accept writes.
|
||||
|
||||
---
|
||||
|
||||
### Failure Modes & Behavior Analysis
|
||||
|
||||
#### Case 1: Immediate Outage (Power Cut)
|
||||
|
||||
* **Primary:** Dies instantly. Fencing is implicit (machine is off).
|
||||
* **Replica:** Waits for `failover_timeout` (5s). Sees staleness. Promotes self.
|
||||
* **Outcome:** Clean failover after 5s.
|
||||
|
||||
// TODO detail what happens when the primary comes back up. We will likely have to tie PostgreSQL's lifecycle (liveness/readiness probes) with the agent to ensure it does not come back up as primary.
|
||||
|
||||
#### Case 2: High Network Latency on the Primary (The "Split Brain" Trap)
|
||||
|
||||
* **Scenario:** Network latency spikes to 5s on the Primary, still below `heartbeat_timeout` on the Replica.
|
||||
* **T=0 to T=2 (Primary):** Tries to write. Latency (5s) > Timeout (1s). Fails twice.
|
||||
* **T=2 (Primary):** `consecutive_failures` = 2. **Primary Fences Self.** (Service is DOWN).
|
||||
* **T=2 to T=5 (Cluster):** **Read-Only Phase.** No Primary exists.
|
||||
* **T=5 (Replica):** `failover_timeout` reached. Replica promotes self.
|
||||
* **Outcome:** Safe failover. The "Read-Only Gap" (T=2 to T=5) ensures no Split Brain occurred.
|
||||
|
||||
#### Case 3: Replica Network Lag (False Positive)
|
||||
|
||||
* **Scenario:** Replica has high latency, greater than `failover_timeout`; Primary is fine.
|
||||
* **Replica:** Thinks Primary is dead. Tries to promote by setting `cluster_state.current_primary = replica_id`.
|
||||
* **NATS:** Rejects the write because the Primary is still updating the sequence numbers successfully.
|
||||
* **Outcome:** Promotion denied. Primary stays leader.
|
||||
|
||||
#### Case 4: Network Instability (Flapping)
|
||||
|
||||
* **Scenario:** Intermittent packet loss.
|
||||
* **Primary:** Fails 1 heartbeat, succeeds the next. `consecutive_failures` resets.
|
||||
* **Replica:** Sees a slight delay in updates, but never reaches `failover_timeout`.
|
||||
* **Outcome:** No Fencing, No Promotion. System rides out the noise.
|
||||
|
||||
## Contextual notes
|
||||
|
||||
* Clock skew : Tokio relies on monotonic clocks. This means that `tokio::time::sleep(...)` will not be affected by system clock corrections (such as NTP). But monotonic clocks are known to jump forward in some cases such as VM live migrations. This could mean a false timeout of a single heartbeat. If `failure_threshold = 1`, this can mean a false negative on the nodes' health, and a potentially useless demotion.
|
||||
107
docs/adr/017-3-revised-staleness-inspired-by-kubernetes.md
Normal file
107
docs/adr/017-3-revised-staleness-inspired-by-kubernetes.md
Normal file
@@ -0,0 +1,107 @@
|
||||
### Context & Use Case
|
||||
We are implementing a High Availability (HA) Failover Strategy for decentralized "Micro Data Centers." The core challenge is managing stateful workloads (PostgreSQL) over unreliable networks.
|
||||
|
||||
We solve this using a **Local Fencing First** approach, backed by **NATS JetStream Strict Ordering** for the final promotion authority.
|
||||
|
||||
In CAP theorem terms, we are developing a CP system, intentionally sacrificing availability. In practical terms, we expect an average of two primary outages per year, with a failover delay of around 2 minutes. This translates to an uptime of over five nines. To be precise, 2 outages * 2 minutes = 4 minutes per year = 99.99924% uptime.
|
||||
|
||||
### The Algorithm: Local Fencing & Remote Promotion
|
||||
|
||||
The safety (data consistency) of the system relies on the time gap between the **Primary giving up (Fencing)** and the **Replica taking over (Promotion)**.
|
||||
|
||||
To avoid clock skew issues between agents and the datastore (NATS), all timestamp comparisons will be done using JetStream metadata. I.e., a Harmony agent will never use `Instant::now()` to get a timestamp; it will use `my_last_heartbeat.metadata.timestamp` (conceptually).
|
||||
|
||||
#### 1. Configuration
|
||||
* `heartbeat_timeout` (e.g., 1s): Max time allowed for a NATS write/DB check.
|
||||
* `failure_threshold` (e.g., 2): Consecutive failures before self-fencing.
|
||||
* `failover_timeout` (e.g., 5s): Time since last NATS update of Primary heartbeat before Replica promotes.
|
||||
* This timeout must be carefully configured to allow enough time for the primary to fence itself (after `heartbeat_timeout * failure_threshold`) BEFORE the replica gets promoted to avoid a split brain with two primaries.
|
||||
* Implementing this will rely on the actual deployment configuration. For example, a CNPG based PostgreSQL cluster might require a longer gap (such as 30s) than other technologies.
|
||||
* Expires when `replica_heartbeat.metadata.timestamp - primary_heartbeat.metadata.timestamp > failover_timeout`
|
||||
|
||||
#### 2. The Primary (Self-Preservation)
|
||||
|
||||
The Primary is aggressive about killing itself.
|
||||
|
||||
* It attempts a heartbeat.
|
||||
* If the network latency > `heartbeat_timeout`, the attempt is **cancelled locally** because the heartbeat did not make it back in time.
|
||||
* This counts as a failure and increments the `consecutive_failures` counter.
|
||||
* If `consecutive_failures` hit the threshold, **FENCING occurs immediately**. The database is stopped.
|
||||
|
||||
This means that the Primary will fence itself after `heartbeat_timeout * failure_threshold`.
|
||||
|
||||
#### 3. The Replica (The Watchdog)
|
||||
|
||||
The Replica is patient.
|
||||
|
||||
* It watches the NATS stream to measure if `replica_heartbeat.metadata.timestamp - primary_heartbeat.metadata.timestamp > failover_timeout`
|
||||
* It only attempts promotion if the `failover_timeout` (5s) has passed.
|
||||
* **Crucial:** Careful configuration of the failover_timeout is required. This is the only way to avoid a split brain in case of a network partition where the Primary cannot write its heartbeats in time anymore.
|
||||
* In short, `failover_timeout` should be tuned to be `heartbeat_timeout * failure_threshold + safety_margin`. This `safety_margin` will vary by use case. For example, a CNPG cluster may need 30 seconds to demote a Primary to Replica when fencing is triggered, so `safety_margin` should be at least 30s in that setup.
|
||||
|
||||
Since we forcibly fail timeouts after `heartbeat_timeout`, we are guaranteed that the primary will have **started** the fencing process after `heartbeat_timeout * failure_threshold`.
|
||||
|
||||
But, in a network split scenario where the failed primary is still accessible by clients but cannot write its heartbeat successfully, there is no way to know if the demotion has actually **completed**.
|
||||
|
||||
For example, in a CNPG cluster, the failed Primary agent will attempt to change the CNPG cluster state to read-only. But if anything fails after that attempt (permission error, k8s api failure, CNPG bug, etc) it is possible that the PostgreSQL instance keeps accepting writes.
|
||||
|
||||
While this is not a theoretical failure of the agent's algorithm, this is a practical failure where data corruption occurs.
|
||||
|
||||
This can be fixed by detecting the demotion failure and escalating the fencing procedure aggressiveness. Harmony being an infrastructure orchestrator, it can easily exert radical measures if given the proper credentials, such as forcibly powering off a server, disconnecting its network in the switch configuration, forcibly kill a pod/container/process, etc.
|
||||
|
||||
However, these details are out of scope of this algorithm, as they simply fall under the "fencing procedure".
|
||||
|
||||
The implementation of the fencing procedure itself is not relevant. This algorithm's responsibility stops at calling the fencing procedure in the appropriate situation.
|
||||
|
||||
#### 4. The Demotion Handshake (Return to Normalcy)
|
||||
|
||||
When the original Primary recovers:
|
||||
|
||||
1. It becomes healthy locally but sees `current_primary = Replica`. It waits.
|
||||
2. The Replica (current leader) detects the Original Primary is back (via NATS heartbeats).
|
||||
3. Replica performs a **Clean Demotion**:
|
||||
* Stops DB.
|
||||
* Writes `current_primary = None` to NATS.
|
||||
4. Original Primary sees `current_primary = None` and can launch the promotion procedure.
|
||||
|
||||
Depending on the implementation, the promotion procedure may require a transition phase. Typically, for a PostgreSQL use case the promoting primary will make sure it has caught up on WAL replication before starting to accept writes.
|
||||
|
||||
---
|
||||
|
||||
### Failure Modes & Behavior Analysis
|
||||
|
||||
#### Case 1: Immediate Outage (Power Cut)
|
||||
|
||||
* **Primary:** Dies instantly. Fencing is implicit (machine is off).
|
||||
* **Replica:** Waits for `failover_timeout` (5s). Sees staleness. Promotes self.
|
||||
* **Outcome:** Clean failover after 5s.
|
||||
|
||||
// TODO detail what happens when the primary comes back up. We will likely have to tie PostgreSQL's lifecycle (liveness/readiness probes) with the agent to ensure it does not come back up as primary.
|
||||
|
||||
#### Case 2: High Network Latency on the Primary (The "Split Brain" Trap)
|
||||
|
||||
* **Scenario:** Network latency spikes to 5s on the Primary, still below `heartbeat_timeout` on the Replica.
|
||||
* **T=0 to T=2 (Primary):** Tries to write. Latency (5s) > Timeout (1s). Fails twice.
|
||||
* **T=2 (Primary):** `consecutive_failures` = 2. **Primary Fences Self.** (Service is DOWN).
|
||||
* **T=2 to T=5 (Cluster):** **Read-Only Phase.** No Primary exists.
|
||||
* **T=5 (Replica):** `failover_timeout` reached. Replica promotes self.
|
||||
* **Outcome:** Safe failover. The "Read-Only Gap" (T=2 to T=5) ensures no Split Brain occurred.
|
||||
|
||||
#### Case 3: Replica Network Lag (False Positive)
|
||||
|
||||
* **Scenario:** Replica has high latency, greater than `failover_timeout`; Primary is fine.
|
||||
* **Replica:** Thinks Primary is dead. Tries to promote by setting `cluster_state.current_primary = replica_id`.
|
||||
* **NATS:** Rejects the write because the Primary is still updating the sequence numbers successfully.
|
||||
* **Outcome:** Promotion denied. Primary stays leader.
|
||||
|
||||
#### Case 4: Network Instability (Flapping)
|
||||
|
||||
* **Scenario:** Intermittent packet loss.
|
||||
* **Primary:** Fails 1 heartbeat, succeeds the next. `consecutive_failures` resets.
|
||||
* **Replica:** Sees a slight delay in updates, but never reaches `failover_timeout`.
|
||||
* **Outcome:** No Fencing, No Promotion. System rides out the noise.
|
||||
|
||||
## Contextual notes
|
||||
|
||||
* Clock skew : Tokio relies on monotonic clocks. This means that `tokio::time::sleep(...)` will not be affected by system clock corrections (such as NTP). But monotonic clocks are known to jump forward in some cases such as VM live migrations. This could mean a false timeout of a single heartbeat. If `failure_threshold = 1`, this can mean a false failure detection (the node is healthy but wrongly judged failed), and a potentially useless demotion.
|
||||
* `heartbeat_timeout == heartbeat_interval` : We intentionally do not provide two separate settings for the timeout before considering a heartbeat failed and the interval between heartbeats. It could make sense in some configurations where low network latency is required to have a small `heartbeat_timeout = 50ms` and a larger `heartbeat_interval = 2s`, but we do not have a practical use case for it yet. And having a timeout larger than the interval does not make sense in any situation we can think of at the moment. So we decided to have a single value for both, which makes the algorithm easier to reason about and implement.
|
||||
95
docs/adr/017-staleness-detection-for-failover.md
Normal file
95
docs/adr/017-staleness-detection-for-failover.md
Normal file
@@ -0,0 +1,95 @@
|
||||
# Architecture Decision Record: Staleness-Based Failover Mechanism & Observability
|
||||
|
||||
**Status:** Proposed
|
||||
**Date:** 2026-01-09
|
||||
**Precedes:** [016-Harmony-Agent-And-Global-Mesh-For-Decentralized-Workload-Management.md](https://git.nationtech.io/NationTech/harmony/raw/branch/master/adr/016-Harmony-Agent-And-Global-Mesh-For-Decentralized-Workload-Management.md)
|
||||
|
||||
## Context
|
||||
|
||||
In ADR 016, we established the **Harmony Agent** and the **Global Orchestration Mesh** (powered by NATS JetStream) as the foundation for our decentralized infrastructure. We defined the high-level need for a `FailoverStrategy` that can support both financial consistency (CP) and AI availability (AP).
|
||||
|
||||
However, a specific implementation challenge remains: **How do we reliably detect node failure without losing the ability to debug the event later?**
|
||||
|
||||
Standard distributed systems often use "Key Expiration" (TTL) for heartbeats. If a key disappears, the node is presumed dead. While simple, this approach is catastrophic for post-mortem analysis. When the key expires, the evidence of *when* and *how* the failure occurred evaporates.
|
||||
|
||||
For NationTech’s vision of **Humane Computing**—where micro datacenters might be heating a family home or running a local business—reliability and diagnosability are paramount. If a cluster fails over, we owe it to the user to provide a clear, historical log of exactly what happened. We cannot build a "wonderful future for computers" on ephemeral, untraceable errors.
|
||||
|
||||
## Decision
|
||||
|
||||
We will implement a **Staleness Detection** mechanism rather than a Key Expiration mechanism. We will leverage NATS JetStream Key-Value (KV) stores with **History Enabled** to create an immutable audit trail of cluster health.
|
||||
|
||||
### 1. The "Black Box" Flight Recorder (NATS Configuration)
|
||||
We will utilize a persistent NATS KV bucket named `harmony_failover`.
|
||||
* **Storage:** File (Persistent).
|
||||
* **History:** Set to `64` (or higher). This allows us to query the last 64 heartbeat entries to visualize the exact degradation of the primary node before failure.
|
||||
* **TTL:** None. Data never disappears; it only becomes "stale."
|
||||
|
||||
### 2. Data Structures
|
||||
We will define two primary schemas to manage the state.
|
||||
|
||||
|
||||
**A. The Rules of Engagement (`cluster_config`)**
|
||||
This persistent key defines the behavior of the mesh. It allows us to tune failover sensitivity dynamically without redeploying the Agent binary.
|
||||
|
||||
```json
|
||||
{
|
||||
"primary_site_id": "site-a-basement",
|
||||
"replica_site_id": "site-b-cloud",
|
||||
"failover_timeout_ms": 5000, // Time before Replica takes over
|
||||
"heartbeat_interval_ms": 1000 // Frequency of Primary updates
|
||||
}
|
||||
```
|
||||
|
||||
> **Note :** The location for this configuration data structure is TBD. See https://git.nationtech.io/NationTech/harmony/issues/206
|
||||
|
||||
**B. The Heartbeat (`primary_heartbeat`)**
|
||||
The Primary writes this; the Replica watches it.
|
||||
|
||||
```json
|
||||
{
|
||||
"site_id": "site-a-basement",
|
||||
"status": "HEALTHY",
|
||||
"counter": 10452,
|
||||
"timestamp": 1704661549000
|
||||
}
|
||||
```
|
||||
|
||||
### 3. The Failover Algorithm
|
||||
|
||||
**The Primary (Site A) Logic:**
|
||||
The Primary's ability to write to the mesh is its "License to Operate."
|
||||
1. **Write Loop:** Attempts to write `primary_heartbeat` every `heartbeat_interval_ms`.
|
||||
2. **Self-Preservation (Fencing):** If the write fails (NATS Ack timeout or NATS unreachable), the Primary **immediately self-demotes**. It assumes it is network-isolated. This prevents Split Brain scenarios where a partitioned Primary continues to accept writes while the Replica promotes itself.
|
||||
|
||||
**The Replica (Site B) Logic:**
|
||||
The Replica acts as the watchdog.
|
||||
1. **Watch:** Subscribes to updates on `primary_heartbeat`.
|
||||
2. **Staleness Check:** Maintains a local timer. Every time a heartbeat arrives, the timer resets.
|
||||
3. **Promotion:** If the timer exceeds `failover_timeout_ms`, the Replica declares the Primary dead and promotes itself to Leader.
|
||||
4. **Yielding:** If the Replica is Leader, but suddenly receives a valid, new heartbeat from the configured `primary_site_id` (indicating the Primary has recovered), the Replica will voluntarily **demote** itself to restore the preferred topology.
|
||||
|
||||
## Rationale
|
||||
|
||||
**Observability as a First-Class Citizen**
|
||||
By keeping the last 64 heartbeats, we can run `nats kv history` to see the exact timeline. Did the Primary stop suddenly (crash)? Or did the heartbeats become erratic and slow before stopping (network congestion)? This data is critical for optimizing the "Micro Data Centers" described in our vision, where internet connections in residential areas may vary in quality.
|
||||
|
||||
**Energy Efficiency & Resource Optimization**
|
||||
NationTech aims to "maximize the value of our energy." A "flapping" cluster (constantly failing over and back) wastes immense energy in data re-synchronization and startup costs. By making the `failover_timeout_ms` configurable via `cluster_config`, we can tune a cluster heating a greenhouse to be less sensitive (slower failover is fine) compared to a cluster running a payment gateway.
|
||||
|
||||
**Decentralized Trust**
|
||||
This architecture relies on NATS as the consensus engine. If the Primary is part of the NATS majority, it lives. If it isn't, it dies. This removes ambiguity and allows us to scale to thousands of independent sites without a central "God mode" controller managing every single failover.
|
||||
|
||||
## Consequences
|
||||
|
||||
**Positive**
|
||||
* **Auditability:** Every failover event leaves a permanent trace in the KV history.
|
||||
* **Safety:** The "Write Ack" check on the Primary provides a strong guarantee against Split Brain in `AbsoluteConsistency` mode.
|
||||
* **Dynamic Tuning:** We can adjust timeouts for specific environments (e.g., high-latency satellite links) by updating a JSON key, requiring no downtime.
|
||||
|
||||
**Negative**
|
||||
* **Storage Overhead:** Keeping history requires marginally more disk space on the NATS servers, though for 64 small JSON payloads, this is negligible.
|
||||
* **Clock Skew:** While we rely on NATS server-side timestamps for ordering, extreme clock skew on the client side could confuse the debug logs (though not the failover logic itself).
|
||||
|
||||
## Alignment with Vision
|
||||
This architecture supports the NationTech goal of a **"Beautifully Integrated Design."** It takes the complex, high-stakes problem of distributed consensus and wraps it in a mechanism that is robust enough for enterprise banking yet flexible enough to manage a basement server heating a swimming pool. It bridges the gap between the reliability of Web2 clouds and the decentralized nature of Web3 infrastructure.
|
||||
|
||||
229
docs/coding-guide.md
Normal file
229
docs/coding-guide.md
Normal file
@@ -0,0 +1,229 @@
|
||||
# Harmony Coding Guide
|
||||
|
||||
Harmony is an infrastructure automation framework. It is **code-first and code-only**: operators write Rust programs to declare and drive infrastructure, rather than YAML files or DSL configs. Good code here means a good operator experience.
|
||||
|
||||
### Concrete context
|
||||
|
||||
We use here the context of the KVM module to explain the coding style. This will make it very easy to understand and should translate quite well to other modules/contexts managed by Harmony like OPNSense and Kubernetes.
|
||||
|
||||
## Core Philosophy
|
||||
|
||||
### High-level functions over raw primitives
|
||||
|
||||
Callers should not need to know about underlying protocols, XML schemas, or API quirks. A function that deploys a VM should accept meaningful parameters like CPU count, memory, and network name — not XML strings.
|
||||
|
||||
```rust
|
||||
// Bad: caller constructs XML and passes it to a thin wrapper
|
||||
let xml = format!(r#"<domain type='kvm'>...</domain>"#, name, memory_kb, ...);
|
||||
executor.create_vm(&xml).await?;
|
||||
|
||||
// Good: caller describes intent, the module handles representation
|
||||
executor.define_vm(&VmConfig::builder("my-vm")
|
||||
.cpu(4)
|
||||
.memory_gb(8)
|
||||
.disk(DiskConfig::new(50))
|
||||
.network(NetworkRef::named("mylan"))
|
||||
.boot_order([BootDevice::Network, BootDevice::Disk])
|
||||
.build())
|
||||
.await?;
|
||||
```
|
||||
|
||||
The module owns the XML, the virsh invocations, the API calls — not the caller.
|
||||
|
||||
### Use the right abstraction layer
|
||||
|
||||
Prefer native library bindings over shelling out to CLI tools. The `virt` crate provides direct libvirt bindings and should be used instead of spawning `virsh` subprocesses.
|
||||
|
||||
- CLI subprocess calls are fragile: stdout/stderr parsing, exit codes, quoting, PATH differences
|
||||
- Native bindings give typed errors, no temp files, no shell escaping
|
||||
- `virt::connect::Connect` opens a connection; `virt::domain::Domain` manages VMs; `virt::network::Network` manages virtual networks
|
||||
|
||||
### Keep functions small and well-named
|
||||
|
||||
Each function should do one thing. If a function is doing two conceptually separate things, split it. Function names should read like plain English: `ensure_network_active`, `define_vm`, `vm_is_running`.
|
||||
|
||||
### Prefer short modules over large files
|
||||
|
||||
Group related types and functions by concept. A module that handles one resource (e.g., network, domain, storage) is better than a single file for everything.
|
||||
|
||||
---
|
||||
|
||||
## Error Handling
|
||||
|
||||
### Use `thiserror` for all error types
|
||||
|
||||
Define error types with `thiserror::Error`. This removes the boilerplate of implementing `Display` and `std::error::Error` by hand, keeps error messages close to their variants, and makes types easy to extend.
|
||||
|
||||
```rust
|
||||
// Bad: hand-rolled Display + std::error::Error
|
||||
#[derive(Debug)]
|
||||
pub enum KVMError {
|
||||
ConnectionError(String),
|
||||
VMNotFound(String),
|
||||
}
|
||||
|
||||
impl std::fmt::Display for KVMError { ... }
|
||||
impl std::error::Error for KVMError {}
|
||||
|
||||
// Good: derive Display via thiserror
|
||||
#[derive(thiserror::Error, Debug)]
|
||||
pub enum KVMError {
|
||||
#[error("connection failed: {0}")]
|
||||
ConnectionFailed(String),
|
||||
#[error("VM not found: {name}")]
|
||||
VmNotFound { name: String },
|
||||
}
|
||||
```
|
||||
|
||||
### Make bubbling errors easy with `?` and `From`
|
||||
|
||||
`?` works on any error type for which there is a `From` impl. Add `From` conversions from lower-level errors into your module's error type so callers can use `?` without boilerplate.
|
||||
|
||||
With `thiserror`, wrapping a foreign error is one line:
|
||||
|
||||
```rust
|
||||
#[derive(thiserror::Error, Debug)]
|
||||
pub enum KVMError {
|
||||
#[error("libvirt error: {0}")]
|
||||
Libvirt(#[from] virt::error::Error),
|
||||
|
||||
#[error("IO error: {0}")]
|
||||
Io(#[from] std::io::Error),
|
||||
}
|
||||
```
|
||||
|
||||
This means a call that returns `virt::error::Error` can be `?`-propagated into a `Result<_, KVMError>` without any `.map_err(...)`.
|
||||
|
||||
### Typed errors over stringly-typed errors
|
||||
|
||||
Avoid `Box<dyn Error>` or `String` as error return types in library code. Callers need to distinguish errors programmatically — `KVMError::VmAlreadyExists` is actionable, `"VM already exists: foo"` as a `String` is not.
|
||||
|
||||
At binary entry points (e.g., `main`) it is acceptable to convert to `String` or `anyhow::Error` for display.
|
||||
|
||||
---
|
||||
|
||||
## Logging
|
||||
|
||||
### Use the `log` crate macros
|
||||
|
||||
All log output must go through the `log` crate. Never use `println!`, `eprintln!`, or `dbg!` in library code. This makes output compatible with any logging backend (env_logger, tracing, structured logging, etc.).
|
||||
|
||||
```rust
|
||||
// Bad
|
||||
println!("Creating VM: {}", name);
|
||||
|
||||
// Good
|
||||
use log::{info, debug, warn};
|
||||
info!("Creating VM: {name}");
|
||||
debug!("VM XML:\n{xml}");
|
||||
warn!("Network already active, skipping creation");
|
||||
```
|
||||
|
||||
Use the right level:
|
||||
|
||||
| Level | When to use |
|
||||
|---------|-------------|
|
||||
| `error` | Unrecoverable failures (before returning Err) |
|
||||
| `warn` | Recoverable issues, skipped steps |
|
||||
| `info` | High-level progress events visible in normal operation |
|
||||
| `debug` | Detailed operational info useful for debugging |
|
||||
| `trace` | Very granular, per-iteration or per-call data |
|
||||
|
||||
Log before significant operations and after unexpected conditions. Do not log inside tight loops at `info` level.
|
||||
|
||||
---
|
||||
|
||||
## Types and Builders
|
||||
|
||||
### Derive `Serialize` on all public domain types
|
||||
|
||||
All public structs and enums that represent configuration or state should derive `serde::Serialize`. Add `Deserialize` when round-trip serialization is needed.
|
||||
|
||||
### Builder pattern for complex configs
|
||||
|
||||
When a type has more than three fields or optional fields, provide a builder. The builder pattern allows named, incremental construction without positional arguments.
|
||||
|
||||
```rust
|
||||
let config = VmConfig::builder("bootstrap")
|
||||
.cpu(4)
|
||||
.memory_gb(8)
|
||||
.disk(DiskConfig::new(50).labeled("os"))
|
||||
.disk(DiskConfig::new(100).labeled("data"))
|
||||
.network(NetworkRef::named("harmonylan"))
|
||||
.boot_order([BootDevice::Network, BootDevice::Disk])
|
||||
.build();
|
||||
```
|
||||
|
||||
### Avoid `pub` fields on config structs
|
||||
|
||||
Expose data through methods or the builder, not raw field access. This preserves the ability to validate, rename, or change representation without breaking callers.
|
||||
|
||||
---
|
||||
|
||||
## Async
|
||||
|
||||
### Use `tokio` for all async runtime needs
|
||||
|
||||
All async code runs on tokio. Use `tokio::spawn`, `tokio::time`, etc. Use `#[async_trait]` for traits with async methods.
|
||||
|
||||
### No blocking in async context
|
||||
|
||||
Never call blocking I/O (file I/O, network, process spawn) directly in an async function. Use `tokio::fs`, `tokio::process`, or `tokio::task::spawn_blocking` as appropriate.
|
||||
|
||||
---
|
||||
|
||||
## Module Structure
|
||||
|
||||
### Follow the `Score` / `Interpret` pattern
|
||||
|
||||
Modules that represent deployable infrastructure should implement `Score<T: Topology>` and `Interpret<T>`:
|
||||
|
||||
- `Score` is the serializable, clonable configuration declaring *what* to deploy
|
||||
- `Interpret` does the actual work when `execute()` is called
|
||||
|
||||
```rust
|
||||
pub struct KvmScore {
|
||||
network: NetworkConfig,
|
||||
vms: Vec<VmConfig>,
|
||||
}
|
||||
|
||||
impl<T: Topology + KvmHost> Score<T> for KvmScore {
|
||||
fn create_interpret(&self) -> Box<dyn Interpret<T>> {
|
||||
Box::new(KvmInterpret::new(self.clone()))
|
||||
}
|
||||
fn name(&self) -> String { "KvmScore".to_string() }
|
||||
}
|
||||
```
|
||||
|
||||
### Flatten the public API in `mod.rs`
|
||||
|
||||
Internal submodules are implementation detail. Re-export what callers need at the module root:
|
||||
|
||||
```rust
|
||||
// modules/kvm/mod.rs
|
||||
mod connection;
|
||||
mod domain;
|
||||
mod network;
|
||||
mod error;
|
||||
mod xml;
|
||||
|
||||
pub use connection::KvmConnection;
|
||||
pub use domain::{VmConfig, VmConfigBuilder, VmStatus, DiskConfig, BootDevice};
|
||||
pub use error::KvmError;
|
||||
pub use network::NetworkConfig;
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Commit Style
|
||||
|
||||
Follow [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/):
|
||||
|
||||
```
|
||||
feat(kvm): add network isolation support
|
||||
fix(kvm): correct memory unit conversion for libvirt
|
||||
refactor(kvm): replace virsh subprocess calls with virt crate bindings
|
||||
docs: add coding guide
|
||||
```
|
||||
|
||||
Keep pull requests small and single-purpose (under ~200 lines excluding generated code). Do not mix refactoring, bug fixes, and new features in one PR.
|
||||
158
docs/guides/kubernetes-ingress.md
Normal file
158
docs/guides/kubernetes-ingress.md
Normal file
@@ -0,0 +1,158 @@
|
||||
# Ingress Resources in Harmony
|
||||
|
||||
Harmony generates standard Kubernetes `networking.k8s.io/v1` Ingress resources. This ensures your deployments are portable across any Kubernetes distribution (vanilla K8s, OKD/OpenShift, K3s, etc.) without requiring vendor-specific configurations.
|
||||
|
||||
By default, Harmony does **not** set `spec.ingressClassName`. This allows the cluster's default ingress controller to automatically claim the resource, which is the correct approach for most single-controller clusters.
|
||||
|
||||
---
|
||||
|
||||
## TLS Configurations
|
||||
|
||||
There are two portable TLS modes for Ingress resources. Use only these in your Harmony deployments.
|
||||
|
||||
### 1. Plain HTTP (No TLS)
|
||||
|
||||
Omit the `tls` block entirely. The Ingress serves traffic over plain HTTP. Use this for local development or when TLS is terminated elsewhere (e.g., by a service mesh or external load balancer).
|
||||
|
||||
```yaml
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: Ingress
|
||||
metadata:
|
||||
name: my-app
|
||||
namespace: my-ns
|
||||
spec:
|
||||
rules:
|
||||
- host: app.example.com
|
||||
http:
|
||||
paths:
|
||||
- path: /
|
||||
pathType: Prefix
|
||||
backend:
|
||||
service:
|
||||
name: my-app
|
||||
port:
|
||||
number: 8080
|
||||
```
|
||||
|
||||
### 2. HTTPS with a Named TLS Secret
|
||||
|
||||
Provide a `tls` block with both `hosts` and a `secretName`. The ingress controller will use that Secret for TLS termination. The Secret must be a `kubernetes.io/tls` type in the same namespace as the Ingress.
|
||||
|
||||
There are two ways to provide this Secret.
|
||||
|
||||
#### Option A: Manual Secret
|
||||
|
||||
Create the TLS Secret yourself before deploying the Ingress. This is suitable when certificates are issued outside the cluster or managed by another system.
|
||||
|
||||
```yaml
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: Ingress
|
||||
metadata:
|
||||
name: my-app
|
||||
namespace: my-ns
|
||||
spec:
|
||||
rules:
|
||||
- host: app.example.com
|
||||
http:
|
||||
paths:
|
||||
- path: /
|
||||
pathType: Prefix
|
||||
backend:
|
||||
service:
|
||||
name: my-app
|
||||
port:
|
||||
number: 8080
|
||||
tls:
|
||||
- hosts:
|
||||
- app.example.com
|
||||
secretName: app-example-com-tls
|
||||
```
|
||||
|
||||
#### Option B: Automated via cert-manager (Recommended)
|
||||
|
||||
Add the `cert-manager.io/cluster-issuer` annotation to the Ingress. cert-manager will automatically perform the ACME challenge, generate the certificate, store it in the named Secret, and handle renewal. You do not create the Secret yourself.
|
||||
|
||||
```yaml
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: Ingress
|
||||
metadata:
|
||||
name: my-app
|
||||
namespace: my-ns
|
||||
annotations:
|
||||
cert-manager.io/cluster-issuer: letsencrypt-prod
|
||||
spec:
|
||||
rules:
|
||||
- host: app.example.com
|
||||
http:
|
||||
paths:
|
||||
- path: /
|
||||
pathType: Prefix
|
||||
backend:
|
||||
service:
|
||||
name: my-app
|
||||
port:
|
||||
number: 8080
|
||||
tls:
|
||||
- hosts:
|
||||
- app.example.com
|
||||
secretName: app-example-com-tls
|
||||
```
|
||||
|
||||
If you use a namespace-scoped `Issuer` instead of a `ClusterIssuer`, replace the annotation with `cert-manager.io/issuer: <name>`.
|
||||
|
||||
---
|
||||
|
||||
## Do Not Use: TLS Without `secretName`
|
||||
|
||||
Avoid TLS entries that omit `secretName`:
|
||||
|
||||
```yaml
|
||||
# ⚠️ Non-portable — do not use
|
||||
tls:
|
||||
- hosts:
|
||||
- app.example.com
|
||||
```
|
||||
|
||||
Behavior for this pattern is **controller-specific and not portable**. On OKD/OpenShift, the ingress-to-route translation rejects it as incomplete. On other controllers, it may silently serve a self-signed fallback or fail in unpredictable ways. Harmony does not support this pattern.
|
||||
|
||||
---
|
||||
|
||||
## Prerequisites for cert-manager
|
||||
|
||||
To use automated certificates (Option B above):
|
||||
|
||||
1. **cert-manager** must be installed on the cluster.
|
||||
2. A `ClusterIssuer` or `Issuer` must exist. A typical Let's Encrypt production issuer:
|
||||
|
||||
```yaml
|
||||
apiVersion: cert-manager.io/v1
|
||||
kind: ClusterIssuer
|
||||
metadata:
|
||||
name: letsencrypt-prod
|
||||
spec:
|
||||
acme:
|
||||
server: https://acme-v02.api.letsencrypt.org/directory
|
||||
email: team@example.com
|
||||
privateKeySecretRef:
|
||||
name: letsencrypt-prod-account-key
|
||||
solvers:
|
||||
- http01:
|
||||
ingress: {}
|
||||
```
|
||||
|
||||
3. **DNS must already resolve** to the cluster's ingress endpoint before the Ingress is created. The HTTP01 challenge requires this routing to be active.
|
||||
|
||||
For wildcard certificates (e.g. `*.example.com`), HTTP01 cannot be used — configure a DNS01 solver with credentials for your DNS provider instead.
|
||||
|
||||
---
|
||||
|
||||
## OKD / OpenShift Notes
|
||||
|
||||
On OKD, standard Ingress resources are automatically translated into OpenShift `Route` objects. The default TLS termination mode is `edge`, which is correct for most HTTP applications. To control this explicitly, add:
|
||||
|
||||
```yaml
|
||||
annotations:
|
||||
route.openshift.io/termination: edge # or passthrough / reencrypt
|
||||
```
|
||||
|
||||
This annotation is ignored on non-OpenShift clusters and is safe to include unconditionally.
|
||||
16
docs/one_liners.md
Normal file
16
docs/one_liners.md
Normal file
@@ -0,0 +1,16 @@
|
||||
# Handy one liners for infrastructure management
|
||||
|
||||
### Delete all evicted pods from a cluster
|
||||
|
||||
```sh
|
||||
kubectl get po -A | grep Evic | awk '{ print "-n " $1 " " $2 }' | xargs -L 1 kubectl delete po
|
||||
```
|
||||
> Pods are evicted when the node they are running on lacks the resources to keep them going. The most common case is when ephemeral storage becomes too full because of something like a log file getting too big.
|
||||
>
|
||||
> It could also happen because of memory or cpu pressure due to unpredictable workloads.
|
||||
>
|
||||
> This means it is generally ok to delete them.
|
||||
>
|
||||
> However, in a perfectly configured deployment and cluster, pods should rarely, if ever, get evicted. For example, logging should be reconfigured so that log files do not use too much space, or the deployment should be configured to reserve the correct amount of ephemeral storage space.
|
||||
>
|
||||
> Note that deleting evicted pods does not solve the underlying issue; make sure to understand why the pod was evicted in the first place and put the proper solution in place.
|
||||
15
examples/example_linux_vm/Cargo.toml
Normal file
15
examples/example_linux_vm/Cargo.toml
Normal file
@@ -0,0 +1,15 @@
|
||||
[package]
|
||||
name = "example_linux_vm"
|
||||
version.workspace = true
|
||||
edition = "2024"
|
||||
license.workspace = true
|
||||
|
||||
[[bin]]
|
||||
name = "example_linux_vm"
|
||||
path = "src/main.rs"
|
||||
|
||||
[dependencies]
|
||||
harmony = { path = "../../harmony" }
|
||||
tokio.workspace = true
|
||||
log.workspace = true
|
||||
env_logger.workspace = true
|
||||
43
examples/example_linux_vm/README.md
Normal file
43
examples/example_linux_vm/README.md
Normal file
@@ -0,0 +1,43 @@
|
||||
# Example: Linux VM from ISO
|
||||
|
||||
This example deploys a simple Linux virtual machine from an ISO URL.
|
||||
|
||||
## What it creates
|
||||
|
||||
- One isolated virtual network (`linuxvm-net`, 192.168.101.0/24)
|
||||
- One Ubuntu Server VM with the ISO attached as a CD-ROM
|
||||
- The VM is configured to boot from the CD-ROM first, allowing installation
|
||||
- After installation, the VM can be rebooted to boot from disk
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- A running KVM hypervisor (local or remote)
|
||||
- `HARMONY_KVM_URI` environment variable pointing to the hypervisor (defaults to `qemu:///system`)
|
||||
- `HARMONY_KVM_IMAGE_DIR` environment variable for storing VM images (defaults to harmony data dir)
|
||||
|
||||
## Usage
|
||||
|
||||
```bash
|
||||
cargo run -p example_linux_vm
|
||||
```
|
||||
|
||||
## After deployment
|
||||
|
||||
Once the VM is running, you can connect to its console:
|
||||
|
||||
```bash
|
||||
virsh -c qemu:///system console linux-vm
|
||||
```
|
||||
|
||||
To access the VM via SSH after installation, you'll need to configure a bridged network or port forwarding.
|
||||
|
||||
## Clean up
|
||||
|
||||
To remove the VM and network:
|
||||
|
||||
```bash
|
||||
virsh -c qemu:///system destroy linux-vm
|
||||
virsh -c qemu:///system undefine linux-vm
|
||||
virsh -c qemu:///system net-destroy linuxvm-net
|
||||
virsh -c qemu:///system net-undefine linuxvm-net
|
||||
```
|
||||
63
examples/example_linux_vm/src/main.rs
Normal file
63
examples/example_linux_vm/src/main.rs
Normal file
@@ -0,0 +1,63 @@
|
||||
use harmony::modules::kvm::config::init_executor;
|
||||
use harmony::modules::kvm::{BootDevice, NetworkConfig, NetworkRef, VmConfig};
|
||||
use log::info;
|
||||
|
||||
const NETWORK_NAME: &str = "linuxvm-net";
|
||||
const NETWORK_GATEWAY: &str = "192.168.101.1";
|
||||
const NETWORK_PREFIX: u8 = 24;
|
||||
|
||||
const UBUNTU_ISO_URL: &str =
|
||||
"https://releases.ubuntu.com/24.04/ubuntu-24.04.3-live-server-amd64.iso";
|
||||
|
||||
pub async fn deploy_linux_vm() -> Result<(), String> {
|
||||
let executor = init_executor().map_err(|e| format!("KVM initialization failed: {e}"))?;
|
||||
|
||||
let network = NetworkConfig::builder(NETWORK_NAME)
|
||||
.bridge("virbr101")
|
||||
.subnet(NETWORK_GATEWAY, NETWORK_PREFIX)
|
||||
.build();
|
||||
|
||||
info!("Ensuring network '{NETWORK_NAME}' ({NETWORK_GATEWAY}/{NETWORK_PREFIX}) exists");
|
||||
executor
|
||||
.ensure_network(network)
|
||||
.await
|
||||
.map_err(|e| format!("Network setup failed: {e}"))?;
|
||||
|
||||
let vm = linux_vm();
|
||||
info!("Defining Linux VM '{}'", vm.name);
|
||||
executor
|
||||
.ensure_vm(vm.clone())
|
||||
.await
|
||||
.map_err(|e| format!("Linux VM setup failed: {e}"))?;
|
||||
|
||||
info!("Starting VM '{}'", vm.name);
|
||||
executor
|
||||
.start_vm(&vm.name)
|
||||
.await
|
||||
.map_err(|e| format!("Failed to start VM: {e}"))?;
|
||||
|
||||
info!(
|
||||
"Linux VM '{}' is running. \
|
||||
Connect to the console using: virsh -c qemu:///system console {}",
|
||||
vm.name, vm.name
|
||||
);
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
fn linux_vm() -> VmConfig {
|
||||
VmConfig::builder("linux-vm")
|
||||
.vcpus(2)
|
||||
.memory_gb(4)
|
||||
.disk(20)
|
||||
.network(NetworkRef::named(NETWORK_NAME))
|
||||
.cdrom(UBUNTU_ISO_URL)
|
||||
.boot_order([BootDevice::Cdrom, BootDevice::Disk])
|
||||
.build()
|
||||
}
|
||||
|
||||
#[tokio::main]
|
||||
async fn main() -> Result<(), String> {
|
||||
env_logger::init();
|
||||
deploy_linux_vm().await
|
||||
}
|
||||
15
examples/kvm_okd_ha_cluster/Cargo.toml
Normal file
15
examples/kvm_okd_ha_cluster/Cargo.toml
Normal file
@@ -0,0 +1,15 @@
|
||||
[package]
|
||||
name = "example-kvm-okd-ha-cluster"
|
||||
version.workspace = true
|
||||
edition = "2024"
|
||||
license.workspace = true
|
||||
|
||||
[[bin]]
|
||||
name = "kvm_okd_ha_cluster"
|
||||
path = "src/main.rs"
|
||||
|
||||
[dependencies]
|
||||
harmony = { path = "../../harmony" }
|
||||
tokio.workspace = true
|
||||
log.workspace = true
|
||||
env_logger.workspace = true
|
||||
100
examples/kvm_okd_ha_cluster/README.md
Normal file
100
examples/kvm_okd_ha_cluster/README.md
Normal file
@@ -0,0 +1,100 @@
|
||||
# OKD HA Cluster on KVM
|
||||
|
||||
Deploys a complete OKD high-availability cluster on a KVM hypervisor using
|
||||
Harmony's KVM module. All infrastructure is defined in Rust — no YAML, no
|
||||
shell scripts, no hand-crafted XML.
|
||||
|
||||
## What it creates
|
||||
|
||||
| Resource | Details |
|
||||
|-------------------|------------------------------------------|
|
||||
| Virtual network | `harmonylan` — 192.168.100.0/24, NAT |
|
||||
| OPNsense VM | 2 vCPU / 4 GiB RAM — gateway + PXE |
|
||||
| Control plane ×3 | 4 vCPU / 16 GiB RAM — `cp0` … `cp2` |
|
||||
| Worker ×3 | 8 vCPU / 32 GiB RAM — `worker0` … `worker2` |
|
||||
|
||||
## Architecture
|
||||
|
||||
All VMs share the same `harmonylan` virtual network. OPNsense sits on both
|
||||
that network and the host bridge, acting as the gateway and PXE server.
|
||||
|
||||
```
|
||||
Host network (bridge)
|
||||
│
|
||||
┌───────┴──────────┐
|
||||
│ OPNsense │ 192.168.100.1
|
||||
│ gateway + PXE │
|
||||
└───────┬──────────┘
|
||||
│
|
||||
│ harmonylan (192.168.100.0/24)
|
||||
├─────────────┬──────────────────┬──────────────────┐
|
||||
│ │ │ │
|
||||
┌───────┴──┐ ┌──────┴───┐ ┌──────────┴─┐ ┌──────────┴─┐
|
||||
│ cp0 │ │ cp1 │ │ cp2 │ │ worker0 │
|
||||
│ .10 │ │ .11 │ │ .12 │ │ .20 │
|
||||
└──────────┘ └──────────┘ └────────────┘ └──────┬─────┘
|
||||
│
|
||||
┌───────┴────┐
|
||||
│ worker1 │
|
||||
│ .21 │
|
||||
└───────┬────┘
|
||||
│
|
||||
┌───────┴────┐
|
||||
│ worker2 │
|
||||
│ .22 │
|
||||
└────────────┘
|
||||
```
|
||||
|
||||
All nodes PXE boot from the network interface. OPNsense serves the OKD
|
||||
bootstrap images via TFTP/iPXE and handles DHCP for the whole subnet.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- Linux host with KVM/QEMU and libvirt installed
|
||||
- `libvirt-dev` headers (for building the `virt` crate)
|
||||
- A `default` storage pool configured in libvirt
|
||||
- Sufficient disk space (~550 GiB for all VM images)
|
||||
|
||||
## Running
|
||||
|
||||
```bash
|
||||
cargo run --bin kvm_okd_ha_cluster
|
||||
```
|
||||
|
||||
Set `RUST_LOG=info` (or `debug`) to control verbosity.
|
||||
|
||||
## Configuration
|
||||
|
||||
| Environment variable | Default | Description |
|
||||
|-------------------------|--------------------|-------------------------------------|
|
||||
| `HARMONY_KVM_URI` | `qemu:///system` | Libvirt connection URI |
|
||||
| `HARMONY_KVM_IMAGE_DIR` | harmony data dir | Directory for qcow2 disk images |
|
||||
|
||||
For a remote KVM host over SSH:
|
||||
|
||||
```bash
|
||||
export HARMONY_KVM_URI="qemu+ssh://user@myhost/system"
|
||||
```
|
||||
|
||||
## What happens after `cargo run`
|
||||
|
||||
The program defines all resources in libvirt but does not start any VMs.
|
||||
Next steps:
|
||||
|
||||
1. Start OPNsense: `virsh start opnsense-harmony`
|
||||
2. Connect to the OPNsense web UI at `https://192.168.100.1`
|
||||
3. Configure DHCP, TFTP, and the iPXE menu for OKD
|
||||
4. Start the control plane and worker nodes — they will PXE boot and begin
|
||||
the OKD installation automatically
|
||||
|
||||
## Cleanup
|
||||
|
||||
```bash
|
||||
for vm in opnsense-harmony cp0-harmony cp1-harmony cp2-harmony \
|
||||
worker0-harmony worker1-harmony worker2-harmony; do
|
||||
virsh destroy "$vm" 2>/dev/null || true
|
||||
virsh undefine "$vm" --remove-all-storage 2>/dev/null || true
|
||||
done
|
||||
virsh net-destroy harmonylan 2>/dev/null || true
|
||||
virsh net-undefine harmonylan 2>/dev/null || true
|
||||
```
|
||||
132
examples/kvm_okd_ha_cluster/src/lib.rs
Normal file
132
examples/kvm_okd_ha_cluster/src/lib.rs
Normal file
@@ -0,0 +1,132 @@
|
||||
use harmony::modules::kvm::{
|
||||
BootDevice, NetworkConfig, NetworkRef, VmConfig, config::init_executor,
|
||||
};
|
||||
use log::info;
|
||||
|
||||
const NETWORK_NAME: &str = "harmonylan";
|
||||
const NETWORK_GATEWAY: &str = "192.168.100.1";
|
||||
const NETWORK_PREFIX: u8 = 24;
|
||||
|
||||
const OPNSENSE_IP: &str = "192.168.100.1";
|
||||
|
||||
/// Deploys a full OKD HA cluster on a local or remote KVM hypervisor.
|
||||
///
|
||||
/// # What it creates
|
||||
///
|
||||
/// - One isolated virtual network (`harmonylan`, 192.168.100.0/24)
|
||||
/// - One OPNsense VM acting as the cluster gateway and PXE server
|
||||
/// - Three OKD control-plane nodes
|
||||
/// - Three OKD worker nodes
|
||||
///
|
||||
/// All nodes are configured to PXE boot from the network so that OPNsense
|
||||
/// can drive unattended OKD installation via TFTP/iPXE.
|
||||
///
|
||||
/// # Configuration
|
||||
///
|
||||
/// | Environment variable | Default | Description |
|
||||
/// |---------------------------|-----------------------|-----------------------------------|
|
||||
/// | `HARMONY_KVM_URI` | `qemu:///system` | Libvirt connection URI |
|
||||
/// | `HARMONY_KVM_IMAGE_DIR` | harmony data dir | Directory for qcow2 disk images |
|
||||
pub async fn deploy_okd_ha_cluster() -> Result<(), String> {
|
||||
let executor = init_executor().map_err(|e| format!("KVM initialisation failed: {e}"))?;
|
||||
|
||||
// -------------------------------------------------------------------------
|
||||
// Network
|
||||
// -------------------------------------------------------------------------
|
||||
let network = NetworkConfig::builder(NETWORK_NAME)
|
||||
.bridge("virbr100")
|
||||
.subnet(NETWORK_GATEWAY, NETWORK_PREFIX)
|
||||
.build();
|
||||
|
||||
info!("Ensuring network '{NETWORK_NAME}' ({NETWORK_GATEWAY}/{NETWORK_PREFIX}) exists");
|
||||
executor
|
||||
.ensure_network(network)
|
||||
.await
|
||||
.map_err(|e| format!("Network setup failed: {e}"))?;
|
||||
|
||||
// -------------------------------------------------------------------------
|
||||
// OPNsense gateway / PXE server
|
||||
// -------------------------------------------------------------------------
|
||||
let opnsense = opnsense_vm();
|
||||
info!("Defining OPNsense VM '{}'", opnsense.name);
|
||||
executor
|
||||
.ensure_vm(opnsense)
|
||||
.await
|
||||
.map_err(|e| format!("OPNsense VM setup failed: {e}"))?;
|
||||
|
||||
// -------------------------------------------------------------------------
|
||||
// Control plane nodes
|
||||
// -------------------------------------------------------------------------
|
||||
for i in 0u8..3 {
|
||||
let vm = control_plane_vm(i);
|
||||
info!("Defining control plane VM '{}'", vm.name);
|
||||
executor
|
||||
.ensure_vm(vm)
|
||||
.await
|
||||
.map_err(|e| format!("Control plane VM setup failed: {e}"))?;
|
||||
}
|
||||
|
||||
// -------------------------------------------------------------------------
|
||||
// Worker nodes
|
||||
// -------------------------------------------------------------------------
|
||||
for i in 0u8..3 {
|
||||
let vm = worker_vm(i);
|
||||
info!("Defining worker VM '{}'", vm.name);
|
||||
executor
|
||||
.ensure_vm(vm)
|
||||
.await
|
||||
.map_err(|e| format!("Worker VM setup failed: {e}"))?;
|
||||
}
|
||||
|
||||
info!(
|
||||
"OKD HA cluster infrastructure ready. \
|
||||
Connect OPNsense at https://{OPNSENSE_IP} to configure DHCP, TFTP, and PXE \
|
||||
before starting the nodes."
|
||||
);
|
||||
Ok(())
|
||||
}
|
||||
|
||||
// -----------------------------------------------------------------------------
|
||||
// VM definitions
|
||||
// -----------------------------------------------------------------------------
|
||||
|
||||
/// OPNsense firewall — gateway and PXE server for the cluster.
|
||||
///
|
||||
/// Connected to both the host bridge (WAN) and `harmonylan` (LAN). It manages
|
||||
/// DHCP, TFTP, and the PXE menu that drives OKD installation on all other VMs.
|
||||
fn opnsense_vm() -> VmConfig {
|
||||
VmConfig::builder("opnsense-harmony")
|
||||
.vcpus(2)
|
||||
.memory_gb(4)
|
||||
.disk(20) // OS disk: vda
|
||||
.network(NetworkRef::named(NETWORK_NAME))
|
||||
.boot_order([BootDevice::Cdrom, BootDevice::Disk])
|
||||
.build()
|
||||
}
|
||||
|
||||
/// One OKD control-plane node. Indexed 0..2 → `cp0-harmony` … `cp2-harmony`.
|
||||
///
|
||||
/// Boots from network so OPNsense can serve the OKD bootstrap image via PXE.
|
||||
fn control_plane_vm(index: u8) -> VmConfig {
|
||||
VmConfig::builder(format!("cp{index}-harmony"))
|
||||
.vcpus(4)
|
||||
.memory_gb(16)
|
||||
.disk(120) // OS + etcd: vda
|
||||
.network(NetworkRef::named(NETWORK_NAME))
|
||||
.boot_order([BootDevice::Network, BootDevice::Disk])
|
||||
.build()
|
||||
}
|
||||
|
||||
/// One OKD worker node. Indexed 0..2 → `worker0-harmony` … `worker2-harmony`.
|
||||
///
|
||||
/// Boots from network for automated OKD installation.
|
||||
fn worker_vm(index: u8) -> VmConfig {
|
||||
VmConfig::builder(format!("worker{index}-harmony"))
|
||||
.vcpus(8)
|
||||
.memory_gb(32)
|
||||
.disk(120) // OS: vda
|
||||
.disk(200) // Persistent storage (ODF/Rook): vdb
|
||||
.network(NetworkRef::named(NETWORK_NAME))
|
||||
.boot_order([BootDevice::Network, BootDevice::Disk])
|
||||
.build()
|
||||
}
|
||||
7
examples/kvm_okd_ha_cluster/src/main.rs
Normal file
7
examples/kvm_okd_ha_cluster/src/main.rs
Normal file
@@ -0,0 +1,7 @@
|
||||
use example_kvm_okd_ha_cluster::deploy_okd_ha_cluster;
|
||||
|
||||
#[tokio::main]
|
||||
async fn main() -> Result<(), String> {
|
||||
env_logger::init();
|
||||
deploy_okd_ha_cluster().await
|
||||
}
|
||||
14
examples/penpot/Cargo.toml
Normal file
14
examples/penpot/Cargo.toml
Normal file
@@ -0,0 +1,14 @@
|
||||
[package]
|
||||
name = "example-penpot"
|
||||
edition = "2024"
|
||||
version.workspace = true
|
||||
readme.workspace = true
|
||||
license.workspace = true
|
||||
|
||||
[dependencies]
|
||||
harmony = { path = "../../harmony" }
|
||||
harmony_cli = { path = "../../harmony_cli" }
|
||||
harmony_macros = { path = "../../harmony_macros" }
|
||||
harmony_types = { path = "../../harmony_types" }
|
||||
tokio.workspace = true
|
||||
url.workspace = true
|
||||
41
examples/penpot/src/main.rs
Normal file
41
examples/penpot/src/main.rs
Normal file
@@ -0,0 +1,41 @@
|
||||
use std::{collections::HashMap, str::FromStr};
|
||||
|
||||
use harmony::{
|
||||
inventory::Inventory,
|
||||
modules::helm::chart::{HelmChartScore, HelmRepository, NonBlankString},
|
||||
topology::K8sAnywhereTopology,
|
||||
};
|
||||
use harmony_macros::hurl;
|
||||
|
||||
#[tokio::main]
|
||||
async fn main() {
|
||||
// let mut chart_values = HashMap::new();
|
||||
// chart_values.insert(
|
||||
// NonBlankString::from_str("persistence.assets.enabled").unwrap(),
|
||||
// "true".into(),
|
||||
// );
|
||||
// let penpot_chart = HelmChartScore {
|
||||
// namespace: Some(NonBlankString::from_str("penpot").unwrap()),
|
||||
// release_name: NonBlankString::from_str("penpot").unwrap(),
|
||||
// chart_name: NonBlankString::from_str("penpot/penpot").unwrap(),
|
||||
// chart_version: None,
|
||||
// values_overrides: Some(chart_values),
|
||||
// values_yaml: None,
|
||||
// create_namespace: true,
|
||||
// install_only: true,
|
||||
// repository: Some(HelmRepository::new(
|
||||
// "penpot".to_string(),
|
||||
// hurl!("http://helm.penpot.app"),
|
||||
// true,
|
||||
// )),
|
||||
// };
|
||||
//
|
||||
// harmony_cli::run(
|
||||
// Inventory::autoload(),
|
||||
// K8sAnywhereTopology::from_env(),
|
||||
// vec![Box::new(penpot_chart)],
|
||||
// None,
|
||||
// )
|
||||
// .await
|
||||
// .unwrap();
|
||||
}
|
||||
Binary file not shown.
@@ -78,11 +78,13 @@ harmony_inventory_agent = { path = "../harmony_inventory_agent" }
|
||||
harmony_secret_derive = { path = "../harmony_secret_derive" }
|
||||
harmony_secret = { path = "../harmony_secret" }
|
||||
askama.workspace = true
|
||||
sha2 = "0.10"
|
||||
sqlx.workspace = true
|
||||
inquire.workspace = true
|
||||
brocade = { path = "../brocade" }
|
||||
option-ext = "0.2.0"
|
||||
rand.workspace = true
|
||||
virt = "0.4.3"
|
||||
|
||||
[dev-dependencies]
|
||||
pretty_assertions.workspace = true
|
||||
|
||||
54
harmony/src/modules/kvm/config.rs
Normal file
54
harmony/src/modules/kvm/config.rs
Normal file
@@ -0,0 +1,54 @@
|
||||
use log::{debug, info};
|
||||
|
||||
use crate::domain::config::HARMONY_DATA_DIR;
|
||||
|
||||
use super::error::KvmError;
|
||||
use super::executor::KvmExecutor;
|
||||
|
||||
const DEFAULT_IMAGE_DIR: &str = "/var/lib/libvirt/images";
|
||||
|
||||
/// Creates a [`KvmExecutor`] from environment variables.
|
||||
///
|
||||
/// | Variable | Description |
|
||||
/// |---------------------------|----------------------------------------------------|
|
||||
/// | `HARMONY_KVM_URI` | Full libvirt URI. Defaults to `qemu:///system`. |
|
||||
/// | `HARMONY_KVM_IMAGE_DIR` | Directory for VM disk images. Defaults to `/var/lib/libvirt/images`. |
|
||||
///
|
||||
/// For backwards compatibility, `HARMONY_KVM_CONNECTION` is also accepted as
|
||||
/// an alias for `HARMONY_KVM_URI`.
|
||||
pub fn init_executor() -> Result<KvmExecutor, KvmError> {
|
||||
let uri = std::env::var("HARMONY_KVM_URI")
|
||||
.or_else(|_| std::env::var("HARMONY_KVM_CONNECTION"))
|
||||
.unwrap_or_else(|_| "qemu:///system".to_string());
|
||||
|
||||
let image_dir = std::env::var("HARMONY_KVM_IMAGE_DIR").unwrap_or_else(|_| {
|
||||
// Fall back to the harmony data dir if available, else the system default.
|
||||
let data_dir = HARMONY_DATA_DIR.join("kvm").join("images");
|
||||
let path = data_dir.to_string_lossy().to_string();
|
||||
debug!("HARMONY_KVM_IMAGE_DIR not set; using {path}");
|
||||
path
|
||||
});
|
||||
|
||||
if uri.starts_with("qemu+ssh://") {
|
||||
validate_ssh_uri(&uri)?;
|
||||
}
|
||||
|
||||
info!("KVM executor initialised: uri={uri}, image_dir={image_dir}");
|
||||
Ok(KvmExecutor::new(uri, image_dir))
|
||||
}
|
||||
|
||||
/// Validates that an SSH URI looks structurally correct and returns an error
|
||||
/// with a helpful message when it does not.
|
||||
fn validate_ssh_uri(uri: &str) -> Result<(), KvmError> {
|
||||
// Expected form: qemu+ssh://user@host/system
|
||||
let without_scheme = uri
|
||||
.strip_prefix("qemu+ssh://")
|
||||
.ok_or_else(|| KvmError::InvalidUri(uri.to_string()))?;
|
||||
|
||||
if !without_scheme.contains('@') || !without_scheme.contains('/') {
|
||||
return Err(KvmError::InvalidUri(format!(
|
||||
"expected qemu+ssh://user@host/system, got: {uri}"
|
||||
)));
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
39
harmony/src/modules/kvm/error.rs
Normal file
39
harmony/src/modules/kvm/error.rs
Normal file
@@ -0,0 +1,39 @@
|
||||
use std::io::Error as IoError;
|
||||
use thiserror::Error;
|
||||
|
||||
#[derive(Error, Debug)]
|
||||
pub enum KvmError {
|
||||
#[error("connection failed to '{uri}': {source}")]
|
||||
ConnectionFailed {
|
||||
uri: String,
|
||||
#[source]
|
||||
source: virt::error::Error,
|
||||
},
|
||||
|
||||
#[error("invalid connection URI: {0}")]
|
||||
InvalidUri(String),
|
||||
|
||||
#[error("VM '{name}' already exists")]
|
||||
VmAlreadyExists { name: String },
|
||||
|
||||
#[error("VM '{name}' not found")]
|
||||
VmNotFound { name: String },
|
||||
|
||||
#[error("network '{name}' already exists")]
|
||||
NetworkAlreadyExists { name: String },
|
||||
|
||||
#[error("network '{name}' not found")]
|
||||
NetworkNotFound { name: String },
|
||||
|
||||
#[error("storage pool '{name}' not found")]
|
||||
StoragePoolNotFound { name: String },
|
||||
|
||||
#[error("ISO download failed: {0}")]
|
||||
IsoDownload(String),
|
||||
|
||||
#[error("libvirt error: {0}")]
|
||||
Libvirt(#[from] virt::error::Error),
|
||||
|
||||
#[error("IO error: {0}")]
|
||||
Io(#[from] IoError),
|
||||
}
|
||||
367
harmony/src/modules/kvm/executor.rs
Normal file
367
harmony/src/modules/kvm/executor.rs
Normal file
@@ -0,0 +1,367 @@
|
||||
use log::{debug, info, warn};
|
||||
use virt::connect::Connect;
|
||||
use virt::domain::Domain;
|
||||
use virt::network::Network;
|
||||
use virt::storage_pool::StoragePool;
|
||||
use virt::storage_vol::StorageVol;
|
||||
use virt::sys;
|
||||
|
||||
use super::error::KvmError;
|
||||
use super::types::{CdromConfig, NetworkConfig, VmConfig, VmStatus};
|
||||
use super::xml;
|
||||
|
||||
/// A handle to a libvirt hypervisor.
///
/// Wraps a [`virt::connect::Connect`] and exposes high-level operations on
/// virtual machines, networks, and storage volumes. Every libvirt call runs
/// on a blocking thread via [`tokio::task::spawn_blocking`] so the async
/// executor is never blocked.
#[derive(Clone)]
pub struct KvmExecutor {
    /// Libvirt connection URI (e.g. `qemu:///system`).
    uri: String,
    /// Base directory where new VM disk images and downloaded ISOs are placed.
    image_dir: String,
}
|
||||
|
||||
impl KvmExecutor {
|
||||
/// Creates an executor that will open a libvirt connection on each
|
||||
/// blocking call. Connection is not held across calls to keep `Clone`
|
||||
/// and `Send` simple.
|
||||
pub fn new(uri: impl Into<String>, image_dir: impl Into<String>) -> Self {
|
||||
Self {
|
||||
uri: uri.into(),
|
||||
image_dir: image_dir.into(),
|
||||
}
|
||||
}
|
||||
|
||||
fn open_connection(&self) -> Result<Connect, KvmError> {
|
||||
let uri = self.uri.clone();
|
||||
Connect::open(Some(&uri)).map_err(|e| KvmError::ConnectionFailed {
|
||||
uri: uri.clone(),
|
||||
source: e,
|
||||
})
|
||||
}
|
||||
|
||||
// -------------------------------------------------------------------------
|
||||
// Networks
|
||||
// -------------------------------------------------------------------------
|
||||
|
||||
/// Ensures the given network exists and is active.
|
||||
///
|
||||
/// If the network already exists, it is started if not already active.
|
||||
/// If it does not exist, it is defined and started.
|
||||
pub async fn ensure_network(&self, cfg: NetworkConfig) -> Result<(), KvmError> {
|
||||
let executor = self.clone();
|
||||
tokio::task::spawn_blocking(move || executor.ensure_network_blocking(&cfg))
|
||||
.await
|
||||
.expect("blocking task panicked")
|
||||
}
|
||||
|
||||
fn ensure_network_blocking(&self, cfg: &NetworkConfig) -> Result<(), KvmError> {
|
||||
let conn = self.open_connection()?;
|
||||
match Network::lookup_by_name(&conn, &cfg.name) {
|
||||
Ok(net) => {
|
||||
if !net.is_active()? {
|
||||
info!("Network '{}' exists but is inactive; starting it", cfg.name);
|
||||
net.create()?;
|
||||
} else {
|
||||
debug!("Network '{}' already active", cfg.name);
|
||||
}
|
||||
if !net.get_autostart()? {
|
||||
net.set_autostart(true)?;
|
||||
}
|
||||
}
|
||||
Err(_) => {
|
||||
info!("Defining network '{}'", cfg.name);
|
||||
let xml = xml::network_xml(cfg);
|
||||
debug!("Network XML:\n{xml}");
|
||||
let net = Network::define_xml(&conn, &xml)?;
|
||||
net.create()?;
|
||||
net.set_autostart(true)?;
|
||||
info!("Network '{}' created and active", cfg.name);
|
||||
}
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Stops and removes a network. No-ops if the network does not exist.
|
||||
pub async fn delete_network(&self, name: &str) -> Result<(), KvmError> {
|
||||
let executor = self.clone();
|
||||
let name = name.to_string();
|
||||
tokio::task::spawn_blocking(move || executor.delete_network_blocking(&name))
|
||||
.await
|
||||
.expect("blocking task panicked")
|
||||
}
|
||||
|
||||
fn delete_network_blocking(&self, name: &str) -> Result<(), KvmError> {
|
||||
let conn = self.open_connection()?;
|
||||
match Network::lookup_by_name(&conn, name) {
|
||||
Ok(net) => {
|
||||
if net.is_active()? {
|
||||
info!("Destroying network '{name}'");
|
||||
net.destroy()?;
|
||||
}
|
||||
net.undefine()?;
|
||||
info!("Network '{name}' removed");
|
||||
Ok(())
|
||||
}
|
||||
Err(_) => {
|
||||
warn!("delete_network: network '{name}' not found, skipping");
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// -------------------------------------------------------------------------
|
||||
// Domains (VMs)
|
||||
// -------------------------------------------------------------------------
|
||||
|
||||
/// Returns `true` if a domain with `name` is known to libvirt.
|
||||
pub async fn vm_exists(&self, name: &str) -> Result<bool, KvmError> {
|
||||
let executor = self.clone();
|
||||
let name = name.to_string();
|
||||
tokio::task::spawn_blocking(move || executor.vm_exists_blocking(&name))
|
||||
.await
|
||||
.expect("blocking task panicked")
|
||||
}
|
||||
|
||||
fn vm_exists_blocking(&self, name: &str) -> Result<bool, KvmError> {
|
||||
let conn = self.open_connection()?;
|
||||
match Domain::lookup_by_name(&conn, name) {
|
||||
Ok(_) => Ok(true),
|
||||
Err(_) => Ok(false),
|
||||
}
|
||||
}
|
||||
|
||||
/// Defines a VM from `config`, creating storage volumes as needed.
|
||||
///
|
||||
/// Fails if the VM already exists. Use [`KvmExecutor::ensure_vm`] for
|
||||
/// idempotent behaviour.
|
||||
pub async fn define_vm(&self, config: VmConfig) -> Result<(), KvmError> {
|
||||
let executor = self.clone();
|
||||
tokio::task::spawn_blocking(move || executor.define_vm_blocking(&config))
|
||||
.await
|
||||
.expect("blocking task panicked")
|
||||
}
|
||||
|
||||
fn define_vm_blocking(&self, config: &VmConfig) -> Result<(), KvmError> {
|
||||
let conn = self.open_connection()?;
|
||||
|
||||
if Domain::lookup_by_name(&conn, &config.name).is_ok() {
|
||||
return Err(KvmError::VmAlreadyExists {
|
||||
name: config.name.clone(),
|
||||
});
|
||||
}
|
||||
|
||||
self.create_volumes_blocking(&conn, config)?;
|
||||
|
||||
let xml = xml::domain_xml(config, &self.image_dir);
|
||||
debug!("Defining domain '{}' with XML:\n{xml}", config.name);
|
||||
Domain::define_xml(&conn, &xml)?;
|
||||
info!("VM '{}' defined", config.name);
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Idempotent: defines the VM if it does not already exist.
|
||||
pub async fn ensure_vm(&self, config: VmConfig) -> Result<(), KvmError> {
|
||||
let executor = self.clone();
|
||||
tokio::task::spawn_blocking(move || executor.ensure_vm_blocking(&config))
|
||||
.await
|
||||
.expect("blocking task panicked")
|
||||
}
|
||||
|
||||
fn ensure_vm_blocking(&self, config: &VmConfig) -> Result<(), KvmError> {
|
||||
let conn = self.open_connection()?;
|
||||
if Domain::lookup_by_name(&conn, &config.name).is_ok() {
|
||||
debug!("VM '{}' already defined, skipping", config.name);
|
||||
return Ok(());
|
||||
}
|
||||
self.create_volumes_blocking(&conn, config)?;
|
||||
let xml = xml::domain_xml(config, &self.image_dir);
|
||||
debug!("Defining domain '{}' with XML:\n{xml}", config.name);
|
||||
Domain::define_xml(&conn, &xml)?;
|
||||
info!("VM '{}' defined", config.name);
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Starts a defined VM.
|
||||
pub async fn start_vm(&self, name: &str) -> Result<(), KvmError> {
|
||||
let executor = self.clone();
|
||||
let name = name.to_string();
|
||||
tokio::task::spawn_blocking(move || executor.start_vm_blocking(&name))
|
||||
.await
|
||||
.expect("blocking task panicked")
|
||||
}
|
||||
|
||||
fn start_vm_blocking(&self, name: &str) -> Result<(), KvmError> {
|
||||
let conn = self.open_connection()?;
|
||||
let dom = Domain::lookup_by_name(&conn, name).map_err(|_| KvmError::VmNotFound {
|
||||
name: name.to_string(),
|
||||
})?;
|
||||
dom.create()?;
|
||||
info!("VM '{name}' started");
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Gracefully shuts down a VM.
|
||||
pub async fn shutdown_vm(&self, name: &str) -> Result<(), KvmError> {
|
||||
let executor = self.clone();
|
||||
let name = name.to_string();
|
||||
tokio::task::spawn_blocking(move || executor.shutdown_vm_blocking(&name))
|
||||
.await
|
||||
.expect("blocking task panicked")
|
||||
}
|
||||
|
||||
fn shutdown_vm_blocking(&self, name: &str) -> Result<(), KvmError> {
|
||||
let conn = self.open_connection()?;
|
||||
let dom = Domain::lookup_by_name(&conn, name).map_err(|_| KvmError::VmNotFound {
|
||||
name: name.to_string(),
|
||||
})?;
|
||||
dom.shutdown()?;
|
||||
info!("VM '{name}' shutdown requested");
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Forcibly powers off a VM without a graceful shutdown.
|
||||
pub async fn destroy_vm(&self, name: &str) -> Result<(), KvmError> {
|
||||
let executor = self.clone();
|
||||
let name = name.to_string();
|
||||
tokio::task::spawn_blocking(move || executor.destroy_vm_blocking(&name))
|
||||
.await
|
||||
.expect("blocking task panicked")
|
||||
}
|
||||
|
||||
fn destroy_vm_blocking(&self, name: &str) -> Result<(), KvmError> {
|
||||
let conn = self.open_connection()?;
|
||||
let dom = Domain::lookup_by_name(&conn, name).map_err(|_| KvmError::VmNotFound {
|
||||
name: name.to_string(),
|
||||
})?;
|
||||
dom.destroy()?;
|
||||
info!("VM '{name}' forcibly destroyed");
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Undefines (removes) a VM. The VM must not be running.
|
||||
pub async fn undefine_vm(&self, name: &str) -> Result<(), KvmError> {
|
||||
let executor = self.clone();
|
||||
let name = name.to_string();
|
||||
tokio::task::spawn_blocking(move || executor.undefine_vm_blocking(&name))
|
||||
.await
|
||||
.expect("blocking task panicked")
|
||||
}
|
||||
|
||||
fn undefine_vm_blocking(&self, name: &str) -> Result<(), KvmError> {
|
||||
let conn = self.open_connection()?;
|
||||
match Domain::lookup_by_name(&conn, name) {
|
||||
Ok(dom) => {
|
||||
dom.undefine()?;
|
||||
info!("VM '{name}' undefined");
|
||||
Ok(())
|
||||
}
|
||||
Err(_) => {
|
||||
warn!("undefine_vm: VM '{name}' not found, skipping");
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Returns the current [`VmStatus`] of a VM.
|
||||
pub async fn vm_status(&self, name: &str) -> Result<VmStatus, KvmError> {
|
||||
let executor = self.clone();
|
||||
let name = name.to_string();
|
||||
tokio::task::spawn_blocking(move || executor.vm_status_blocking(&name))
|
||||
.await
|
||||
.expect("blocking task panicked")
|
||||
}
|
||||
|
||||
fn vm_status_blocking(&self, name: &str) -> Result<VmStatus, KvmError> {
|
||||
let conn = self.open_connection()?;
|
||||
let dom = Domain::lookup_by_name(&conn, name).map_err(|_| KvmError::VmNotFound {
|
||||
name: name.to_string(),
|
||||
})?;
|
||||
let (state, _reason) = dom.get_state()?;
|
||||
let status = match state {
|
||||
sys::VIR_DOMAIN_RUNNING | sys::VIR_DOMAIN_BLOCKED => VmStatus::Running,
|
||||
sys::VIR_DOMAIN_PAUSED => VmStatus::Paused,
|
||||
sys::VIR_DOMAIN_SHUTDOWN | sys::VIR_DOMAIN_SHUTOFF => VmStatus::Shutoff,
|
||||
sys::VIR_DOMAIN_CRASHED => VmStatus::Crashed,
|
||||
sys::VIR_DOMAIN_PMSUSPENDED => VmStatus::PMSuspended,
|
||||
_ => VmStatus::Other,
|
||||
};
|
||||
Ok(status)
|
||||
}
|
||||
|
||||
// -------------------------------------------------------------------------
|
||||
// Storage
|
||||
// -------------------------------------------------------------------------
|
||||
|
||||
fn create_volumes_blocking(&self, conn: &Connect, config: &VmConfig) -> Result<(), KvmError> {
|
||||
for disk in &config.disks {
|
||||
let pool = StoragePool::lookup_by_name(conn, &disk.pool).map_err(|_| {
|
||||
KvmError::StoragePoolNotFound {
|
||||
name: disk.pool.clone(),
|
||||
}
|
||||
})?;
|
||||
|
||||
let vol_name = format!("{}-{}", config.name, disk.device);
|
||||
match StorageVol::lookup_by_name(&pool, &format!("{vol_name}.qcow2")) {
|
||||
Ok(_) => {
|
||||
debug!(
|
||||
"Volume '{vol_name}.qcow2' already exists in pool '{}'",
|
||||
disk.pool
|
||||
);
|
||||
}
|
||||
Err(_) => {
|
||||
info!(
|
||||
"Creating volume '{vol_name}.qcow2' ({} GiB) in pool '{}'",
|
||||
disk.size_gb, disk.pool
|
||||
);
|
||||
let xml = xml::volume_xml(&vol_name, disk.size_gb);
|
||||
StorageVol::create_xml(&pool, &xml, 0)?;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
for cdrom in &config.cdroms {
|
||||
self.prepare_iso_blocking(cdrom)?;
|
||||
}
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
fn prepare_iso_blocking(&self, cdrom: &CdromConfig) -> Result<(), KvmError> {
|
||||
let source = &cdrom.source;
|
||||
|
||||
if source.starts_with("http://") || source.starts_with("https://") {
|
||||
let file_name = source.split('/').last().unwrap_or("downloaded.iso");
|
||||
let target_path = format!("{}/{}", self.image_dir, file_name);
|
||||
|
||||
if std::path::Path::new(&target_path).exists() {
|
||||
info!("ISO '{}' already downloaded, skipping", file_name);
|
||||
return Ok(());
|
||||
}
|
||||
|
||||
info!("Downloading ISO '{}' to '{}'", file_name, target_path);
|
||||
self.download_iso_blocking(source, &target_path)?;
|
||||
info!("ISO '{}' downloaded successfully", file_name);
|
||||
}
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
fn download_iso_blocking(&self, url: &str, target_path: &str) -> Result<(), KvmError> {
|
||||
let response =
|
||||
reqwest::blocking::get(url).map_err(|e| KvmError::IsoDownload(e.to_string()))?;
|
||||
|
||||
let mut file = std::fs::File::create(target_path)?;
|
||||
|
||||
let content = response
|
||||
.bytes()
|
||||
.map_err(|e| KvmError::IsoDownload(e.to_string()))?;
|
||||
|
||||
std::io::copy(&mut content.as_ref(), &mut file)?;
|
||||
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
13
harmony/src/modules/kvm/mod.rs
Normal file
13
harmony/src/modules/kvm/mod.rs
Normal file
@@ -0,0 +1,13 @@
|
||||
//! KVM/libvirt virtual machine management.
//!
//! XML rendering is an internal detail ([`xml`]); the public surface is
//! re-exported below so callers can `use …::kvm::{KvmExecutor, VmConfig, …}`.

mod xml;

pub mod config;
pub mod error;
pub mod executor;
pub mod types;

pub use error::KvmError;
pub use executor::KvmExecutor;
pub use types::{
    BootDevice, CdromConfig, DiskConfig, ForwardMode, NetworkConfig, NetworkConfigBuilder,
    NetworkRef, VmConfig, VmConfigBuilder, VmStatus,
};
|
||||
318
harmony/src/modules/kvm/types.rs
Normal file
318
harmony/src/modules/kvm/types.rs
Normal file
@@ -0,0 +1,318 @@
|
||||
use serde::{Deserialize, Serialize};
|
||||
|
||||
/// Specifies how a KVM host is accessed.
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub enum KvmConnectionUri {
|
||||
/// Local hypervisor via UNIX socket. Equivalent to `qemu:///system`.
|
||||
Local,
|
||||
/// Remote hypervisor over SSH. Equivalent to `qemu+ssh://user@host/system`.
|
||||
RemoteSsh { host: String, username: String },
|
||||
}
|
||||
|
||||
impl KvmConnectionUri {
|
||||
/// Returns the libvirt URI string for this connection.
|
||||
pub fn as_uri(&self) -> String {
|
||||
match self {
|
||||
KvmConnectionUri::Local => "qemu:///system".to_string(),
|
||||
KvmConnectionUri::RemoteSsh { host, username } => {
|
||||
format!("qemu+ssh://{username}@{host}/system")
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Configuration for a virtual disk attached to a VM.
///
/// Volumes are created lazily by the executor; the backing file is named
/// `<vm-name>-<device>.qcow2` (see the executor's volume handling).
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct DiskConfig {
    /// Disk size in gigabytes.
    pub size_gb: u32,
    /// Target device name in the guest (e.g. `vda`, `vdb`).
    pub device: String,
    /// Storage pool to allocate the volume from. Defaults to `"default"`.
    pub pool: String,
}

/// Configuration for a CD-ROM/ISO device attached to a VM.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct CdromConfig {
    /// Path or URL to the ISO image. If it starts with `http` or `https`, it will be downloaded.
    pub source: String,
    /// Target device name in the guest (e.g. `hda`, `hdb`). Defaults to `hda`.
    pub device: String,
}
|
||||
|
||||
impl DiskConfig {
|
||||
/// Creates a new disk config with sequential virtio device naming.
|
||||
///
|
||||
/// `index` maps 0 → `vda`, 1 → `vdb`, etc.
|
||||
pub fn new(size_gb: u32, index: u8) -> Self {
|
||||
let device = format!("vd{}", (b'a' + index) as char);
|
||||
Self {
|
||||
size_gb,
|
||||
device,
|
||||
pool: "default".to_string(),
|
||||
}
|
||||
}
|
||||
|
||||
/// Override the storage pool.
|
||||
pub fn from_pool(mut self, pool: impl Into<String>) -> Self {
|
||||
self.pool = pool.into();
|
||||
self
|
||||
}
|
||||
}
|
||||
|
||||
/// A reference to a libvirt virtual network by name.
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub struct NetworkRef {
|
||||
/// Libvirt network name (e.g. `"harmonylan"`).
|
||||
pub name: String,
|
||||
/// Optional fixed MAC address for this interface. When `None`, libvirt
|
||||
/// assigns one automatically.
|
||||
pub mac: Option<String>,
|
||||
}
|
||||
|
||||
impl NetworkRef {
|
||||
pub fn named(name: impl Into<String>) -> Self {
|
||||
Self {
|
||||
name: name.into(),
|
||||
mac: None,
|
||||
}
|
||||
}
|
||||
|
||||
pub fn with_mac(mut self, mac: impl Into<String>) -> Self {
|
||||
self.mac = Some(mac.into());
|
||||
self
|
||||
}
|
||||
}
|
||||
|
||||
/// Boot device priority entry.
|
||||
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
|
||||
pub enum BootDevice {
|
||||
/// Boot from first hard disk (vda)
|
||||
Disk,
|
||||
/// Boot from network (PXE)
|
||||
Network,
|
||||
/// Boot from CD-ROM/ISO
|
||||
Cdrom,
|
||||
}
|
||||
|
||||
impl BootDevice {
|
||||
pub(crate) fn as_xml_dev(&self) -> &'static str {
|
||||
match self {
|
||||
BootDevice::Disk => "hd",
|
||||
BootDevice::Network => "network",
|
||||
BootDevice::Cdrom => "cdrom",
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Full configuration for a KVM virtual machine.
///
/// Instances are usually assembled via [`VmConfig::builder`] rather than by
/// filling the fields directly.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct VmConfig {
    /// VM name, must be unique on the host.
    pub name: String,
    /// Number of virtual CPUs.
    pub vcpus: u32,
    /// Memory in mebibytes (MiB).
    pub memory_mib: u64,
    /// Disks to attach, in order.
    pub disks: Vec<DiskConfig>,
    /// Network interfaces to attach, in order.
    pub networks: Vec<NetworkRef>,
    /// CD-ROM/ISO devices to attach.
    pub cdroms: Vec<CdromConfig>,
    /// Boot order. First entry has highest priority.
    pub boot_order: Vec<BootDevice>,
}

impl VmConfig {
    /// Starts a fluent [`VmConfigBuilder`] for a VM with the given name.
    pub fn builder(name: impl Into<String>) -> VmConfigBuilder {
        VmConfigBuilder::new(name)
    }
}
|
||||
|
||||
/// Builder for [`VmConfig`].
///
/// Defaults (set in [`VmConfigBuilder::new`]): 2 vCPUs, 4096 MiB RAM, and no
/// disks, networks, CD-ROMs, or boot order entries.
#[derive(Debug)]
pub struct VmConfigBuilder {
    name: String,
    vcpus: u32,      // default: 2
    memory_mib: u64, // default: 4096
    disks: Vec<DiskConfig>,
    networks: Vec<NetworkRef>,
    cdroms: Vec<CdromConfig>,
    boot_order: Vec<BootDevice>,
}
|
||||
|
||||
impl VmConfigBuilder {
|
||||
pub fn new(name: impl Into<String>) -> Self {
|
||||
Self {
|
||||
name: name.into(),
|
||||
vcpus: 2,
|
||||
memory_mib: 4096,
|
||||
disks: vec![],
|
||||
networks: vec![],
|
||||
cdroms: vec![],
|
||||
boot_order: vec![],
|
||||
}
|
||||
}
|
||||
|
||||
pub fn vcpus(mut self, vcpus: u32) -> Self {
|
||||
self.vcpus = vcpus;
|
||||
self
|
||||
}
|
||||
|
||||
/// Convenience shorthand: sets memory in whole gigabytes.
|
||||
pub fn memory_gb(mut self, gb: u32) -> Self {
|
||||
self.memory_mib = gb as u64 * 1024;
|
||||
self
|
||||
}
|
||||
|
||||
pub fn memory_mib(mut self, mib: u64) -> Self {
|
||||
self.memory_mib = mib;
|
||||
self
|
||||
}
|
||||
|
||||
/// Appends a disk. Devices are named sequentially: `vda`, `vdb`, …
|
||||
pub fn disk(mut self, size_gb: u32) -> Self {
|
||||
let idx = self.disks.len() as u8;
|
||||
self.disks.push(DiskConfig::new(size_gb, idx));
|
||||
self
|
||||
}
|
||||
|
||||
/// Appends a disk with an explicit pool override.
|
||||
pub fn disk_from_pool(mut self, size_gb: u32, pool: impl Into<String>) -> Self {
|
||||
let idx = self.disks.len() as u8;
|
||||
self.disks
|
||||
.push(DiskConfig::new(size_gb, idx).from_pool(pool));
|
||||
self
|
||||
}
|
||||
|
||||
pub fn network(mut self, net: NetworkRef) -> Self {
|
||||
self.networks.push(net);
|
||||
self
|
||||
}
|
||||
|
||||
/// Attaches a CD-ROM with the given ISO source.
|
||||
///
|
||||
/// The source can be a local path or an HTTP/HTTPS URL that will be
|
||||
/// downloaded to the image directory.
|
||||
pub fn cdrom(mut self, source: impl Into<String>) -> Self {
|
||||
self.cdroms.push(CdromConfig {
|
||||
source: source.into(),
|
||||
device: "hda".to_string(),
|
||||
});
|
||||
self
|
||||
}
|
||||
|
||||
pub fn boot_order(mut self, order: impl IntoIterator<Item = BootDevice>) -> Self {
|
||||
self.boot_order = order.into_iter().collect();
|
||||
self
|
||||
}
|
||||
|
||||
pub fn build(self) -> VmConfig {
|
||||
VmConfig {
|
||||
name: self.name,
|
||||
vcpus: self.vcpus,
|
||||
memory_mib: self.memory_mib,
|
||||
disks: self.disks,
|
||||
networks: self.networks,
|
||||
cdroms: self.cdroms,
|
||||
boot_order: self.boot_order,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Configuration for an isolated virtual network.
///
/// Rendered to libvirt XML by the crate-internal `xml::network_xml`.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct NetworkConfig {
    /// Libvirt network name.
    pub name: String,
    /// Bridge device name (e.g. `"virbr100"`).
    pub bridge: String,
    /// Gateway IP address of the network.
    pub gateway_ip: String,
    /// Network prefix length (e.g. `24`).
    pub prefix_len: u8,
    /// Forward mode. When `None`, the network is fully isolated.
    pub forward_mode: Option<ForwardMode>,
}

/// Libvirt network forward mode.
#[derive(Debug, Clone, Copy, Serialize, Deserialize)]
pub enum ForwardMode {
    /// NAT the network behind the host (`<forward mode='nat'/>`).
    Nat,
    /// Route traffic without NAT (`<forward mode='route'/>`).
    Route,
}

impl NetworkConfig {
    /// Starts a fluent [`NetworkConfigBuilder`] for a network with this name.
    pub fn builder(name: impl Into<String>) -> NetworkConfigBuilder {
        NetworkConfigBuilder::new(name)
    }
}
|
||||
|
||||
/// Builder for [`NetworkConfig`].
///
/// Defaults (set in [`NetworkConfigBuilder::new`]): gateway `192.168.100.1`,
/// prefix `/24`, NAT forwarding, and a bridge name derived from the network
/// name when none is set explicitly.
#[derive(Debug)]
pub struct NetworkConfigBuilder {
    name: String,
    bridge: Option<String>, // None ⇒ derived as "virbr-<name-without-dashes>" in build()
    gateway_ip: String,
    prefix_len: u8,
    forward_mode: Option<ForwardMode>,
}
|
||||
|
||||
impl NetworkConfigBuilder {
|
||||
pub fn new(name: impl Into<String>) -> Self {
|
||||
Self {
|
||||
name: name.into(),
|
||||
bridge: None,
|
||||
gateway_ip: "192.168.100.1".to_string(),
|
||||
prefix_len: 24,
|
||||
forward_mode: Some(ForwardMode::Nat),
|
||||
}
|
||||
}
|
||||
|
||||
pub fn bridge(mut self, bridge: impl Into<String>) -> Self {
|
||||
self.bridge = Some(bridge.into());
|
||||
self
|
||||
}
|
||||
|
||||
/// Sets the gateway IP and prefix length (e.g. `"192.168.100.1"`, `24`).
|
||||
pub fn subnet(mut self, gateway_ip: impl Into<String>, prefix_len: u8) -> Self {
|
||||
self.gateway_ip = gateway_ip.into();
|
||||
self.prefix_len = prefix_len;
|
||||
self
|
||||
}
|
||||
|
||||
pub fn isolated(mut self) -> Self {
|
||||
self.forward_mode = None;
|
||||
self
|
||||
}
|
||||
|
||||
pub fn forward(mut self, mode: ForwardMode) -> Self {
|
||||
self.forward_mode = Some(mode);
|
||||
self
|
||||
}
|
||||
|
||||
pub fn build(self) -> NetworkConfig {
|
||||
NetworkConfig {
|
||||
bridge: self
|
||||
.bridge
|
||||
.unwrap_or_else(|| format!("virbr-{}", self.name.replace('-', ""))),
|
||||
name: self.name,
|
||||
gateway_ip: self.gateway_ip,
|
||||
prefix_len: self.prefix_len,
|
||||
forward_mode: self.forward_mode,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Current state of a VM as returned by libvirt.
///
/// Derives `Serialize` only — presumably statuses are reported outward and
/// never parsed back in (TODO confirm; add `Deserialize` if callers need it).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize)]
pub enum VmStatus {
    /// Running or blocked on a resource (both map here).
    Running,
    Paused,
    /// Shut down or in the process of shutting down.
    Shutoff,
    Crashed,
    /// Suspended by guest power management.
    PMSuspended,
    /// Any libvirt state not covered by the variants above.
    Other,
}
|
||||
194
harmony/src/modules/kvm/xml.rs
Normal file
194
harmony/src/modules/kvm/xml.rs
Normal file
@@ -0,0 +1,194 @@
|
||||
use super::types::{CdromConfig, DiskConfig, ForwardMode, NetworkConfig, VmConfig};
|
||||
|
||||
/// Renders the libvirt domain XML for a VM definition.
|
||||
///
|
||||
/// The caller passes the image directory where qcow2 volumes are stored.
|
||||
pub fn domain_xml(vm: &VmConfig, image_dir: &str) -> String {
|
||||
let memory_kib = vm.memory_mib * 1024;
|
||||
|
||||
let os_boot = vm
|
||||
.boot_order
|
||||
.iter()
|
||||
.map(|b| format!(" <boot dev='{}'/>\n", b.as_xml_dev()))
|
||||
.collect::<String>();
|
||||
|
||||
let devices = {
|
||||
let disks = disk_devices(vm, image_dir);
|
||||
let cdroms = cdrom_devices(vm);
|
||||
let nics = nic_devices(vm);
|
||||
format!("{disks}{cdroms}{nics}")
|
||||
};
|
||||
|
||||
format!(
|
||||
r#"<domain type='kvm'>
|
||||
<name>{name}</name>
|
||||
<memory unit='KiB'>{memory_kib}</memory>
|
||||
<vcpu>{vcpus}</vcpu>
|
||||
<os>
|
||||
<type arch='x86_64' machine='q35'>hvm</type>
|
||||
{os_boot} </os>
|
||||
<features>
|
||||
<acpi/>
|
||||
<apic/>
|
||||
</features>
|
||||
<cpu mode='host-model'/>
|
||||
<devices>
|
||||
<emulator>/usr/bin/qemu-system-x86_64</emulator>
|
||||
{devices} <serial type='pty'>
|
||||
<target port='0'/>
|
||||
</serial>
|
||||
<console type='pty'>
|
||||
<target type='serial' port='0'/>
|
||||
</console>
|
||||
</devices>
|
||||
</domain>"#,
|
||||
name = vm.name,
|
||||
memory_kib = memory_kib,
|
||||
vcpus = vm.vcpus,
|
||||
os_boot = os_boot,
|
||||
devices = devices,
|
||||
)
|
||||
}
|
||||
|
||||
fn disk_devices(vm: &VmConfig, image_dir: &str) -> String {
|
||||
vm.disks
|
||||
.iter()
|
||||
.map(|d| format_disk(vm, d, image_dir))
|
||||
.collect()
|
||||
}
|
||||
|
||||
fn cdrom_devices(vm: &VmConfig) -> String {
|
||||
vm.cdroms.iter().map(|c| format_cdrom(c)).collect()
|
||||
}
|
||||
|
||||
/// Renders one `<disk>` element backed by the qcow2 file
/// `<image_dir>/<vm-name>-<device>.qcow2`.
fn format_disk(vm: &VmConfig, disk: &DiskConfig, image_dir: &str) -> String {
    let path = format!("{image_dir}/{}-{}.qcow2", vm.name, disk.device);
    let dev = &disk.device;
    format!(
        r#"    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='{path}'/>
      <target dev='{dev}' bus='virtio'/>
    </disk>
"#
    )
}
|
||||
|
||||
fn format_cdrom(cdrom: &CdromConfig) -> String {
|
||||
let source = &cdrom.source;
|
||||
let dev = &cdrom.device;
|
||||
let device_type = if source.starts_with("http://") || source.starts_with("https://") {
|
||||
"cdrom"
|
||||
} else {
|
||||
"cdrom"
|
||||
};
|
||||
format!(
|
||||
r#" <disk type='file' device='{device_type}'>
|
||||
<driver name='qemu' type='raw'/>
|
||||
<source file='{source}'/>
|
||||
<target dev='{dev}' bus='ide'/>
|
||||
</disk>
|
||||
"#,
|
||||
source = source,
|
||||
dev = dev,
|
||||
device_type = device_type,
|
||||
)
|
||||
}
|
||||
|
||||
fn nic_devices(vm: &VmConfig) -> String {
|
||||
vm.networks
|
||||
.iter()
|
||||
.map(|net| {
|
||||
let mac_line = net
|
||||
.mac
|
||||
.as_deref()
|
||||
.map(|m| format!("\n <mac address='{m}'/>"))
|
||||
.unwrap_or_default();
|
||||
format!(
|
||||
r#" <interface type='network'>
|
||||
<source network='{network}'/>{mac}
|
||||
<model type='virtio'/>
|
||||
</interface>
|
||||
"#,
|
||||
network = net.name,
|
||||
mac = mac_line,
|
||||
)
|
||||
})
|
||||
.collect()
|
||||
}
|
||||
|
||||
/// Renders the libvirt network XML for a virtual network definition.
|
||||
pub fn network_xml(cfg: &NetworkConfig) -> String {
|
||||
let forward = match cfg.forward_mode {
|
||||
Some(ForwardMode::Nat) => " <forward mode='nat'/>\n",
|
||||
Some(ForwardMode::Route) => " <forward mode='route'/>\n",
|
||||
None => "",
|
||||
};
|
||||
|
||||
format!(
|
||||
r#"<network>
|
||||
<name>{name}</name>
|
||||
<bridge name='{bridge}' stp='on' delay='0'/>
|
||||
{forward} <ip address='{gateway}' prefix='{prefix}'/>
|
||||
</network>"#,
|
||||
name = cfg.name,
|
||||
bridge = cfg.bridge,
|
||||
forward = forward,
|
||||
gateway = cfg.gateway_ip,
|
||||
prefix = cfg.prefix_len,
|
||||
)
|
||||
}
|
||||
|
||||
/// Renders the libvirt storage volume XML for a qcow2 disk.
///
/// The volume is named `<name>.qcow2` and sized in whole GiB.
pub fn volume_xml(name: &str, size_gb: u32) -> String {
    let capacity: u64 = u64::from(size_gb) * 1024 * 1024 * 1024;
    format!(
        r#"<volume>
  <name>{name}.qcow2</name>
  <capacity unit='bytes'>{capacity}</capacity>
  <target>
    <format type='qcow2'/>
  </target>
</volume>"#
    )
}
|
||||
|
||||
#[cfg(test)]
mod tests {
    use super::*;
    use crate::modules::kvm::types::{BootDevice, NetworkRef, VmConfig};

    // Smoke test: the rendered domain XML carries the VM name, the attached
    // network, and both boot entries in libvirt's `dev` spelling
    // (Network → 'network', Disk → 'hd').
    #[test]
    fn domain_xml_contains_vm_name() {
        let vm = VmConfig::builder("test-vm")
            .vcpus(2)
            .memory_gb(4)
            .disk(20)
            .network(NetworkRef::named("mynet"))
            .boot_order([BootDevice::Network, BootDevice::Disk])
            .build();

        let xml = domain_xml(&vm, "/var/lib/libvirt/images");
        assert!(xml.contains("<name>test-vm</name>"));
        assert!(xml.contains("source network='mynet'"));
        assert!(xml.contains("boot dev='network'"));
        assert!(xml.contains("boot dev='hd'"));
    }

    // An isolated network must not emit any <forward/> element, while the
    // configured gateway address still appears.
    #[test]
    fn network_xml_isolated_has_no_forward() {
        use crate::modules::kvm::types::NetworkConfig;

        let cfg = NetworkConfig::builder("testnet")
            .subnet("10.0.0.1", 24)
            .isolated()
            .build();

        let xml = network_xml(&cfg);
        assert!(!xml.contains("<forward"));
        assert!(xml.contains("10.0.0.1"));
    }
}
|
||||
@@ -10,6 +10,7 @@ pub mod http;
|
||||
pub mod inventory;
|
||||
pub mod k3d;
|
||||
pub mod k8s;
|
||||
pub mod kvm;
|
||||
pub mod lamp;
|
||||
pub mod load_balancer;
|
||||
pub mod monitoring;
|
||||
|
||||
56
harmony_assets/Cargo.toml
Normal file
56
harmony_assets/Cargo.toml
Normal file
@@ -0,0 +1,56 @@
|
||||
[package]
|
||||
name = "harmony_assets"
|
||||
edition = "2024"
|
||||
version.workspace = true
|
||||
readme.workspace = true
|
||||
license.workspace = true
|
||||
|
||||
[lib]
|
||||
name = "harmony_assets"
|
||||
|
||||
[[bin]]
|
||||
name = "harmony_assets"
|
||||
path = "src/cli/mod.rs"
|
||||
required-features = ["cli"]
|
||||
|
||||
[features]
|
||||
default = ["blake3"]
|
||||
sha256 = ["dep:sha2"]
|
||||
blake3 = ["dep:blake3"]
|
||||
s3 = [
|
||||
"dep:aws-sdk-s3",
|
||||
"dep:aws-config",
|
||||
]
|
||||
cli = [
|
||||
"dep:clap",
|
||||
"dep:indicatif",
|
||||
"dep:inquire",
|
||||
]
|
||||
reqwest = ["dep:reqwest"]
|
||||
|
||||
[dependencies]
|
||||
log.workspace = true
|
||||
tokio.workspace = true
|
||||
thiserror.workspace = true
|
||||
directories.workspace = true
|
||||
sha2 = { version = "0.10", optional = true }
|
||||
blake3 = { version = "1.5", optional = true }
|
||||
reqwest = { version = "0.12", optional = true, default-features = false, features = ["stream", "rustls-tls"] }
|
||||
futures-util.workspace = true
|
||||
async-trait.workspace = true
|
||||
url.workspace = true
|
||||
|
||||
# CLI only
|
||||
clap = { version = "4.5", features = ["derive"], optional = true }
|
||||
indicatif = { version = "0.18", optional = true }
|
||||
inquire = { version = "0.7", optional = true }
|
||||
|
||||
# S3 only
|
||||
aws-sdk-s3 = { version = "1", optional = true }
|
||||
aws-config = { version = "1", optional = true }
|
||||
|
||||
[dev-dependencies]
|
||||
tempfile.workspace = true
|
||||
httptest = "0.16"
|
||||
pretty_assertions.workspace = true
|
||||
tokio-test.workspace = true
|
||||
80
harmony_assets/src/asset.rs
Normal file
80
harmony_assets/src/asset.rs
Normal file
@@ -0,0 +1,80 @@
|
||||
use crate::hash::ChecksumAlgo;
|
||||
use std::path::PathBuf;
|
||||
use url::Url;
|
||||
|
||||
/// Describes a downloadable binary asset addressed by URL and checksum.
#[derive(Debug, Clone)]
pub struct Asset {
    /// Where the asset can be fetched from.
    pub url: Url,
    /// Checksum of the file contents; algorithm given by `checksum_algo`.
    pub checksum: String,
    /// Algorithm `checksum` was computed with.
    pub checksum_algo: ChecksumAlgo,
    /// File name used when materializing the asset in the local cache.
    pub file_name: String,
    /// Expected size in bytes when known (set via [`Asset::with_size`]).
    pub size: Option<u64>,
}
|
||||
|
||||
impl Asset {
|
||||
pub fn new(url: Url, checksum: String, checksum_algo: ChecksumAlgo, file_name: String) -> Self {
|
||||
Self {
|
||||
url,
|
||||
checksum,
|
||||
checksum_algo,
|
||||
file_name,
|
||||
size: None,
|
||||
}
|
||||
}
|
||||
|
||||
pub fn with_size(mut self, size: u64) -> Self {
|
||||
self.size = Some(size);
|
||||
self
|
||||
}
|
||||
|
||||
pub fn formatted_checksum(&self) -> String {
|
||||
crate::hash::format_checksum(&self.checksum, self.checksum_algo.clone())
|
||||
}
|
||||
}
|
||||
|
||||
/// On-disk cache for downloaded assets.
///
/// Assets are laid out as `<base_dir>/<checksum-prefix>/<file_name>`; see
/// the `path_for` / `cache_key_dir` methods.
#[derive(Debug, Clone)]
pub struct LocalCache {
    /// Root directory of the cache.
    pub base_dir: PathBuf,
}
|
||||
|
||||
impl LocalCache {
|
||||
pub fn new(base_dir: PathBuf) -> Self {
|
||||
Self { base_dir }
|
||||
}
|
||||
|
||||
pub fn path_for(&self, asset: &Asset) -> PathBuf {
|
||||
let prefix = &asset.checksum[..16.min(asset.checksum.len())];
|
||||
self.base_dir.join(prefix).join(&asset.file_name)
|
||||
}
|
||||
|
||||
pub fn cache_key_dir(&self, asset: &Asset) -> PathBuf {
|
||||
let prefix = &asset.checksum[..16.min(asset.checksum.len())];
|
||||
self.base_dir.join(prefix)
|
||||
}
|
||||
|
||||
pub async fn ensure_dir(&self, asset: &Asset) -> Result<(), crate::errors::AssetError> {
|
||||
let dir = self.cache_key_dir(asset);
|
||||
tokio::fs::create_dir_all(&dir)
|
||||
.await
|
||||
.map_err(|e| crate::errors::AssetError::IoError(e))?;
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
|
||||
impl Default for LocalCache {
|
||||
fn default() -> Self {
|
||||
let base_dir = directories::ProjectDirs::from("io", "NationTech", "Harmony")
|
||||
.map(|dirs| dirs.cache_dir().join("assets"))
|
||||
.unwrap_or_else(|| PathBuf::from("/tmp/harmony_assets"));
|
||||
Self::new(base_dir)
|
||||
}
|
||||
}
|
||||
|
||||
/// Result of storing an asset in a remote store (e.g. S3): the canonical
/// URL, the verified checksum, and the object key it was stored under.
#[derive(Debug, Clone)]
pub struct StoredAsset {
    /// Public URL of the stored object.
    pub url: Url,
    /// Checksum of the uploaded contents.
    pub checksum: String,
    /// Algorithm `checksum` was computed with.
    pub checksum_algo: ChecksumAlgo,
    /// Size of the uploaded object in bytes.
    pub size: u64,
    /// Object key within the store/bucket.
    pub key: String,
}
|
||||
25
harmony_assets/src/cli/checksum.rs
Normal file
25
harmony_assets/src/cli/checksum.rs
Normal file
@@ -0,0 +1,25 @@
|
||||
use clap::Parser;
|
||||
|
||||
// Arguments for `harmony_assets checksum`.
// NOTE: plain `//` comments are used on the fields on purpose — `///` doc
// comments would become clap help text and change the CLI surface.
#[derive(Parser, Debug)]
pub struct ChecksumArgs {
    // Path of the local file to hash.
    pub path: String,
    // Checksum algorithm name understood by ChecksumAlgo::from_str.
    #[arg(short, long, default_value = "blake3")]
    pub algo: String,
}
|
||||
|
||||
pub async fn execute(args: ChecksumArgs) -> Result<(), Box<dyn std::error::Error>> {
|
||||
use harmony_assets::{ChecksumAlgo, checksum_for_path};
|
||||
|
||||
let path = std::path::Path::new(&args.path);
|
||||
if !path.exists() {
|
||||
eprintln!("Error: File not found: {}", args.path);
|
||||
std::process::exit(1);
|
||||
}
|
||||
|
||||
let algo = ChecksumAlgo::from_str(&args.algo)?;
|
||||
let checksum = checksum_for_path(path, algo.clone()).await?;
|
||||
|
||||
println!("{}:{} {}", algo.name(), checksum, args.path);
|
||||
|
||||
Ok(())
|
||||
}
|
||||
82
harmony_assets/src/cli/download.rs
Normal file
82
harmony_assets/src/cli/download.rs
Normal file
@@ -0,0 +1,82 @@
|
||||
use clap::Parser;
|
||||
|
||||
// Arguments for `harmony_assets download`.
// NOTE: plain `//` comments on fields on purpose — `///` would become clap
// help text and change the CLI surface.
#[derive(Parser, Debug)]
pub struct DownloadArgs {
    // Remote URL to fetch.
    pub url: String,
    // Expected checksum of the downloaded file.
    pub checksum: String,
    // Output file name; defaults to the last URL path segment.
    #[arg(short, long)]
    pub output: Option<String>,
    // Checksum algorithm name.
    #[arg(short, long, default_value = "blake3")]
    pub algo: String,
}
|
||||
|
||||
pub async fn execute(args: DownloadArgs) -> Result<(), Box<dyn std::error::Error>> {
|
||||
use harmony_assets::{
|
||||
Asset, AssetStore, ChecksumAlgo, LocalCache, LocalStore, verify_checksum,
|
||||
};
|
||||
use indicatif::{ProgressBar, ProgressStyle};
|
||||
use url::Url;
|
||||
|
||||
let url = Url::parse(&args.url).map_err(|e| format!("Invalid URL: {}", e))?;
|
||||
|
||||
let file_name = args
|
||||
.output
|
||||
.or_else(|| {
|
||||
std::path::Path::new(&args.url)
|
||||
.file_name()
|
||||
.and_then(|n| n.to_str())
|
||||
.map(|s| s.to_string())
|
||||
})
|
||||
.unwrap_or_else(|| "download".to_string());
|
||||
|
||||
let algo = ChecksumAlgo::from_str(&args.algo)?;
|
||||
let asset = Asset::new(url, args.checksum.clone(), algo.clone(), file_name);
|
||||
|
||||
let cache = LocalCache::default();
|
||||
|
||||
println!("Downloading: {}", asset.url);
|
||||
println!("Checksum: {}:{}", algo.name(), args.checksum);
|
||||
println!("Cache dir: {:?}", cache.base_dir);
|
||||
|
||||
let total_size = asset.size.unwrap_or(0);
|
||||
let pb = if total_size > 0 {
|
||||
let pb = ProgressBar::new(total_size);
|
||||
pb.set_style(
|
||||
ProgressStyle::default_bar()
|
||||
.template("{spinner:.green} [{elapsed_precise}] [{bar:40}] {bytes}/{total_bytes} ({bytes_per_sec})")?
|
||||
.progress_chars("=>-"),
|
||||
);
|
||||
Some(pb)
|
||||
} else {
|
||||
None
|
||||
};
|
||||
|
||||
let progress_fn: Box<dyn Fn(u64, Option<u64>) + Send> = Box::new({
|
||||
let pb = pb.clone();
|
||||
move |bytes, _total| {
|
||||
if let Some(ref pb) = pb {
|
||||
pb.set_position(bytes);
|
||||
}
|
||||
}
|
||||
});
|
||||
|
||||
let store = LocalStore::default();
|
||||
let result = store.fetch(&asset, &cache, Some(progress_fn)).await;
|
||||
|
||||
if let Some(pb) = pb {
|
||||
pb.finish();
|
||||
}
|
||||
|
||||
match result {
|
||||
Ok(path) => {
|
||||
verify_checksum(&path, &args.checksum, algo).await?;
|
||||
println!("\nDownloaded to: {:?}", path);
|
||||
println!("Checksum verified OK");
|
||||
Ok(())
|
||||
}
|
||||
Err(e) => {
|
||||
eprintln!("Download failed: {}", e);
|
||||
std::process::exit(1);
|
||||
}
|
||||
}
|
||||
}
|
||||
49
harmony_assets/src/cli/mod.rs
Normal file
49
harmony_assets/src/cli/mod.rs
Normal file
@@ -0,0 +1,49 @@
|
||||
pub mod checksum;
|
||||
pub mod download;
|
||||
pub mod upload;
|
||||
pub mod verify;
|
||||
|
||||
use clap::{Parser, Subcommand};
|
||||
|
||||
// Top-level CLI definition; clap derives parsing from these types.
#[derive(Parser, Debug)]
#[command(
    name = "harmony_assets",
    version,
    about = "Asset management CLI for downloading, uploading, and verifying large binary assets"
)]
pub struct Cli {
    #[command(subcommand)]
    pub command: Commands,
}

// One variant per subcommand; each carries its own clap args struct.
#[derive(Subcommand, Debug)]
pub enum Commands {
    Upload(upload::UploadArgs),
    Download(download::DownloadArgs),
    Checksum(checksum::ChecksumArgs),
    Verify(verify::VerifyArgs),
}
|
||||
|
||||
#[tokio::main]
|
||||
async fn main() -> Result<(), Box<dyn std::error::Error>> {
|
||||
log::info!("Starting harmony_assets CLI");
|
||||
|
||||
let cli = Cli::parse();
|
||||
|
||||
match cli.command {
|
||||
Commands::Upload(args) => {
|
||||
upload::execute(args).await?;
|
||||
}
|
||||
Commands::Download(args) => {
|
||||
download::execute(args).await?;
|
||||
}
|
||||
Commands::Checksum(args) => {
|
||||
checksum::execute(args).await?;
|
||||
}
|
||||
Commands::Verify(args) => {
|
||||
verify::execute(args).await?;
|
||||
}
|
||||
}
|
||||
|
||||
Ok(())
|
||||
}
|
||||
166
harmony_assets/src/cli/upload.rs
Normal file
166
harmony_assets/src/cli/upload.rs
Normal file
@@ -0,0 +1,166 @@
|
||||
use clap::Parser;
|
||||
use harmony_assets::{S3Config, S3Store, checksum_for_path_with_progress};
|
||||
use indicatif::{ProgressBar, ProgressStyle};
|
||||
use std::path::Path;
|
||||
|
||||
#[derive(Parser, Debug)]
|
||||
pub struct UploadArgs {
|
||||
pub source: String,
|
||||
pub key: Option<String>,
|
||||
#[arg(short, long)]
|
||||
pub content_type: Option<String>,
|
||||
#[arg(short, long, default_value_t = true)]
|
||||
pub public_read: bool,
|
||||
#[arg(short, long)]
|
||||
pub endpoint: Option<String>,
|
||||
#[arg(short, long)]
|
||||
pub bucket: Option<String>,
|
||||
#[arg(short, long)]
|
||||
pub region: Option<String>,
|
||||
#[arg(short, long)]
|
||||
pub access_key_id: Option<String>,
|
||||
#[arg(short, long)]
|
||||
pub secret_access_key: Option<String>,
|
||||
#[arg(short, long, default_value = "blake3")]
|
||||
pub algo: String,
|
||||
#[arg(short, long, default_value_t = false)]
|
||||
pub yes: bool,
|
||||
}
|
||||
|
||||
pub async fn execute(args: UploadArgs) -> Result<(), Box<dyn std::error::Error>> {
|
||||
let source_path = Path::new(&args.source);
|
||||
if !source_path.exists() {
|
||||
eprintln!("Error: File not found: {}", args.source);
|
||||
std::process::exit(1);
|
||||
}
|
||||
|
||||
let key = args.key.unwrap_or_else(|| {
|
||||
source_path
|
||||
.file_name()
|
||||
.and_then(|n| n.to_str())
|
||||
.unwrap_or("upload")
|
||||
.to_string()
|
||||
});
|
||||
|
||||
let metadata = tokio::fs::metadata(source_path)
|
||||
.await
|
||||
.map_err(|e| format!("Failed to read file metadata: {}", e))?;
|
||||
let total_size = metadata.len();
|
||||
|
||||
let endpoint = args
|
||||
.endpoint
|
||||
.or_else(|| std::env::var("S3_ENDPOINT").ok())
|
||||
.unwrap_or_default();
|
||||
let bucket = args
|
||||
.bucket
|
||||
.or_else(|| std::env::var("S3_BUCKET").ok())
|
||||
.unwrap_or_else(|| {
|
||||
inquire::Text::new("S3 Bucket name:")
|
||||
.with_default("harmony-assets")
|
||||
.prompt()
|
||||
.unwrap()
|
||||
});
|
||||
let region = args
|
||||
.region
|
||||
.or_else(|| std::env::var("S3_REGION").ok())
|
||||
.unwrap_or_else(|| {
|
||||
inquire::Text::new("S3 Region:")
|
||||
.with_default("us-east-1")
|
||||
.prompt()
|
||||
.unwrap()
|
||||
});
|
||||
let access_key_id = args
|
||||
.access_key_id
|
||||
.or_else(|| std::env::var("AWS_ACCESS_KEY_ID").ok());
|
||||
let secret_access_key = args
|
||||
.secret_access_key
|
||||
.or_else(|| std::env::var("AWS_SECRET_ACCESS_KEY").ok());
|
||||
|
||||
let config = S3Config {
|
||||
endpoint: if endpoint.is_empty() {
|
||||
None
|
||||
} else {
|
||||
Some(endpoint)
|
||||
},
|
||||
bucket: bucket.clone(),
|
||||
region: region.clone(),
|
||||
access_key_id,
|
||||
secret_access_key,
|
||||
public_read: args.public_read,
|
||||
};
|
||||
|
||||
println!("Upload Configuration:");
|
||||
println!(" Source: {}", args.source);
|
||||
println!(" S3 Key: {}", key);
|
||||
println!(" Bucket: {}", bucket);
|
||||
println!(" Region: {}", region);
|
||||
println!(
|
||||
" Size: {} bytes ({} MB)",
|
||||
total_size,
|
||||
total_size as f64 / 1024.0 / 1024.0
|
||||
);
|
||||
println!();
|
||||
|
||||
if !args.yes {
|
||||
let confirm = inquire::Confirm::new("Proceed with upload?")
|
||||
.with_default(true)
|
||||
.prompt()?;
|
||||
if !confirm {
|
||||
println!("Upload cancelled.");
|
||||
return Ok(());
|
||||
}
|
||||
}
|
||||
|
||||
let store = S3Store::new(config)
|
||||
.await
|
||||
.map_err(|e| format!("Failed to initialize S3 client: {}", e))?;
|
||||
|
||||
println!("Computing checksum while uploading...\n");
|
||||
|
||||
let pb = ProgressBar::new(total_size);
|
||||
pb.set_style(
|
||||
ProgressStyle::default_bar()
|
||||
.template("{spinner:.green} [{elapsed_precise}] [{bar:40}] {bytes}/{total_bytes} ({bytes_per_sec})")?
|
||||
.progress_chars("=>-"),
|
||||
);
|
||||
|
||||
{
|
||||
let algo = harmony_assets::ChecksumAlgo::from_str(&args.algo)?;
|
||||
let rt = tokio::runtime::Handle::current();
|
||||
let pb_clone = pb.clone();
|
||||
let _checksum = rt.block_on(checksum_for_path_with_progress(
|
||||
source_path,
|
||||
algo,
|
||||
|read, _total| {
|
||||
pb_clone.set_position(read);
|
||||
},
|
||||
))?;
|
||||
}
|
||||
|
||||
pb.set_position(total_size);
|
||||
|
||||
let result = store
|
||||
.store(source_path, &key, args.content_type.as_deref())
|
||||
.await;
|
||||
|
||||
pb.finish();
|
||||
|
||||
match result {
|
||||
Ok(asset) => {
|
||||
println!("\nUpload complete!");
|
||||
println!(" URL: {}", asset.url);
|
||||
println!(
|
||||
" Checksum: {}:{}",
|
||||
asset.checksum_algo.name(),
|
||||
asset.checksum
|
||||
);
|
||||
println!(" Size: {} bytes", asset.size);
|
||||
println!(" Key: {}", asset.key);
|
||||
Ok(())
|
||||
}
|
||||
Err(e) => {
|
||||
eprintln!("Upload failed: {}", e);
|
||||
std::process::exit(1);
|
||||
}
|
||||
}
|
||||
}
|
||||
32
harmony_assets/src/cli/verify.rs
Normal file
32
harmony_assets/src/cli/verify.rs
Normal file
@@ -0,0 +1,32 @@
|
||||
use clap::Parser;
|
||||
|
||||
// Arguments for `harmony_assets verify`.
// Plain `//` comments on fields on purpose — `///` would become clap help
// text and change the CLI surface.
#[derive(Parser, Debug)]
pub struct VerifyArgs {
    // Local file to check.
    pub path: String,
    // Expected checksum to compare against.
    pub expected: String,
    // Checksum algorithm name.
    #[arg(short, long, default_value = "blake3")]
    pub algo: String,
}
|
||||
|
||||
pub async fn execute(args: VerifyArgs) -> Result<(), Box<dyn std::error::Error>> {
|
||||
use harmony_assets::{ChecksumAlgo, verify_checksum};
|
||||
|
||||
let path = std::path::Path::new(&args.path);
|
||||
if !path.exists() {
|
||||
eprintln!("Error: File not found: {}", args.path);
|
||||
std::process::exit(1);
|
||||
}
|
||||
|
||||
let algo = ChecksumAlgo::from_str(&args.algo)?;
|
||||
|
||||
match verify_checksum(path, &args.expected, algo).await {
|
||||
Ok(()) => {
|
||||
println!("Checksum verified OK");
|
||||
Ok(())
|
||||
}
|
||||
Err(e) => {
|
||||
eprintln!("Verification FAILED: {}", e);
|
||||
std::process::exit(1);
|
||||
}
|
||||
}
|
||||
}
|
||||
37
harmony_assets/src/errors.rs
Normal file
37
harmony_assets/src/errors.rs
Normal file
@@ -0,0 +1,37 @@
|
||||
use std::path::PathBuf;
|
||||
use thiserror::Error;
|
||||
|
||||
/// Unified error type for the `harmony_assets` crate.
///
/// Display text comes from `thiserror`'s `#[error(...)]` attributes; treat
/// those strings as part of the observable contract (logs/tests may match on
/// them) and do not edit them casually.
#[derive(Debug, Error)]
pub enum AssetError {
    /// The requested file does not exist on disk.
    #[error("File not found: {0}")]
    FileNotFound(PathBuf),

    /// The computed checksum did not match the expected value.
    #[error("Checksum mismatch for '{path}': expected {expected}, got {actual}")]
    ChecksumMismatch {
        path: PathBuf,
        expected: String,
        actual: String,
    },

    /// The named algorithm was not compiled in (its cargo feature is off).
    #[error("Checksum algorithm not available: {0}. Enable the corresponding feature flag.")]
    ChecksumAlgoNotAvailable(String),

    /// A download failed (network error, bad HTTP status, stream error).
    #[error("Download failed: {0}")]
    DownloadFailed(String),

    /// Error reported by the AWS S3 SDK (stringified).
    #[error("S3 error: {0}")]
    S3Error(String),

    /// Underlying filesystem I/O error; `#[from]` lets callers use `?` on
    /// `std::io::Error` directly.
    #[error("IO error: {0}")]
    IoError(#[from] std::io::Error),

    /// HTTP client error (present only with the `reqwest` feature).
    #[cfg(feature = "reqwest")]
    #[error("HTTP error: {0}")]
    HttpError(#[from] reqwest::Error),

    /// Generic store-layer failure.
    #[error("Store error: {0}")]
    StoreError(String),

    /// Invalid or incomplete configuration.
    #[error("Configuration error: {0}")]
    ConfigError(String),
}
|
||||
233
harmony_assets/src/hash.rs
Normal file
233
harmony_assets/src/hash.rs
Normal file
@@ -0,0 +1,233 @@
|
||||
use crate::errors::AssetError;
|
||||
use std::path::Path;
|
||||
|
||||
#[cfg(feature = "blake3")]
|
||||
use blake3::Hasher as B3Hasher;
|
||||
#[cfg(feature = "sha256")]
|
||||
use sha2::{Digest, Sha256};
|
||||
|
||||
#[derive(Debug, Clone)]
|
||||
pub enum ChecksumAlgo {
|
||||
BLAKE3,
|
||||
SHA256,
|
||||
}
|
||||
|
||||
impl Default for ChecksumAlgo {
|
||||
fn default() -> Self {
|
||||
#[cfg(feature = "blake3")]
|
||||
return ChecksumAlgo::BLAKE3;
|
||||
#[cfg(not(feature = "blake3"))]
|
||||
return ChecksumAlgo::SHA256;
|
||||
}
|
||||
}
|
||||
|
||||
impl ChecksumAlgo {
|
||||
pub fn name(&self) -> &'static str {
|
||||
match self {
|
||||
ChecksumAlgo::BLAKE3 => "blake3",
|
||||
ChecksumAlgo::SHA256 => "sha256",
|
||||
}
|
||||
}
|
||||
|
||||
pub fn from_str(s: &str) -> Result<Self, AssetError> {
|
||||
match s.to_lowercase().as_str() {
|
||||
"blake3" | "b3" => Ok(ChecksumAlgo::BLAKE3),
|
||||
"sha256" | "sha-256" => Ok(ChecksumAlgo::SHA256),
|
||||
_ => Err(AssetError::ChecksumAlgoNotAvailable(s.to_string())),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl std::fmt::Display for ChecksumAlgo {
|
||||
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
|
||||
write!(f, "{}", self.name())
|
||||
}
|
||||
}
|
||||
|
||||
pub async fn checksum_for_file<R>(reader: R, algo: ChecksumAlgo) -> Result<String, AssetError>
|
||||
where
|
||||
R: tokio::io::AsyncRead + Unpin,
|
||||
{
|
||||
match algo {
|
||||
#[cfg(feature = "blake3")]
|
||||
ChecksumAlgo::BLAKE3 => {
|
||||
let mut hasher = B3Hasher::new();
|
||||
let mut reader = reader;
|
||||
let mut buf = vec![0u8; 65536];
|
||||
loop {
|
||||
let n = tokio::io::AsyncReadExt::read(&mut reader, &mut buf).await?;
|
||||
if n == 0 {
|
||||
break;
|
||||
}
|
||||
hasher.update(&buf[..n]);
|
||||
}
|
||||
Ok(hasher.finalize().to_hex().to_string())
|
||||
}
|
||||
#[cfg(not(feature = "blake3"))]
|
||||
ChecksumAlgo::BLAKE3 => Err(AssetError::ChecksumAlgoNotAvailable("blake3".to_string())),
|
||||
#[cfg(feature = "sha256")]
|
||||
ChecksumAlgo::SHA256 => {
|
||||
let mut hasher = Sha256::new();
|
||||
let mut reader = reader;
|
||||
let mut buf = vec![0u8; 65536];
|
||||
loop {
|
||||
let n = tokio::io::AsyncReadExt::read(&mut reader, &mut buf).await?;
|
||||
if n == 0 {
|
||||
break;
|
||||
}
|
||||
hasher.update(&buf[..n]);
|
||||
}
|
||||
Ok(format!("{:x}", hasher.finalize()))
|
||||
}
|
||||
#[cfg(not(feature = "sha256"))]
|
||||
ChecksumAlgo::SHA256 => Err(AssetError::ChecksumAlgoNotAvailable("sha256".to_string())),
|
||||
}
|
||||
}
|
||||
|
||||
pub async fn checksum_for_path(path: &Path, algo: ChecksumAlgo) -> Result<String, AssetError> {
|
||||
let file = tokio::fs::File::open(path)
|
||||
.await
|
||||
.map_err(|e| AssetError::IoError(e))?;
|
||||
let reader = tokio::io::BufReader::with_capacity(65536, file);
|
||||
checksum_for_file(reader, algo).await
|
||||
}
|
||||
|
||||
pub async fn checksum_for_path_with_progress<F>(
|
||||
path: &Path,
|
||||
algo: ChecksumAlgo,
|
||||
mut progress: F,
|
||||
) -> Result<String, AssetError>
|
||||
where
|
||||
F: FnMut(u64, Option<u64>) + Send,
|
||||
{
|
||||
let file = tokio::fs::File::open(path)
|
||||
.await
|
||||
.map_err(|e| AssetError::IoError(e))?;
|
||||
let metadata = file.metadata().await.map_err(|e| AssetError::IoError(e))?;
|
||||
let total = Some(metadata.len());
|
||||
let reader = tokio::io::BufReader::with_capacity(65536, file);
|
||||
|
||||
match algo {
|
||||
#[cfg(feature = "blake3")]
|
||||
ChecksumAlgo::BLAKE3 => {
|
||||
let mut hasher = B3Hasher::new();
|
||||
let mut reader = reader;
|
||||
let mut buf = vec![0u8; 65536];
|
||||
let mut read: u64 = 0;
|
||||
loop {
|
||||
let n = tokio::io::AsyncReadExt::read(&mut reader, &mut buf).await?;
|
||||
if n == 0 {
|
||||
break;
|
||||
}
|
||||
hasher.update(&buf[..n]);
|
||||
read += n as u64;
|
||||
progress(read, total);
|
||||
}
|
||||
Ok(hasher.finalize().to_hex().to_string())
|
||||
}
|
||||
#[cfg(not(feature = "blake3"))]
|
||||
ChecksumAlgo::BLAKE3 => Err(AssetError::ChecksumAlgoNotAvailable("blake3".to_string())),
|
||||
#[cfg(feature = "sha256")]
|
||||
ChecksumAlgo::SHA256 => {
|
||||
let mut hasher = Sha256::new();
|
||||
let mut reader = reader;
|
||||
let mut buf = vec![0u8; 65536];
|
||||
let mut read: u64 = 0;
|
||||
loop {
|
||||
let n = tokio::io::AsyncReadExt::read(&mut reader, &mut buf).await?;
|
||||
if n == 0 {
|
||||
break;
|
||||
}
|
||||
hasher.update(&buf[..n]);
|
||||
read += n as u64;
|
||||
progress(read, total);
|
||||
}
|
||||
Ok(format!("{:x}", hasher.finalize()))
|
||||
}
|
||||
#[cfg(not(feature = "sha256"))]
|
||||
ChecksumAlgo::SHA256 => Err(AssetError::ChecksumAlgoNotAvailable("sha256".to_string())),
|
||||
}
|
||||
}
|
||||
|
||||
pub async fn verify_checksum(
|
||||
path: &Path,
|
||||
expected: &str,
|
||||
algo: ChecksumAlgo,
|
||||
) -> Result<(), AssetError> {
|
||||
let actual = checksum_for_path(path, algo).await?;
|
||||
let expected_clean = expected
|
||||
.trim_start_matches("blake3:")
|
||||
.trim_start_matches("sha256:")
|
||||
.trim_start_matches("b3:")
|
||||
.trim_start_matches("sha-256:");
|
||||
if actual != expected_clean {
|
||||
return Err(AssetError::ChecksumMismatch {
|
||||
path: path.to_path_buf(),
|
||||
expected: expected_clean.to_string(),
|
||||
actual,
|
||||
});
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
pub fn format_checksum(checksum: &str, algo: ChecksumAlgo) -> String {
|
||||
format!("{}:{}", algo.name(), checksum)
|
||||
}
|
||||
|
||||
#[cfg(test)]
mod tests {
    use super::*;
    use std::io::Write;
    use tempfile::NamedTempFile;

    /// Write `content` to a fresh temp file and return its handle.
    ///
    /// Plain synchronous I/O: the original helper was declared `async` but
    /// awaited nothing, forcing a pointless `.await` at every call site.
    fn create_temp_file(content: &[u8]) -> NamedTempFile {
        let mut file = NamedTempFile::new().unwrap();
        file.write_all(content).unwrap();
        file.flush().unwrap();
        file
    }

    #[tokio::test]
    async fn test_checksum_blake3() {
        let file = create_temp_file(b"hello world");
        let checksum = checksum_for_path(file.path(), ChecksumAlgo::BLAKE3)
            .await
            .unwrap();
        // Pinned BLAKE3 digest of b"hello world".
        assert_eq!(
            checksum,
            "d74981efa70a0c880b8d8c1985d075dbcbf679b99a5f9914e5aaf96b831a9e24"
        );
    }

    #[tokio::test]
    async fn test_verify_checksum_success() {
        let file = create_temp_file(b"hello world");
        let checksum = checksum_for_path(file.path(), ChecksumAlgo::BLAKE3)
            .await
            .unwrap();
        let result = verify_checksum(file.path(), &checksum, ChecksumAlgo::BLAKE3).await;
        assert!(result.is_ok());
    }

    #[tokio::test]
    async fn test_verify_checksum_failure() {
        let file = create_temp_file(b"hello world");
        let result = verify_checksum(
            file.path(),
            "blake3:0000000000000000000000000000000000000000000000000000000000000000",
            ChecksumAlgo::BLAKE3,
        )
        .await;
        assert!(matches!(result, Err(AssetError::ChecksumMismatch { .. })));
    }

    #[tokio::test]
    async fn test_checksum_with_prefix() {
        let file = create_temp_file(b"hello world");
        let checksum = checksum_for_path(file.path(), ChecksumAlgo::BLAKE3)
            .await
            .unwrap();
        let formatted = format_checksum(&checksum, ChecksumAlgo::BLAKE3);
        assert!(formatted.starts_with("blake3:"));
    }
}
|
||||
14
harmony_assets/src/lib.rs
Normal file
14
harmony_assets/src/lib.rs
Normal file
@@ -0,0 +1,14 @@
|
||||
pub mod asset;
|
||||
pub mod errors;
|
||||
pub mod hash;
|
||||
pub mod store;
|
||||
|
||||
pub use asset::{Asset, LocalCache, StoredAsset};
|
||||
pub use errors::AssetError;
|
||||
pub use hash::{ChecksumAlgo, checksum_for_path, checksum_for_path_with_progress, verify_checksum};
|
||||
pub use store::AssetStore;
|
||||
|
||||
#[cfg(feature = "s3")]
|
||||
pub use store::{S3Config, S3Store};
|
||||
|
||||
pub use store::local::LocalStore;
|
||||
137
harmony_assets/src/store/local.rs
Normal file
137
harmony_assets/src/store/local.rs
Normal file
@@ -0,0 +1,137 @@
|
||||
use crate::asset::{Asset, LocalCache};
|
||||
use crate::errors::AssetError;
|
||||
use crate::store::AssetStore;
|
||||
use async_trait::async_trait;
|
||||
use std::path::PathBuf;
|
||||
use url::Url;
|
||||
|
||||
#[cfg(feature = "reqwest")]
|
||||
use crate::hash::verify_checksum;
|
||||
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct LocalStore {
|
||||
base_dir: PathBuf,
|
||||
}
|
||||
|
||||
impl LocalStore {
|
||||
pub fn new(base_dir: PathBuf) -> Self {
|
||||
Self { base_dir }
|
||||
}
|
||||
|
||||
pub fn with_cache(cache: LocalCache) -> Self {
|
||||
Self {
|
||||
base_dir: cache.base_dir.clone(),
|
||||
}
|
||||
}
|
||||
|
||||
pub fn base_dir(&self) -> &PathBuf {
|
||||
&self.base_dir
|
||||
}
|
||||
}
|
||||
|
||||
impl Default for LocalStore {
|
||||
fn default() -> Self {
|
||||
Self::new(LocalCache::default().base_dir)
|
||||
}
|
||||
}
|
||||
|
||||
#[async_trait]
impl AssetStore for LocalStore {
    /// Fetch `asset` into `cache`, downloading it over HTTP when needed.
    ///
    /// Behavior:
    /// - cached file present and checksum-valid -> return its path;
    /// - cached file present but invalid -> delete it and re-download;
    /// - otherwise download from `asset.url`, verify, return the path.
    ///
    /// `progress` receives `(bytes_downloaded, content_length)` after each
    /// chunk; the total is `None` when the server sends no Content-Length.
    ///
    /// NOTE(review): the response streams straight into the final cache path,
    /// so an interrupted download leaves a corrupt file there. The checksum
    /// re-check on the next fetch catches it, but a temp-file + rename scheme
    /// would avoid exposing a half-written file in between — confirm whether
    /// concurrent readers of the cache exist.
    #[cfg(feature = "reqwest")]
    async fn fetch(
        &self,
        asset: &Asset,
        cache: &LocalCache,
        progress: Option<Box<dyn Fn(u64, Option<u64>) + Send>>,
    ) -> Result<PathBuf, AssetError> {
        use futures_util::StreamExt;

        let dest_path = cache.path_for(asset);

        // Cache hit: trust only a checksum-verified file.
        if dest_path.exists() {
            let verification =
                verify_checksum(&dest_path, &asset.checksum, asset.checksum_algo.clone()).await;
            if verification.is_ok() {
                log::debug!("Asset already cached at {:?}", dest_path);
                return Ok(dest_path);
            } else {
                log::warn!("Cached file failed checksum verification, re-downloading");
                tokio::fs::remove_file(&dest_path)
                    .await
                    .map_err(|e| AssetError::IoError(e))?;
            }
        }

        cache.ensure_dir(asset).await?;

        log::info!("Downloading asset from {}", asset.url);
        let client = reqwest::Client::new();
        let response = client
            .get(asset.url.as_str())
            .send()
            .await
            .map_err(|e| AssetError::DownloadFailed(e.to_string()))?;

        // Reject non-2xx early; otherwise an HTML error page would be
        // streamed into the cache and only caught at checksum time.
        if !response.status().is_success() {
            return Err(AssetError::DownloadFailed(format!(
                "HTTP {}: {}",
                response.status(),
                asset.url
            )));
        }

        let total_size = response.content_length();

        let mut file = tokio::fs::File::create(&dest_path)
            .await
            .map_err(|e| AssetError::IoError(e))?;

        // Stream chunk-by-chunk so large assets never sit fully in memory.
        let mut stream = response.bytes_stream();
        let mut downloaded: u64 = 0;

        while let Some(chunk_result) = stream.next().await {
            let chunk = chunk_result.map_err(|e| AssetError::DownloadFailed(e.to_string()))?;
            tokio::io::AsyncWriteExt::write_all(&mut file, &chunk)
                .await
                .map_err(|e| AssetError::IoError(e))?;
            downloaded += chunk.len() as u64;
            if let Some(ref p) = progress {
                p(downloaded, total_size);
            }
        }

        // Flush (and drop) before verification so the checksum pass sees
        // every byte on disk.
        tokio::io::AsyncWriteExt::flush(&mut file)
            .await
            .map_err(|e| AssetError::IoError(e))?;

        drop(file);

        verify_checksum(&dest_path, &asset.checksum, asset.checksum_algo.clone()).await?;

        log::info!("Asset downloaded and verified: {:?}", dest_path);
        Ok(dest_path)
    }

    /// Stub used when HTTP support is compiled out: always fails.
    #[cfg(not(feature = "reqwest"))]
    async fn fetch(
        &self,
        _asset: &Asset,
        _cache: &LocalCache,
        _progress: Option<Box<dyn Fn(u64, Option<u64>) + Send>>,
    ) -> Result<PathBuf, AssetError> {
        Err(AssetError::DownloadFailed(
            "HTTP downloads not available. Enable the 'reqwest' feature.".to_string(),
        ))
    }

    /// True when a file exists at `base_dir/key`.
    async fn exists(&self, key: &str) -> Result<bool, AssetError> {
        let path = self.base_dir.join(key);
        Ok(path.exists())
    }

    /// `file://` URL for `base_dir/key`.
    /// `Url::from_file_path` requires an absolute path, so this fails for a
    /// relative `base_dir`.
    fn url_for(&self, key: &str) -> Result<Url, AssetError> {
        let path = self.base_dir.join(key);
        Url::from_file_path(&path)
            .map_err(|_| AssetError::StoreError("Could not convert path to file URL".to_string()))
    }
}
|
||||
27
harmony_assets/src/store/mod.rs
Normal file
27
harmony_assets/src/store/mod.rs
Normal file
@@ -0,0 +1,27 @@
|
||||
use crate::asset::{Asset, LocalCache};
|
||||
use crate::errors::AssetError;
|
||||
use async_trait::async_trait;
|
||||
use std::path::PathBuf;
|
||||
use url::Url;
|
||||
|
||||
pub mod local;
|
||||
|
||||
#[cfg(feature = "s3")]
|
||||
pub mod s3;
|
||||
|
||||
/// Abstraction over asset backends (local filesystem, S3, ...).
#[async_trait]
pub trait AssetStore: Send + Sync {
    /// Materialize `asset` in the local `cache`, returning its on-disk path.
    ///
    /// `progress`, when provided, is called with `(bytes_done, total_bytes)`;
    /// `total_bytes` may be `None` when the backend does not know the size.
    async fn fetch(
        &self,
        asset: &Asset,
        cache: &LocalCache,
        progress: Option<Box<dyn Fn(u64, Option<u64>) + Send>>,
    ) -> Result<PathBuf, AssetError>;

    /// Whether an object exists under `key` in this store.
    async fn exists(&self, key: &str) -> Result<bool, AssetError>;

    /// Public/base URL for the object stored under `key`.
    fn url_for(&self, key: &str) -> Result<Url, AssetError>;
}
|
||||
|
||||
#[cfg(feature = "s3")]
|
||||
pub use s3::{S3Config, S3Store};
|
||||
235
harmony_assets/src/store/s3.rs
Normal file
235
harmony_assets/src/store/s3.rs
Normal file
@@ -0,0 +1,235 @@
|
||||
use crate::asset::StoredAsset;
|
||||
use crate::errors::AssetError;
|
||||
use crate::hash::ChecksumAlgo;
|
||||
use async_trait::async_trait;
|
||||
use aws_sdk_s3::Client as S3Client;
|
||||
use aws_sdk_s3::primitives::ByteStream;
|
||||
use aws_sdk_s3::types::ObjectCannedAcl;
|
||||
use std::path::Path;
|
||||
use url::Url;
|
||||
|
||||
/// Connection and addressing settings for an S3-compatible store.
#[derive(Debug, Clone)]
pub struct S3Config {
    /// Custom endpoint for S3-compatible services (MinIO, Ceph, ...);
    /// `None` means AWS's standard virtual-hosted URLs.
    pub endpoint: Option<String>,
    /// Target bucket name (empty by default — callers must set it).
    pub bucket: String,
    /// AWS region; currently only used to build public object URLs.
    pub region: String,
    // NOTE(review): `access_key_id` and `secret_access_key` are never read by
    // `S3Store::new` — it loads the default AWS credential chain and applies
    // only `endpoint`. Either wire these (and `region`) into the SDK builder
    // or remove the fields; as-is, explicit credentials are silently ignored.
    pub access_key_id: Option<String>,
    pub secret_access_key: Option<String>,
    /// When true, uploads get the `public-read` canned ACL.
    pub public_read: bool,
}

impl Default for S3Config {
    /// Defaults: no custom endpoint, empty bucket, region "us-east-1",
    /// default credential chain, public-read uploads.
    fn default() -> Self {
        Self {
            endpoint: None,
            bucket: String::new(),
            region: String::from("us-east-1"),
            access_key_id: None,
            secret_access_key: None,
            public_read: true,
        }
    }
}
|
||||
|
||||
/// S3-backed asset store; `Clone` shares the underlying SDK client.
#[derive(Debug, Clone)]
pub struct S3Store {
    client: S3Client,
    config: S3Config,
}

impl S3Store {
    /// Build an S3 client from `config`.
    ///
    /// Only `config.endpoint` is applied to the SDK; credentials come from
    /// the default AWS provider chain (env vars, profile, IMDS, ...).
    ///
    /// NOTE(review): `config.access_key_id`, `secret_access_key` and `region`
    /// are NOT passed to the SDK here — explicit credentials in `S3Config`
    /// are silently ignored and the region only feeds URL construction.
    /// Confirm whether these should be wired through the aws_config builder.
    pub async fn new(config: S3Config) -> Result<Self, AssetError> {
        let mut cfg_builder = aws_config::defaults(aws_config::BehaviorVersion::latest());

        if let Some(ref endpoint) = config.endpoint {
            cfg_builder = cfg_builder.endpoint_url(endpoint);
        }

        let cfg = cfg_builder.load().await;
        let client = S3Client::new(&cfg);

        Ok(Self { client, config })
    }

    /// The configuration this store was built with.
    pub fn config(&self) -> &S3Config {
        &self.config
    }

    /// Public URL for `key`: path-style for custom endpoints
    /// ("endpoint/bucket/key"), virtual-hosted for AWS
    /// ("https://bucket.s3.region.amazonaws.com/key").
    fn public_url(&self, key: &str) -> Result<Url, AssetError> {
        let url_str = if let Some(ref endpoint) = self.config.endpoint {
            format!(
                "{}/{}/{}",
                endpoint.trim_end_matches('/'),
                self.config.bucket,
                key
            )
        } else {
            format!(
                "https://{}.s3.{}.amazonaws.com/{}",
                self.config.bucket, self.config.region, key
            )
        };
        Url::parse(&url_str).map_err(|e| AssetError::S3Error(e.to_string()))
    }

    /// Upload the file at `source` under `key`, returning its public
    /// metadata (URL, checksum, size).
    ///
    /// The default-algorithm checksum is computed first — note this is a full
    /// extra read of the file before the upload — and attached as S3 object
    /// metadata under the "checksum" key.
    pub async fn store(
        &self,
        source: &Path,
        key: &str,
        content_type: Option<&str>,
    ) -> Result<StoredAsset, AssetError> {
        let metadata = tokio::fs::metadata(source)
            .await
            .map_err(|e| AssetError::IoError(e))?;
        let size = metadata.len();

        let checksum = crate::checksum_for_path(source, ChecksumAlgo::default())
            .await
            .map_err(|e| AssetError::StoreError(e.to_string()))?;

        let body = ByteStream::from_path(source).await.map_err(|e| {
            AssetError::IoError(std::io::Error::new(
                std::io::ErrorKind::Other,
                e.to_string(),
            ))
        })?;

        // NOTE(review): `size as i64` would silently wrap for > i64::MAX
        // bytes; practically unreachable, but `try_into()` would be explicit.
        let mut put_builder = self
            .client
            .put_object()
            .bucket(&self.config.bucket)
            .key(key)
            .body(body)
            .content_length(size as i64)
            .metadata("checksum", &checksum);

        if self.config.public_read {
            put_builder = put_builder.acl(ObjectCannedAcl::PublicRead);
        }

        if let Some(ct) = content_type {
            put_builder = put_builder.content_type(ct);
        }

        put_builder
            .send()
            .await
            .map_err(|e| AssetError::S3Error(e.to_string()))?;

        Ok(StoredAsset {
            url: self.public_url(key)?,
            checksum,
            checksum_algo: ChecksumAlgo::default(),
            size,
            key: key.to_string(),
        })
    }
}
|
||||
|
||||
use crate::store::AssetStore;
|
||||
use crate::{Asset, LocalCache};
|
||||
|
||||
#[async_trait]
|
||||
impl AssetStore for S3Store {
|
||||
async fn fetch(
|
||||
&self,
|
||||
asset: &Asset,
|
||||
cache: &LocalCache,
|
||||
progress: Option<Box<dyn Fn(u64, Option<u64>) + Send>>,
|
||||
) -> Result<std::path::PathBuf, AssetError> {
|
||||
let dest_path = cache.path_for(asset);
|
||||
|
||||
if dest_path.exists() {
|
||||
let verification =
|
||||
crate::verify_checksum(&dest_path, &asset.checksum, asset.checksum_algo.clone())
|
||||
.await;
|
||||
if verification.is_ok() {
|
||||
log::debug!("Asset already cached at {:?}", dest_path);
|
||||
return Ok(dest_path);
|
||||
}
|
||||
}
|
||||
|
||||
cache.ensure_dir(asset).await?;
|
||||
|
||||
log::info!(
|
||||
"Downloading asset from s3://{}/{}",
|
||||
self.config.bucket,
|
||||
asset.url
|
||||
);
|
||||
|
||||
let key = extract_s3_key(&asset.url, &self.config.bucket)?;
|
||||
let obj = self
|
||||
.client
|
||||
.get_object()
|
||||
.bucket(&self.config.bucket)
|
||||
.key(&key)
|
||||
.send()
|
||||
.await
|
||||
.map_err(|e| AssetError::S3Error(e.to_string()))?;
|
||||
|
||||
let total_size = obj.content_length.unwrap_or(0) as u64;
|
||||
let mut file = tokio::fs::File::create(&dest_path)
|
||||
.await
|
||||
.map_err(|e| AssetError::IoError(e))?;
|
||||
|
||||
let mut stream = obj.body;
|
||||
let mut downloaded: u64 = 0;
|
||||
|
||||
while let Some(chunk_result) = stream.next().await {
|
||||
let chunk = chunk_result.map_err(|e| AssetError::S3Error(e.to_string()))?;
|
||||
tokio::io::AsyncWriteExt::write_all(&mut file, &chunk)
|
||||
.await
|
||||
.map_err(|e| AssetError::IoError(e))?;
|
||||
downloaded += chunk.len() as u64;
|
||||
if let Some(ref p) = progress {
|
||||
p(downloaded, Some(total_size));
|
||||
}
|
||||
}
|
||||
|
||||
tokio::io::AsyncWriteExt::flush(&mut file)
|
||||
.await
|
||||
.map_err(|e| AssetError::IoError(e))?;
|
||||
|
||||
drop(file);
|
||||
|
||||
crate::verify_checksum(&dest_path, &asset.checksum, asset.checksum_algo.clone()).await?;
|
||||
|
||||
Ok(dest_path)
|
||||
}
|
||||
|
||||
async fn exists(&self, key: &str) -> Result<bool, AssetError> {
|
||||
match self
|
||||
.client
|
||||
.head_object()
|
||||
.bucket(&self.config.bucket)
|
||||
.key(key)
|
||||
.send()
|
||||
.await
|
||||
{
|
||||
Ok(_) => Ok(true),
|
||||
Err(e) => {
|
||||
let err_str = e.to_string();
|
||||
if err_str.contains("NoSuchKey") || err_str.contains("NotFound") {
|
||||
Ok(false)
|
||||
} else {
|
||||
Err(AssetError::S3Error(err_str))
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
fn url_for(&self, key: &str) -> Result<Url, AssetError> {
|
||||
self.public_url(key)
|
||||
}
|
||||
}
|
||||
|
||||
/// Derive the S3 object key from an asset URL.
///
/// Path-style URLs (".../bucket/key") have the bucket segment stripped; a
/// path equal to the bare bucket yields an empty key; any other layout
/// (e.g. virtual-hosted, where the path is already the key) is used verbatim
/// minus the leading '/'.
fn extract_s3_key(url: &Url, bucket: &str) -> Result<String, AssetError> {
    let path = url.path().trim_start_matches('/');
    let bucket_prefix = format!("{}/", bucket);
    let key = match path.strip_prefix(&bucket_prefix) {
        Some(rest) => rest,
        None if path == bucket => "",
        None => path,
    };
    Ok(key.to_string())
}
|
||||
@@ -216,15 +216,6 @@ mod tests {
|
||||
const KEY: &'static str = "TestConfig";
|
||||
}
|
||||
|
||||
// A second config type distinct from TestConfig — presumably used to exercise
// lookups under multiple independent keys; confirm against the test bodies
// (not fully visible in this hunk).
#[derive(Debug, Clone, Serialize, Deserialize, JsonSchema, PartialEq)]
struct AnotherTestConfig {
    value: String,
}

impl Config for AnotherTestConfig {
    // Storage key under which this config is persisted and looked up.
    const KEY: &'static str = "AnotherTestConfig";
}
|
||||
|
||||
struct MockSource {
|
||||
data: std::sync::Mutex<std::collections::HashMap<String, serde_json::Value>>,
|
||||
get_count: AtomicUsize,
|
||||
|
||||
@@ -1,11 +1,8 @@
|
||||
use async_trait::async_trait;
|
||||
use std::sync::Arc;
|
||||
use tokio::sync::Mutex;
|
||||
|
||||
use crate::{ConfigError, ConfigSource};
|
||||
|
||||
static PROMPT_MUTEX: Mutex<()> = Mutex::const_new(());
|
||||
|
||||
pub struct PromptSource {
|
||||
#[allow(dead_code)]
|
||||
writer: Option<Arc<dyn std::io::Write + Send + Sync>>,
|
||||
@@ -40,11 +37,3 @@ impl ConfigSource for PromptSource {
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
|
||||
/// Await the future `f` while holding the global prompt mutex.
///
/// Serializes callers so concurrent prompt interactions don't interleave.
/// The guard is held across `f`'s await points, which is why `PROMPT_MUTEX`
/// is a `tokio::sync::Mutex` — a `std::sync::Mutex` guard must not be held
/// across an `.await`.
pub async fn with_prompt_lock<F, T>(f: F) -> Result<T, ConfigError>
where
    F: std::future::Future<Output = Result<T, ConfigError>>,
{
    let _guard = PROMPT_MUTEX.lock().await;
    f.await
}
|
||||
|
||||
17
opencode.json
Normal file
17
opencode.json
Normal file
@@ -0,0 +1,17 @@
|
||||
{
|
||||
"$schema": "https://opencode.ai/config.json",
|
||||
"provider": {
|
||||
"ollama": {
|
||||
"npm": "@ai-sdk/openai-compatible",
|
||||
"name": "Ollama sto1",
|
||||
"options": {
|
||||
"baseURL": "http://192.168.55.132:11434/v1"
|
||||
},
|
||||
"models": {
|
||||
"qwen3-coder-next:q4_K_M": {
|
||||
"name": "qwen3-coder-next:q4_K_M"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
Reference in New Issue
Block a user