harmony/docs/coding-guide.md

# Harmony Coding Guide

Harmony is an infrastructure automation framework. It is **code-first and code-only**: operators write Rust programs to declare and drive infrastructure, rather than YAML files or DSL configs. Good code here means a good operator experience.

### Concrete context

We use here the context of the KVM module to explain the coding style. This will make it very easy to understand and should translate quite well to other modules/contexts managed by Harmony like OPNSense and Kubernetes.

## Core Philosophy

### The Careful Craftsman Principle

Harmony is a powerful framework that does a lot. With that power comes responsibility. Every abstraction, every trait, every module must earn its place. Before adding anything, ask:

1. **Does this solve a real problem users have?** Not a theoretical problem, an actual one encountered in production.
2. **Is this the simplest solution that works?** Complexity is a cost that compounds over time.
3. **Will this make the next developer's life easier or harder?** Code is read far more often than written.

When in doubt, don't abstract. Wait for the pattern to emerge from real usage. A little duplication is better than the wrong abstraction.

### High-level functions over raw primitives

Callers should not need to know about underlying protocols, XML schemas, or API quirks. A function that deploys a VM should accept meaningful parameters like CPU count, memory, and network name — not XML strings.

```rust
// Bad: caller constructs XML and passes it to a thin wrapper
let xml = format!(r#"<domain type='kvm'>...</domain>"#, name, memory_kb, ...);
executor.create_vm(&xml).await?;

// Good: caller describes intent, the module handles representation
executor.define_vm(&VmConfig::builder("my-vm")
    .cpu(4)
    .memory_gb(8)
    .disk(DiskConfig::new(50))
    .network(NetworkRef::named("mylan"))
    .boot_order([BootDevice::Network, BootDevice::Disk])
    .build())
    .await?;
```

The module owns the XML, the virsh invocations, the API calls — not the caller.

### Use the right abstraction layer

Prefer native library bindings over shelling out to CLI tools. The `virt` crate provides direct libvirt bindings and should be used instead of spawning `virsh` subprocesses.

- CLI subprocess calls are fragile: stdout/stderr parsing, exit codes, quoting, PATH differences
- Native bindings give typed errors, no temp files, no shell escaping
- `virt::connect::Connect` opens a connection; `virt::domain::Domain` manages VMs; `virt::network::Network` manages virtual networks

### Keep functions small and well-named

Each function should do one thing. If a function is doing two conceptually separate things, split it. Function names should read like plain English: `ensure_network_active`, `define_vm`, `vm_is_running`.

### Prefer short modules over large files

Group related types and functions by concept. A module that handles one resource (e.g., network, domain, storage) is better than a single file for everything.

---

## Error Handling

### Use `thiserror` for all error types

Define error types with `thiserror::Error`. This removes the boilerplate of implementing `Display` and `std::error::Error` by hand, keeps error messages close to their variants, and makes types easy to extend.

```rust
// Bad: hand-rolled Display + std::error::Error
#[derive(Debug)]
pub enum KVMError {
    ConnectionError(String),
    VMNotFound(String),
}

impl std::fmt::Display for KVMError { ... }
impl std::error::Error for KVMError {}

// Good: derive Display via thiserror
#[derive(thiserror::Error, Debug)]
pub enum KVMError {
    #[error("connection failed: {0}")]
    ConnectionFailed(String),
    #[error("VM not found: {name}")]
    VmNotFound { name: String },
}
```

### Make bubbling errors easy with `?` and `From`

`?` works on any error type for which there is a `From` impl. Add `From` conversions from lower-level errors into your module's error type so callers can use `?` without boilerplate.

With `thiserror`, wrapping a foreign error is one line:

```rust
#[derive(thiserror::Error, Debug)]
pub enum KVMError {
    #[error("libvirt error: {0}")]
    Libvirt(#[from] virt::error::Error),

    #[error("IO error: {0}")]
    Io(#[from] std::io::Error),
}
```

This means a call that returns `virt::error::Error` can be `?`-propagated into a `Result<_, KVMError>` without any `.map_err(...)`.

### Typed errors over stringly-typed errors

Avoid `Box<dyn Error>` or `String` as error return types in library code. Callers need to distinguish errors programmatically — `KVMError::VmAlreadyExists` is actionable, `"VM already exists: foo"` as a `String` is not.

At binary entry points (e.g., `main`) it is acceptable to convert to `String` or `anyhow::Error` for display.

---

## Logging

### Use the `log` crate macros

All log output must go through the `log` crate. Never use `println!`, `eprintln!`, or `dbg!` in library code. This makes output compatible with any logging backend (env_logger, tracing, structured logging, etc.).

```rust
// Bad
println!("Creating VM: {}", name);

// Good
use log::{info, debug, warn};
info!("Creating VM: {name}");
debug!("VM XML:\n{xml}");
warn!("Network already active, skipping creation");
```

Use the right level:

| Level   | When to use |
|---------|-------------|
| `error` | Unrecoverable failures (before returning Err) |
| `warn`  | Recoverable issues, skipped steps |
| `info`  | High-level progress events visible in normal operation |
| `debug` | Detailed operational info useful for debugging |
| `trace` | Very granular, per-iteration or per-call data |

Log before significant operations and after unexpected conditions. Do not log inside tight loops at `info` level.

---

## Types and Builders

### Derive `Serialize` on all public domain types

All public structs and enums that represent configuration or state should derive `serde::Serialize`. Add `Deserialize` when round-trip serialization is needed.

### Builder pattern for complex configs

When a type has more than three fields or optional fields, provide a builder. The builder pattern allows named, incremental construction without positional arguments.

```rust
let config = VmConfig::builder("bootstrap")
    .cpu(4)
    .memory_gb(8)
    .disk(DiskConfig::new(50).labeled("os"))
    .disk(DiskConfig::new(100).labeled("data"))
    .network(NetworkRef::named("harmonylan"))
    .boot_order([BootDevice::Network, BootDevice::Disk])
    .build();
```

### Avoid `pub` fields on config structs

Expose data through methods or the builder, not raw field access. This preserves the ability to validate, rename, or change representation without breaking callers.

---

## Async

### Use `tokio` for all async runtime needs

All async code runs on tokio. Use `tokio::spawn`, `tokio::time`, etc. Use `#[async_trait]` for traits with async methods.

### No blocking in async context

Never call blocking I/O (file I/O, network, process spawn) directly in an async function. Use `tokio::fs`, `tokio::process`, or `tokio::task::spawn_blocking` as appropriate.

---

## Module Structure

### Follow the `Score` / `Interpret` pattern

Modules that represent deployable infrastructure should implement `Score<T: Topology>` and `Interpret<T>`:

- `Score` is the serializable, clonable configuration declaring *what* to deploy
- `Interpret` does the actual work when `execute()` is called

```rust
pub struct KvmScore {
    network: NetworkConfig,
    vms: Vec<VmConfig>,
}

impl<T: Topology + KvmHost> Score<T> for KvmScore {
    fn create_interpret(&self) -> Box<dyn Interpret<T>> {
        Box::new(KvmInterpret::new(self.clone()))
    }
    fn name(&self) -> String { "KvmScore".to_string() }
}
```

### Flatten the public API in `mod.rs`

Internal submodules are implementation detail. Re-export what callers need at the module root:

```rust
// modules/kvm/mod.rs
mod connection;
mod domain;
mod network;
mod error;
mod xml;

pub use connection::KvmConnection;
pub use domain::{VmConfig, VmConfigBuilder, VmStatus, DiskConfig, BootDevice};
pub use error::KvmError;
pub use network::NetworkConfig;
```

---

## Commit Style

Follow [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/):

```
feat(kvm): add network isolation support
fix(kvm): correct memory unit conversion for libvirt
refactor(kvm): replace virsh subprocess calls with virt crate bindings
docs: add coding guide
```

Keep pull requests small and single-purpose (under ~200 lines excluding generated code). Do not mix refactoring, bug fixes, and new features in one PR.

---

## When to Add Abstractions

Harmony provides powerful abstraction mechanisms: traits, generics, the Score/Interpret pattern, and capabilities. Use them judiciously.

### Add an abstraction when:

- **You have three or more concrete implementations** doing the same thing. Two is often coincidence; three is a pattern.
- **The abstraction provides compile-time safety** that prevents real bugs (e.g., capability bounds on topologies).
- **The abstraction hides genuine complexity** that callers shouldn't need to understand (e.g., XML schema generation for libvirt).

### Don't add an abstraction when:

- **It's just to avoid a few lines of boilerplate**. Copy-paste is sometimes better than a trait hierarchy.
- **You're anticipating future flexibility** that isn't needed today. YAGNI (You Aren't Gonna Need It).
- **The abstraction makes the code harder to understand** for someone unfamiliar with the codebase.
- **You're wrapping a single implementation**. A trait with one implementation is usually over-engineering.

### Signs you've over-abstracted:

- You need to explain the type system to a competent Rust developer for them to understand how to add a simple feature.
- Adding a new concrete type requires changes in multiple trait definitions.
- The word "factory" or "manager" appears in your type names.
- You have more trait definitions than concrete implementations.

### The Rule of Three for Traits

Before creating a new trait, ensure you have:

1. A clear, real use case (not hypothetical)
2. At least one concrete implementation
3. A plan for how callers will use it

Only generalize when the pattern is proven. The monitoring module is a good example: we had multiple alert senders (OKD, KubePrometheus, RHOB) before we introduced the `AlertSender` and `AlertReceiver<S>` traits. The traits emerged from real needs, not design sessions.

---

## Documentation

### Document the "why", not the "what"

Code should be self-explanatory for the "what". Comments and documentation should explain intent, rationale, and gotchas.

```rust
// Bad: restates the code
// Returns the number of VMs
fn vm_count(&self) -> usize { self.vms.len() }

// Good: explains the why
// Returns 0 if connection is lost, rather than erroring,
// because monitoring code uses this for health checks
fn vm_count(&self) -> usize { self.vms.len() }
```

### Keep examples in the `examples/` directory

Working code beats documentation. Every major feature should have a runnable example that demonstrates real usage.