Compare commits: fix/refact...feat/clust (9 commits)

Commits: 18d8ba2210, c5b292d99b, 0258b31fd2, 4407792bd5, 7978a63004, 58d00c95bb, 7d14f7646c, 69dd763d6e, 2e46ac3418

---

CI_and_testing_harmony_analysis.md (new file, 548 lines) @@ -0,0 +1,548 @@
# CI and Testing Strategy for Harmony

## Executive Summary

Harmony aims to become a CNCF project, requiring a robust CI pipeline that demonstrates real-world reliability. The goal is to run **all examples** in CI, from simple k3d deployments to full HA OKD clusters on bare metal. This document provides context for designing and implementing this testing infrastructure.

---

## Project Context

### What is Harmony?

Harmony is an infrastructure automation framework that is **code-first and code-only**. Operators write Rust programs to declare and drive infrastructure, rather than YAML files or DSL configs. Key differentiators:

1. **Compile-time safety**: The type system prevents "config-is-valid-but-platform-is-wrong" errors
2. **Topology abstraction**: Write once, deploy to any environment (local k3d, OKD, bare metal, cloud)
3. **Capability-based design**: Scores declare what they need; topologies provide what they have

### Core Abstractions

| Concept | Description |
|---------|-------------|
| **Score** | Declarative description of desired state (the "what") |
| **Topology** | Logical representation of infrastructure (the "where") |
| **Capability** | A feature a topology offers (the "how") |
| **Interpret** | Execution logic connecting Score to Topology |
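The capability-based design above can be sketched with plain Rust traits. The trait and type names below (`HelmCommand`, `HelmScore`, `LocalCluster`) are illustrative stand-ins, not Harmony's actual API; the point is how a trait bound turns a missing capability into a compile error.

```rust
// Hypothetical sketch of the Score/Topology/Capability pattern.
// Names are illustrative, not Harmony's real API.

/// A logical infrastructure target (the "where").
trait Topology {
    fn name(&self) -> String;
}

/// A capability a topology can offer (the "how").
trait HelmCommand {
    fn helm_install(&self, chart: &str) -> String;
}

/// A desired state (the "what"), only usable on topologies that carry
/// the capabilities it requires -- enforced at compile time.
trait Score<T: Topology> {
    fn interpret(&self, topology: &T) -> String;
}

struct LocalCluster;

impl Topology for LocalCluster {
    fn name(&self) -> String {
        "local".into()
    }
}

impl HelmCommand for LocalCluster {
    fn helm_install(&self, chart: &str) -> String {
        format!("helm install {chart} on {}", self.name())
    }
}

struct HelmScore {
    chart: String,
}

// The trait bound is the compile-time check: a topology type that does not
// implement HelmCommand simply cannot be used with HelmScore.
impl<T: Topology + HelmCommand> Score<T> for HelmScore {
    fn interpret(&self, topology: &T) -> String {
        topology.helm_install(&self.chart)
    }
}
```

A topology without `HelmCommand` would fail the `interpret` call at compile time, which is exactly the property the compile-fail test tier exercises.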
### Compile-Time Verification

```rust
// This compiles only if the topology T provides K8sclient + HelmCommand
impl<T: Topology + K8sclient + HelmCommand> Score<T> for MyScore { ... }

// K8sResourceScore requires the K8sclient capability...
impl<T: Topology + K8sclient> Score<T> for K8sResourceScore { ... }

// ...so using it with LinuxHostTopology FAILS to compile
// (intentionally broken usage, kept as a compile-fail test case):
// error: LinuxHostTopology does not implement K8sclient
```

---
## Current Examples Inventory

### Summary Statistics

| Category | Count | CI Complexity |
|----------|-------|---------------|
| k3d-compatible | 22 | Low - single k3d cluster |
| OKD-specific | 4 | Medium - requires OKD cluster |
| Bare metal | 5 | High - requires physical infra or nested virtualization |
| Multi-cluster | 3 | High - requires multiple K8s clusters |
| No infra needed | 4 | Trivial - local only |
### Detailed Example Classification

#### Tier 1: k3d-Compatible (22 examples)

Can run on a local k3d cluster with minimal setup:

| Example | Topology | Capabilities | Special Notes |
|---------|----------|--------------|---------------|
| zitadel | K8sAnywhereTopology | K8sClient, HelmCommand | SSO/Identity |
| node_health | K8sAnywhereTopology | K8sClient | Health checks |
| public_postgres | K8sAnywhereTopology | K8sClient, HelmCommand, TlsRouter | Needs ingress |
| openbao | K8sAnywhereTopology | K8sClient, HelmCommand | Vault alternative |
| rust | K8sAnywhereTopology | K8sClient, HelmCommand, TlsRouter | Webapp deployment |
| cert_manager | K8sAnywhereTopology | K8sClient, CertificateManagement | TLS certificates |
| try_rust_webapp | K8sAnywhereTopology | K8sClient, HelmCommand, TlsRouter | Full webapp |
| monitoring | K8sAnywhereTopology | K8sClient, HelmCommand, Observability | Prometheus |
| application_monitoring_with_tenant | K8sAnywhereTopology | K8sClient, HelmCommand, TenantManager, Observability | Multi-tenant |
| monitoring_with_tenant | K8sAnywhereTopology | K8sClient, HelmCommand, TenantManager, Observability | Multi-tenant |
| postgresql | K8sAnywhereTopology | K8sClient, HelmCommand | CloudNativePG |
| ntfy | K8sAnywhereTopology | K8sClient, HelmCommand | Notifications |
| tenant | K8sAnywhereTopology | K8sClient, TenantManager | Namespace isolation |
| lamp | K8sAnywhereTopology | K8sClient, HelmCommand, TlsRouter | LAMP stack |
| k8s_drain_node | K8sAnywhereTopology | K8sClient | Node operations |
| k8s_write_file_on_node | K8sAnywhereTopology | K8sClient | Node operations |
| remove_rook_osd | K8sAnywhereTopology | K8sClient | Ceph operations |
| validate_ceph_cluster_health | K8sAnywhereTopology | K8sClient | Ceph health |
| kube-rs | Direct kube | K8sClient | Raw kube-rs demo |
| brocade_snmp_server | K8sAnywhereTopology | K8sClient | SNMP collector |
| harmony_inventory_builder | LocalhostTopology | None | Network scanning |
| cli | LocalhostTopology | None | CLI demo |
#### Tier 2: OKD/OpenShift-Specific (4 examples)

Require OKD/OpenShift features not available in vanilla K8s:

| Example | Topology | OKD-Specific Feature |
|---------|----------|----------------------|
| okd_cluster_alerts | K8sAnywhereTopology | OpenShift Monitoring CRDs |
| operatorhub_catalog | K8sAnywhereTopology | OpenShift OperatorHub |
| rhob_application_monitoring | K8sAnywhereTopology | RHOB (Red Hat Observability) |
| nats-supercluster | K8sAnywhereTopology | OKD Routes (OpenShift Ingress) |
#### Tier 3: Bare Metal Infrastructure (5 examples)

Require physical hardware or full virtualization:

| Example | Topology | Physical Requirements |
|---------|----------|-----------------------|
| okd_installation | HAClusterTopology | OPNSense, Brocade switch, PXE boot, 3+ nodes |
| okd_pxe | HAClusterTopology | OPNSense, Brocade switch, PXE infrastructure |
| sttest | HAClusterTopology | Full HA cluster with all network services |
| opnsense | OPNSenseFirewall | OPNSense firewall access |
| opnsense_node_exporter | Custom | OPNSense firewall |
#### Tier 4: Multi-Cluster (3 examples)

Require multiple K8s clusters:

| Example | Topology | Clusters Required |
|---------|----------|-------------------|
| nats | K8sAnywhereTopology × 2 | 2 clusters with NATS gateways |
| nats-module | DecentralizedTopology | 3 clusters for supercluster |
| multisite_postgres | FailoverTopology | 2 clusters for replication |

---
## Testing Categories

### 1. Compile-Time Tests

These tests verify that the type system correctly rejects invalid configurations:
```rust
// Should NOT compile - K8sResourceScore on LinuxHostTopology.
// (Rust has no built-in #[compile_fail] attribute; in practice this case
// lives in its own file and is checked by a tool such as trybuild.)
fn test_k8s_score_on_linux_host() {
    let score = K8sResourceScore::new();
    let topology = LinuxHostTopology::new();
    // This line should fail to compile
    harmony_cli::run(Inventory::empty(), topology, vec![Box::new(score)], None);
}

// Should compile - K8sResourceScore on K8sAnywhereTopology
#[test]
fn test_k8s_score_on_k8s_topology() {
    let score = K8sResourceScore::new();
    let topology = K8sAnywhereTopology::from_env();
    // This should compile
    harmony_cli::run(Inventory::empty(), topology, vec![Box::new(score)], None);
}
```

**Implementation Options:**
- `trybuild` crate for compile-time failure tests
- Separate `tests/compile_fail/` directory with expected error messages
### 2. Unit Tests

Pure Rust logic without external dependencies:
- Score serialization/deserialization
- Inventory parsing
- Type conversions
- CRD generation

**Requirements:**
- No external services
- Sub-second execution
- Run on every PR
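A unit test in this category might look like the following parser check. The `name=addr` inventory line format and the `HostEntry` type are hypothetical, chosen only to show the shape of a dependency-free, sub-second test; Harmony's real inventory format may differ.

```rust
// Hypothetical unit under test: parse "name = addr" inventory lines.
// Format and types are illustrative, not Harmony's actual inventory schema.

#[derive(Debug, PartialEq)]
struct HostEntry {
    name: String,
    addr: String,
}

fn parse_inventory_line(line: &str) -> Option<HostEntry> {
    let line = line.trim();
    // Skip blank lines and comments
    if line.is_empty() || line.starts_with('#') {
        return None;
    }
    // Expect exactly one '=' separating name and address
    let (name, addr) = line.split_once('=')?;
    Some(HostEntry {
        name: name.trim().to_string(),
        addr: addr.trim().to_string(),
    })
}
```

Tests like this need no cluster, no network, and no setup, which is what keeps this tier under a minute.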
### 3. Integration Tests (k3d)

Deploy to a local k3d cluster:

**Setup:**
```bash
# Install k3d
curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash

# Create cluster
k3d cluster create harmony-test \
  --agents 3 \
  --k3s-arg "--disable=traefik@server:0"

# Wait for ready
kubectl wait --for=condition=Ready nodes --all --timeout=120s
```
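In the test runner itself, the setup above would be driven from Rust. A sketch that keeps the argument construction pure (and therefore unit-testable without Docker) is shown below; the cluster name and helper names are assumptions, not an existing Harmony API.

```rust
use std::process::Command;

// Builds the k3d arguments mirroring the setup steps above. Pure function,
// so it can be verified without Docker or k3d installed.
fn k3d_create_args(name: &str, agents: u32) -> Vec<String> {
    vec![
        "cluster".into(),
        "create".into(),
        name.into(),
        "--agents".into(),
        agents.to_string(),
        "--k3s-arg".into(),
        "--disable=traefik@server:0".into(),
    ]
}

// Thin wrapper that actually shells out; only meaningful where k3d exists.
#[allow(dead_code)]
fn k3d_create(name: &str, agents: u32) -> std::io::Result<std::process::ExitStatus> {
    Command::new("k3d").args(k3d_create_args(name, agents)).status()
}
```

Separating "build the command" from "run the command" lets the fast unit tier cover most of this code path.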
**Test Matrix:**

| Example | k3d | Test Type |
|---------|-----|-----------|
| zitadel | ✅ | Deploy + health check |
| cert_manager | ✅ | Deploy + certificate issuance |
| monitoring | ✅ | Deploy + metric collection |
| postgresql | ✅ | Deploy + database connectivity |
| tenant | ✅ | Namespace creation + isolation |
### 4. Integration Tests (OKD)

Deploy to an OKD/OpenShift cluster:

**Options:**
1. **Nested virtualization**: Run OKD in VMs (slow, expensive)
2. **CRC (CodeReady Containers)**: Single-node OKD (resource intensive)
3. **Managed OpenShift**: AWS/Azure/GCP (costly)
4. **Existing cluster**: Connect to pre-provisioned cluster (fastest)

**Test Matrix:**

| Example | OKD Required | Test Type |
|---------|--------------|-----------|
| okd_cluster_alerts | ✅ | Alert rule deployment |
| rhob_application_monitoring | ✅ | RHOB stack deployment |
| operatorhub_catalog | ✅ | Operator installation |
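Option 4 (connecting to an existing cluster) needs a sanity check that the target really is OKD/OpenShift before running OKD-specific examples. One cheap signal is the presence of OpenShift-only API groups in `kubectl api-resources` output. The group names below are real OpenShift API groups; the helper itself is an illustrative sketch, not existing Harmony code.

```rust
// Cheap OpenShift/OKD detection: scan `kubectl api-resources` output for
// API groups that only exist on OpenShift-family clusters.
fn looks_like_openshift(api_resources_output: &str) -> bool {
    ["route.openshift.io", "config.openshift.io"]
        .iter()
        .any(|group| api_resources_output.contains(group))
}
```

Failing fast here turns a confusing mid-run deployment error into a clear "wrong cluster" message.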
### 5. End-to-End Tests (Full Infrastructure)

Complete infrastructure deployment including bare metal:

**Options:**
1. **Libvirt + KVM**: Virtual machines on CI runner
2. **Nested KVM**: KVM inside KVM (for cloud CI)
3. **Dedicated hardware**: Physical test lab
4. **Mock/Hybrid**: Mock physical components, real K8s

---
## CI Environment Options

### Option A: GitHub Actions (Current Standard)

**Pros:**
- Native GitHub integration
- Large runner ecosystem
- Free for open source

**Cons:**
- Limited nested virtualization support
- 6-hour job timeout
- Resource constraints on free runners

**Matrix:**
```yaml
strategy:
  matrix:
    os: [ubuntu-latest]
    rust: [stable, beta]
    k8s: [k3d, kind]
    tier: [unit, k3d-integration]
```
### Option B: Self-Hosted Runners

**Pros:**
- Full control over environment
- Can run nested virtualization
- No time limits
- Persistent state between runs

**Cons:**
- Maintenance overhead
- Cost of infrastructure
- Security considerations

**Setup:**
- Bare metal servers with KVM support
- Pre-installed k3d, kind, CRC
- OPNSense VM for network tests
### Option C: Hybrid (GitHub + Self-Hosted)

**Pros:**
- Fast unit tests on GitHub runners
- Heavy tests on self-hosted infrastructure
- Cost-effective

**Cons:**
- Two CI systems to maintain
- Complexity in test distribution

### Option D: Cloud CI (CircleCI, GitLab CI, etc.)

**Pros:**
- Often better resource options
- Docker-in-Docker support
- Better nested virtualization

**Cons:**
- Cost
- Less GitHub-native

---
## Performance Requirements

### Target Execution Times

| Test Category | Target Time | Current (est.) |
|---------------|-------------|----------------|
| Compile-time tests | < 30s | Unknown |
| Unit tests | < 60s | Unknown |
| k3d integration (per example) | < 120s | 60-300s |
| Full k3d matrix | < 15 min | 30-60 min |
| OKD integration | < 30 min | 1-2 hours |
| Full E2E | < 2 hours | 4-8 hours |
### Performance Strategies

1. **Parallel execution**: Run independent tests concurrently
2. **Incremental testing**: Only run affected tests on changes
3. **Cached clusters**: Pre-warm k3d clusters
4. **Layered testing**: Fail fast on cheaper tests
5. **Mock external services**: Fake Discord webhooks, etc.
---
## Test Data and Secrets Management

### Secrets Required

| Secret | Use | Storage |
|--------|-----|---------|
| Discord webhook URL | Alert receiver tests | GitHub Secrets |
| OPNSense credentials | Network tests | Self-hosted only |
| Cloud provider creds | Multi-cloud tests | Vault / GitHub Secrets |
| TLS certificates | Ingress tests | Generated on-the-fly |

### Test Data

| Data | Source | Strategy |
|------|--------|----------|
| Container images | Public registries | Cache locally |
| Helm charts | Public repos | Vendor in repo |
| K8s manifests | Generated | Dynamic |

---
## Proposed Test Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                harmony_e2e_tests Package                        │
│              (cargo run -p harmony_e2e_tests)                   │
├─────────────────────────────────────────────────────────────────┤
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐     │
│ │ Compile     │ │ Unit        │ │ Compile-Fail Tests      │     │
│ │ Tests       │ │ Tests       │ │ (trybuild)              │     │
│ │ < 30s       │ │ < 60s       │ │ < 30s                   │     │
│ └─────────────┘ └─────────────┘ └─────────────────────────┘     │
│                                                                 │
│ ┌───────────────────────────────────────────────────────────┐   │
│ │ k3d Integration Tests                                     │   │
│ │ Self-provisions k3d cluster, runs 22 examples             │   │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐           │   │
│ │ │ zitadel │ │ cert-mgr│ │ monitor │ │ postgres│ ...       │   │
│ │ │ 60s     │ │ 90s     │ │ 120s    │ │ 90s     │           │   │
│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘           │   │
│ │ Parallel Execution                                        │   │
│ └───────────────────────────────────────────────────────────┘   │
│                                                                 │
│ ┌───────────────────────────────────────────────────────────┐   │
│ │ OKD Integration Tests                                     │   │
│ │ Connects to existing OKD cluster or provisions via KVM    │   │
│ │ ┌─────────────────┐ ┌─────────────────────────────┐       │   │
│ │ │ okd_cluster_    │ │ rhob_application_           │       │   │
│ │ │ alerts (5 min)  │ │ monitoring (10 min)         │       │   │
│ │ └─────────────────┘ └─────────────────────────────┘       │   │
│ └───────────────────────────────────────────────────────────┘   │
│                                                                 │
│ ┌───────────────────────────────────────────────────────────┐   │
│ │ KVM-based E2E Tests                                       │   │
│ │ Uses Harmony's KVM module to provision test VMs           │   │
│ │ ┌─────────────────┐ ┌─────────────────────────────┐       │   │
│ │ │ okd_installation│ │ Full HA cluster deployment  │       │   │
│ │ │ (30-60 min)     │ │ (60-120 min)                │       │   │
│ │ └─────────────────┘ └─────────────────────────────┘       │   │
│ └───────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

Any CI system (GitHub Actions, GitLab CI, Jenkins, cron) just runs:
    cargo run -p harmony_e2e_tests
```

```
┌─────────────────────────────────────────────────────────────────┐
│                      GitHub Actions                             │
├─────────────────────────────────────────────────────────────────┤
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐     │
│ │ Compile     │ │ Unit        │ │ Compile-Fail Tests      │     │
│ │ Tests       │ │ Tests       │ │ (trybuild)              │     │
│ │ < 30s       │ │ < 60s       │ │ < 30s                   │     │
│ └─────────────┘ └─────────────┘ └─────────────────────────┘     │
│                                                                 │
│ ┌───────────────────────────────────────────────────────────┐   │
│ │ k3d Integration Tests                                     │   │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐           │   │
│ │ │ zitadel │ │ cert-mgr│ │ monitor │ │ postgres│ ...       │   │
│ │ │ 60s     │ │ 90s     │ │ 120s    │ │ 90s     │           │   │
│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘           │   │
│ │ Parallel Execution                                        │   │
│ └───────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    Self-Hosted Runners                          │
├─────────────────────────────────────────────────────────────────┤
│ ┌───────────────────────────────────────────────────────────┐   │
│ │ OKD Integration Tests                                     │   │
│ │ ┌─────────────────┐ ┌─────────────────────────────┐       │   │
│ │ │ okd_cluster_    │ │ rhob_application_           │       │   │
│ │ │ alerts (5 min)  │ │ monitoring (10 min)         │       │   │
│ │ └─────────────────┘ └─────────────────────────────┘       │   │
│ └───────────────────────────────────────────────────────────┘   │
│                                                                 │
│ ┌───────────────────────────────────────────────────────────┐   │
│ │ KVM-based E2E Tests (Harmony provisions)                  │   │
│ │ ┌─────────────────────────────────────────────────────┐   │   │
│ │ │ Harmony KVM Module provisions test VMs              │   │   │
│ │ │ - OKD HA Cluster (3 control plane, 2 workers)       │   │   │
│ │ │ - OPNSense VM (router/firewall)                     │   │   │
│ │ │ - Brocade simulator VM                              │   │   │
│ │ └─────────────────────────────────────────────────────┘   │   │
│ └───────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
```
---

## Questions for Researchers

### Critical Questions

1. **Self-contained test runner**: How to design the `harmony_e2e_tests` package so it runs all tests with a single `cargo run` command?

2. **Nested Virtualization**: What are the prerequisites for running KVM inside a test environment?

3. **Cost Optimization**: How to minimize cloud costs while running comprehensive E2E tests?

4. **Test Isolation**: How to ensure test isolation when running parallel k3d tests?

5. **State Management**: Should we persist k3d clusters between test runs, or create fresh clusters each time?

6. **Mocking Strategy**: Which external services (Discord, OPNSense, etc.) should be mocked vs. real?

7. **Compile-Fail Tests**: Best practices for testing Rust compile-time errors?

8. **Multi-Cluster Tests**: How to efficiently provision and connect multiple K8s clusters in tests?

9. **Secrets Management**: How to handle secrets for test environments without external CI dependencies?

10. **Test Flakiness**: Strategies for reducing flakiness in infrastructure tests?

11. **Reporting**: How to present test results for complex multi-environment test matrices?

12. **Prerequisite Detection**: How to detect and validate prerequisites (Docker, k3d, KVM) before running tests?
### Research Areas

1. **CI/CD Tools**: Evaluate GitHub Actions, GitLab CI, CircleCI, Tekton, Prow for Harmony's needs
2. **K8s Test Tools**: Evaluate kind, k3d, minikube, microk8s for local testing
3. **Mock Frameworks**: Evaluate mock-server, wiremock, hoverfly for external service mocking
4. **Test Frameworks**: Evaluate built-in Rust test, nextest, cargo-tarpaulin for performance

---
## Success Criteria

### Week 1 (Agentic Velocity)
- [ ] Compile-time verification tests working
- [ ] Unit tests for monitoring module
- [ ] First 5 k3d examples running in CI
- [ ] Mock framework for Discord webhooks

### Week 2
- [ ] All 22 k3d-compatible examples in CI
- [ ] OKD self-hosted runner operational
- [ ] KVM module reviewed and ready for CI

### Week 3-4
- [ ] Full E2E tests with KVM infrastructure
- [ ] Multi-cluster tests automated
- [ ] All examples tested in CI

### Month 2
- [ ] Sub-15-minute total CI time
- [ ] Weekly E2E tests on bare metal
- [ ] Documentation complete
- [ ] Ready for CNCF submission

---
## Prerequisites

### Hardware Requirements

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| CPU | 4 cores | 8+ cores (for parallel tests) |
| RAM | 8 GB | 32 GB (for KVM E2E) |
| Disk | 50 GB SSD | 500 GB NVMe |

The required tooling (Docker, k3d, kubectl, libvirt) is listed under Software Requirements below.
### Software Requirements

| Tool | Version |
|------|---------|
| Rust | 1.75+ |
| Docker | 24.0+ |
| k3d | v5.6.0+ |
| kubectl | v1.28+ |
| libvirt | 9.0+ (for KVM tests) |
### Installation (One-time)

```bash
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install Docker
curl -fsSL https://get.docker.com | sh

# Install k3d
curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash

# Install kubectl
curl -LO "https://dl.k8s.io/release/v1.28.0/bin/linux/amd64/kubectl"
sudo install -m 0755 kubectl /usr/local/bin/kubectl
```

---
## Reference Materials

### Existing Code

- Examples: `examples/*/src/main.rs`
- Topologies: `harmony/src/domain/topology/`
- Capabilities: `harmony/src/domain/topology/` (trait definitions)
- Scores: `harmony/src/modules/*/`

### Documentation

- [Coding Guide](docs/coding-guide.md)
- [Core Concepts](docs/concepts.md)
- [Monitoring Architecture](docs/monitoring.md)
- [ADR-020: Monitoring](adr/020-monitoring-alerting-architecture.md)

### Related Projects

- Crossplane (similar abstraction model)
- Pulumi (infrastructure as code)
- Terraform (state management patterns)
- Flux/ArgoCD (GitOps testing patterns)
---

CI_and_testing_roadmap.md (new file, 201 lines) @@ -0,0 +1,201 @@
# Pragmatic CI and Testing Roadmap for Harmony

**Status**: Active implementation (March 2026)
**Core Principle**: Self-contained test runner — no dependency on centralized CI servers

All tests are executable via one command:

```bash
cargo run -p harmony_e2e_tests
```

The `harmony_e2e_tests` package:
- Provisions its own infrastructure when needed (k3d, KVM VMs)
- Runs all test tiers in sequence or selectively
- Reports results in text, JSON, or JUnit XML
- Works identically on developer laptops, any Linux server, GitHub Actions, GitLab CI, Jenkins, cron jobs, etc.
- Is the single source of truth for what "passing CI" means
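The JUnit XML reporting can stay dependency-free. The sketch below emits one `<testsuite>` with one `<testcase>` per tier result, which is the subset most CI dashboards consume; the `TierResult` type is a hypothetical stand-in for whatever result struct the runner ends up with, and real JUnit output carries more attributes (timings, skipped counts) than shown here.

```rust
// Hypothetical per-tier result; the real runner's result type may differ.
struct TierResult {
    name: String,
    failure: Option<String>, // None = passed, Some(msg) = failed
}

// Emits a minimal JUnit-style <testsuite> document as a String.
fn to_junit_xml(results: &[TierResult]) -> String {
    let failures = results.iter().filter(|r| r.failure.is_some()).count();
    let mut xml = format!(
        "<testsuite name=\"harmony_e2e_tests\" tests=\"{}\" failures=\"{}\">\n",
        results.len(),
        failures
    );
    for r in results {
        match &r.failure {
            None => xml.push_str(&format!("  <testcase name=\"{}\"/>\n", r.name)),
            Some(msg) => xml.push_str(&format!(
                "  <testcase name=\"{}\"><failure message=\"{}\"/></testcase>\n",
                r.name, msg
            )),
        }
    }
    xml.push_str("</testsuite>\n");
    xml
}
```

A real implementation would also XML-escape names and messages; that detail is omitted here for brevity.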
## Why This Approach

1. **Portability** — same command & behavior everywhere
2. **Harmony tests Harmony** — the framework validates itself
3. **No vendor lock-in** — GitHub Actions / GitLab CI are just triggers
4. **Perfect reproducibility** — developers reproduce any CI failure locally in seconds
5. **Offline capable** — after initial setup, most tiers run without internet
## Architecture: `harmony_e2e_tests` Package

```
harmony_e2e_tests/
├── Cargo.toml
├── src/
│   ├── main.rs              # CLI entry point
│   ├── lib.rs               # Test runner core logic
│   ├── tiers/
│   │   ├── mod.rs
│   │   ├── compile_fail.rs  # trybuild-based compile-time checks
│   │   ├── unit.rs          # cargo test --lib --workspace
│   │   ├── k3d.rs           # k3d cluster + parallel example runs
│   │   ├── okd.rs           # connect to existing OKD cluster
│   │   └── kvm.rs           # full E2E via Harmony's own KVM module
│   ├── mocks/
│   │   ├── mod.rs
│   │   ├── discord.rs       # mock Discord webhook receiver
│   │   └── opnsense.rs      # mock OPNSense firewall API
│   └── infrastructure/
│       ├── mod.rs
│       ├── k3d.rs           # k3d cluster lifecycle
│       └── kvm.rs           # helper wrappers around KVM score
└── tests/
    ├── ui/                  # trybuild compile-fail cases (*.rs + *.stderr)
    └── fixtures/            # static test data / golden files
```
## CLI Interface (clap-based)

```bash
# Run everything (default)
cargo run -p harmony_e2e_tests

# Specific tier
cargo run -p harmony_e2e_tests -- --tier k3d
cargo run -p harmony_e2e_tests -- --tier compile

# Filter to one example
cargo run -p harmony_e2e_tests -- --tier k3d --example monitoring

# Parallelism control (k3d tier)
cargo run -p harmony_e2e_tests -- --parallel 8

# Reporting
cargo run -p harmony_e2e_tests -- --report junit.xml
cargo run -p harmony_e2e_tests -- --format json

# Debug helpers
cargo run -p harmony_e2e_tests -- --verbose --dry-run
```
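The flag surface above would normally be declared with clap derive macros; to keep this sketch self-contained it is shown as a std-only parser with the same semantics. The default values (tier `all`, parallelism 4) are assumptions for illustration, not decisions the roadmap has made.

```rust
// Std-only sketch of the CLI surface above (the real package would use clap).
// Defaults shown here are assumptions, not committed behavior.

#[derive(Debug, PartialEq)]
struct Cli {
    tiers: Vec<String>,      // --tier compile,unit,k3d
    example: Option<String>, // --example monitoring
    parallel: usize,         // --parallel 8
    verbose: bool,           // --verbose
}

fn parse_cli(args: &[&str]) -> Cli {
    let mut cli = Cli {
        tiers: vec!["all".into()],
        example: None,
        parallel: 4,
        verbose: false,
    };
    let mut it = args.iter();
    while let Some(arg) = it.next() {
        match *arg {
            "--tier" => {
                if let Some(v) = it.next() {
                    // Comma-separated tier list, e.g. "compile,unit,k3d"
                    cli.tiers = v.split(',').map(str::to_string).collect();
                }
            }
            "--example" => cli.example = it.next().map(|v| v.to_string()),
            "--parallel" => {
                cli.parallel = it.next().and_then(|v| v.parse().ok()).unwrap_or(cli.parallel)
            }
            "--verbose" => cli.verbose = true,
            _ => {} // unrecognized flags ignored in this sketch
        }
    }
    cli
}
```

With clap, each field becomes a `#[arg(...)]` attribute and the unrecognized-flag case becomes a hard error instead of being ignored.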
## Test Tiers – Ordered by Speed & Cost

| Tier | Duration target | Runner type | What it tests | Isolation strategy |
|------|-----------------|-------------|---------------|--------------------|
| Compile-fail | < 20 s | Any (GitHub free) | Invalid configs don't compile | Per-file trybuild |
| Unit | < 60 s | Any | Pure Rust logic | cargo test |
| k3d | 8–15 min | GitHub / self-hosted | 22+ k3d-compatible examples | Fresh k3d cluster + ns-per-example |
| OKD | 10–30 min | Self-hosted / CRC | OKD-specific features (Routes, Monitoring CRDs…) | Existing cluster via KUBECONFIG |
| KVM Full E2E | 60–180 min | Self-hosted bare-metal | Full HA OKD install + bare-metal scenarios | Harmony KVM score provisions VMs |
### Tier Details & Implementation Notes

1. **Compile-fail**
   Uses the **`trybuild`** crate (standard in the Rust ecosystem).
   Place intentional compile errors in `tests/ui/*.rs` with matching `*.stderr` expectation files.
   One test function replaces the old custom loop:

   ```rust
   #[test]
   fn ui() {
       let t = trybuild::TestCases::new();
       t.compile_fail("tests/ui/*.rs");
   }
   ```

2. **Unit**
   Simple wrapper: `cargo test --lib --workspace -- --nocapture`
   Consider `cargo-nextest` later for a 2–3× speedup if the test count grows.
3. **k3d**
   - Provisions an isolated cluster once at start: `k3d cluster create --agents 3 --no-lb --k3s-arg "--disable=traefik@server:0"` (Traefik is disabled via `--k3s-arg`; k3d has no bare `--disable traefik` flag)
   - Discovers examples via `test-tier = "k3d"` under `[package.metadata.harmony]` in each example's `Cargo.toml`
   - Runs examples in parallel under a tokio semaphore (default 5–8 slots)
   - Each example gets its own namespace
   - Uses `defer` / `scopeguard` for guaranteed cleanup
   - Mocks the Discord webhook and OPNSense API
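The semaphore-bounded parallelism for the k3d tier can be illustrated without tokio: a channel holding N permits gives the same bound with plain threads. This is a sketch of the concurrency pattern only; the real runner would use `tokio::sync::Semaphore` around async example runs.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

// Runs `jobs` fake workloads with at most `slots` running concurrently,
// and returns the maximum concurrency actually observed.
fn run_bounded(jobs: usize, slots: usize) -> usize {
    // A channel pre-loaded with `slots` permits acts as a semaphore.
    let (permit_tx, permit_rx) = mpsc::channel();
    for _ in 0..slots {
        permit_tx.send(()).unwrap();
    }
    let permit_rx = Arc::new(Mutex::new(permit_rx));
    // (currently running, max ever running)
    let counters = Arc::new(Mutex::new((0usize, 0usize)));

    let handles: Vec<_> = (0..jobs)
        .map(|_| {
            let permit_rx = Arc::clone(&permit_rx);
            let permit_tx = permit_tx.clone();
            let counters = Arc::clone(&counters);
            thread::spawn(move || {
                // Acquire a permit before "running the example".
                permit_rx.lock().unwrap().recv().unwrap();
                {
                    let mut c = counters.lock().unwrap();
                    c.0 += 1;
                    c.1 = c.1.max(c.0);
                }
                thread::sleep(std::time::Duration::from_millis(10)); // fake work
                counters.lock().unwrap().0 -= 1;
                permit_tx.send(()).unwrap(); // release the permit
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let max = counters.lock().unwrap().1;
    max
}
```

The per-example namespace and `scopeguard` cleanup mentioned above would wrap the "fake work" section, so a panicking example still releases its permit and deletes its namespace.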
||||
4. **OKD**
   Connects to a pre-provisioned cluster via `KUBECONFIG`.
   Validates that it is actually OpenShift/OKD before proceeding.

5. **KVM**
   Uses **Harmony’s own KVM module** to provision test VMs (control-plane + workers + OPNSense).
   → True “dogfooding” — if the E2E fails, the KVM score itself is likely broken.
## CI Integration Patterns

### Fast PR validation (GitHub Actions)

```yaml
name: Fast Tests
on: [push, pull_request]
jobs:
  fast:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      - name: Install Docker & k3d
        uses: nolar/setup-k3d-k3s@v1
      - run: cargo run -p harmony_e2e_tests -- --tier compile,unit,k3d --report junit.xml
      - uses: actions/upload-artifact@v4
        with: { name: test-results, path: junit.xml }
```
### Nightly / merge heavy tests (self-hosted runner)

```yaml
name: Full E2E
on:
  schedule: [{ cron: "0 3 * * *" }]
  push: { branches: [main] }
jobs:
  full:
    runs-on: [self-hosted, linux, x64, kvm-capable]
    steps:
      - uses: actions/checkout@v4
      - run: cargo run -p harmony_e2e_tests -- --tier okd,kvm --verbose --report junit.xml
```
## Prerequisites Auto-Check & Install

```rust
// in harmony_e2e_tests/src/infrastructure/prerequisites.rs
async fn ensure_k3d() -> Result<()> { … }    // curl | bash if missing
async fn ensure_docker() -> Result<()> { … }
fn check_kvm_support() -> Result<()> { … }   // /dev/kvm + libvirt
```
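A concrete (and testable) version of `check_kvm_support` can start with the `/dev/kvm` existence check, since that is the one signal that works identically on laptops, servers, and VMs with nested virtualization. Taking the device path as a parameter is an assumption made here purely so the function can be exercised without real KVM; a fuller check would also probe libvirt.

```rust
use std::path::Path;

// Sketch of the /dev/kvm half of check_kvm_support. The path parameter
// exists only to make the function testable; production code would pass
// Path::new("/dev/kvm").
fn kvm_available(dev_kvm: &Path) -> Result<(), String> {
    if dev_kvm.exists() {
        Ok(())
    } else {
        Err(format!(
            "{} not found: enable VT-x/AMD-V in firmware (and nested virt on VMs)",
            dev_kvm.display()
        ))
    }
}
```

Running this before the KVM tier turns an hour-long E2E failure into an immediate, actionable error message.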
## Success Criteria

### Step 1
- [ ] `harmony_e2e_tests` package created & basic CLI working
- [ ] trybuild compile-fail suite passing
- [ ] First 8–10 k3d examples running reliably in CI
- [ ] Mock server for Discord webhook completed

### Step 2
- [ ] All 22 k3d-compatible examples green
- [ ] OKD tier running on dedicated self-hosted runner
- [ ] JUnit reporting + GitHub check integration
- [ ] Namespace isolation + automatic retry on transient k8s errors

### Step 3
- [ ] KVM full E2E green on bare-metal runner (nightly)
- [ ] Multi-cluster examples (nats, multisite-postgres) automated
- [ ] Total fast CI time < 12 minutes on GitHub runners
- [ ] Documentation: “How to add a new tested example”
## Quick Start for New Contributors

```bash
# One-time setup
rustup update stable
cargo install cargo-nextest   # optional but recommended (trybuild is a dev-dependency, not an installable binary)

# Run locally (most common)
cargo run -p harmony_e2e_tests -- --tier k3d --verbose

# Just compile checks + unit
cargo test -p harmony_e2e_tests
```
---

Cargo.lock (generated, 624 lines changed)
@@ -297,6 +297,12 @@ dependencies = [
 "libc",
]

[[package]]
name = "ansi_term"
version = "0.10.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6b3568b48b7cefa6b8ce125f9bb4989e52fbcc29ebea88df04cc7c5f12f70455"

[[package]]
name = "anstream"
version = "0.6.21"

@@ -718,6 +724,41 @@ dependencies = [
 "tokio",
]

[[package]]
name = "brocade-snmp-server"
version = "0.1.0"
dependencies = [
 "base64 0.22.1",
 "brocade",
 "env_logger",
 "harmony",
 "harmony_cli",
 "harmony_macros",
 "harmony_secret",
 "harmony_types",
 "log",
 "serde",
 "tokio",
 "url",
]

[[package]]
name = "brocade-switch"
version = "0.1.0"
dependencies = [
 "async-trait",
 "brocade",
 "env_logger",
 "harmony",
 "harmony_cli",
 "harmony_macros",
 "harmony_types",
 "log",
 "serde",
 "tokio",
 "url",
]

[[package]]
name = "brotli"
version = "8.0.2"

@@ -871,6 +912,22 @@ dependencies = [
 "shlex",
]

[[package]]
name = "cert_manager"
version = "0.1.0"
dependencies = [
 "assert_cmd",
 "cidr",
 "env_logger",
 "harmony",
 "harmony_cli",
 "harmony_macros",
 "harmony_types",
 "log",
 "tokio",
 "url",
]

[[package]]
name = "cfg-if"
version = "1.0.4"

@@ -1853,6 +1910,12 @@ dependencies = [
 "regex",
]

[[package]]
name = "env_home"
version = "0.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c7f84e12ccf0a7ddc17a6c41c93326024c42920d7ee630d04950e6926645c0fe"

[[package]]
name = "env_logger"
version = "0.11.9"

@@ -1929,6 +1992,457 @@ dependencies = [
name = "example"
version = "0.0.0"

[[package]]
name = "example-application-monitoring-with-tenant"
version = "0.1.0"
dependencies = [
 "env_logger",
 "harmony",
 "harmony_cli",
 "harmony_types",
 "logging",
 "tokio",
 "url",
]

[[package]]
name = "example-cli"
version = "0.1.0"
dependencies = [
 "assert_cmd",
 "cidr",
 "env_logger",
 "harmony",
 "harmony_cli",
 "harmony_macros",
 "harmony_types",
 "log",
 "tokio",
 "url",
]

[[package]]
name = "example-k8s-drain-node"
version = "0.1.0"
dependencies = [
 "assert_cmd",
 "cidr",
 "env_logger",
 "harmony",
 "harmony-k8s",
 "harmony_cli",
 "harmony_macros",
 "harmony_types",
 "inquire 0.7.5",
 "log",
 "tokio",
 "url",
]

[[package]]
name = "example-k8s-write-file-on-node"
|
||||
version = "0.1.0"
|
||||
dependencies = [
|
||||
"assert_cmd",
|
||||
"cidr",
|
||||
"env_logger",
|
||||
"harmony",
|
||||
"harmony-k8s",
|
||||
"harmony_cli",
|
||||
"harmony_macros",
|
||||
"harmony_types",
|
||||
"inquire 0.7.5",
|
||||
"log",
|
||||
"tokio",
|
||||
"url",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "example-kube-rs"
|
||||
version = "0.1.0"
|
||||
dependencies = [
|
||||
"cidr",
|
||||
"env_logger",
|
||||
"harmony",
|
||||
"harmony_macros",
|
||||
"http 1.4.0",
|
||||
"inquire 0.7.5",
|
||||
"k8s-openapi",
|
||||
"kube",
|
||||
"log",
|
||||
"serde_yaml",
|
||||
"tokio",
|
||||
"url",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "example-lamp"
|
||||
version = "0.1.0"
|
||||
dependencies = [
|
||||
"cidr",
|
||||
"env_logger",
|
||||
"harmony",
|
||||
"harmony_cli",
|
||||
"harmony_macros",
|
||||
"harmony_types",
|
||||
"log",
|
||||
"tokio",
|
||||
"url",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "example-monitoring"
|
||||
version = "0.1.0"
|
||||
dependencies = [
|
||||
"harmony",
|
||||
"harmony_cli",
|
||||
"harmony_macros",
|
||||
"harmony_types",
|
||||
"tokio",
|
||||
"url",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "example-monitoring-with-tenant"
|
||||
version = "0.1.0"
|
||||
dependencies = [
|
||||
"cidr",
|
||||
"harmony",
|
||||
"harmony_cli",
|
||||
"harmony_types",
|
||||
"tokio",
|
||||
"url",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "example-multisite-postgres"
|
||||
version = "0.1.0"
|
||||
dependencies = [
|
||||
"cidr",
|
||||
"env_logger",
|
||||
"harmony",
|
||||
"harmony_cli",
|
||||
"harmony_macros",
|
||||
"harmony_types",
|
||||
"log",
|
||||
"tokio",
|
||||
"url",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "example-nats"
|
||||
version = "0.1.0"
|
||||
dependencies = [
|
||||
"cidr",
|
||||
"env_logger",
|
||||
"harmony",
|
||||
"harmony_cli",
|
||||
"harmony_macros",
|
||||
"harmony_types",
|
||||
"log",
|
||||
"tokio",
|
||||
"url",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "example-nats-module-supercluster"
|
||||
version = "0.1.0"
|
||||
dependencies = [
|
||||
"cidr",
|
||||
"env_logger",
|
||||
"harmony",
|
||||
"harmony_cli",
|
||||
"harmony_macros",
|
||||
"harmony_types",
|
||||
"k8s-openapi",
|
||||
"log",
|
||||
"tokio",
|
||||
"url",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "example-nats-supercluster"
|
||||
version = "0.1.0"
|
||||
dependencies = [
|
||||
"cidr",
|
||||
"env_logger",
|
||||
"harmony",
|
||||
"harmony_cli",
|
||||
"harmony_macros",
|
||||
"harmony_types",
|
||||
"k8s-openapi",
|
||||
"log",
|
||||
"tokio",
|
||||
"url",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "example-node-health"
|
||||
version = "0.1.0"
|
||||
dependencies = [
|
||||
"env_logger",
|
||||
"harmony",
|
||||
"harmony_cli",
|
||||
"harmony_macros",
|
||||
"harmony_types",
|
||||
"log",
|
||||
"tokio",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "example-ntfy"
|
||||
version = "0.1.0"
|
||||
dependencies = [
|
||||
"harmony",
|
||||
"harmony_cli",
|
||||
"tokio",
|
||||
"url",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "example-okd-cluster-alerts"
|
||||
version = "0.1.0"
|
||||
dependencies = [
|
||||
"brocade",
|
||||
"cidr",
|
||||
"env_logger",
|
||||
"harmony",
|
||||
"harmony_cli",
|
||||
"harmony_macros",
|
||||
"harmony_secret",
|
||||
"harmony_secret_derive",
|
||||
"harmony_types",
|
||||
"log",
|
||||
"serde",
|
||||
"tokio",
|
||||
"url",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "example-okd-install"
|
||||
version = "0.1.0"
|
||||
dependencies = [
|
||||
"brocade",
|
||||
"cidr",
|
||||
"env_logger",
|
||||
"harmony",
|
||||
"harmony_cli",
|
||||
"harmony_macros",
|
||||
"harmony_secret",
|
||||
"harmony_secret_derive",
|
||||
"harmony_types",
|
||||
"log",
|
||||
"schemars 0.8.22",
|
||||
"serde",
|
||||
"tokio",
|
||||
"url",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "example-openbao"
|
||||
version = "0.1.0"
|
||||
dependencies = [
|
||||
"harmony",
|
||||
"harmony_cli",
|
||||
"harmony_macros",
|
||||
"harmony_types",
|
||||
"tokio",
|
||||
"url",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "example-operatorhub-catalogsource"
|
||||
version = "0.1.0"
|
||||
dependencies = [
|
||||
"cidr",
|
||||
"env_logger",
|
||||
"harmony",
|
||||
"harmony_cli",
|
||||
"harmony_macros",
|
||||
"harmony_types",
|
||||
"log",
|
||||
"tokio",
|
||||
"url",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "example-opnsense"
|
||||
version = "0.1.0"
|
||||
dependencies = [
|
||||
"brocade",
|
||||
"cidr",
|
||||
"env_logger",
|
||||
"harmony",
|
||||
"harmony_cli",
|
||||
"harmony_macros",
|
||||
"harmony_secret",
|
||||
"harmony_types",
|
||||
"log",
|
||||
"schemars 0.8.22",
|
||||
"serde",
|
||||
"tokio",
|
||||
"url",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "example-opnsense-node-exporter"
|
||||
version = "0.1.0"
|
||||
dependencies = [
|
||||
"async-trait",
|
||||
"cidr",
|
||||
"env_logger",
|
||||
"harmony",
|
||||
"harmony_cli",
|
||||
"harmony_macros",
|
||||
"harmony_secret",
|
||||
"harmony_secret_derive",
|
||||
"harmony_types",
|
||||
"log",
|
||||
"serde",
|
||||
"tokio",
|
||||
"url",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "example-postgresql"
|
||||
version = "0.1.0"
|
||||
dependencies = [
|
||||
"cidr",
|
||||
"env_logger",
|
||||
"harmony",
|
||||
"harmony_cli",
|
||||
"harmony_macros",
|
||||
"harmony_types",
|
||||
"log",
|
||||
"tokio",
|
||||
"url",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "example-public-postgres"
|
||||
version = "0.1.0"
|
||||
dependencies = [
|
||||
"cidr",
|
||||
"env_logger",
|
||||
"harmony",
|
||||
"harmony_cli",
|
||||
"harmony_macros",
|
||||
"harmony_types",
|
||||
"log",
|
||||
"tokio",
|
||||
"url",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "example-pxe"
|
||||
version = "0.1.0"
|
||||
dependencies = [
|
||||
"brocade",
|
||||
"cidr",
|
||||
"env_logger",
|
||||
"harmony",
|
||||
"harmony_cli",
|
||||
"harmony_macros",
|
||||
"harmony_secret",
|
||||
"harmony_secret_derive",
|
||||
"harmony_types",
|
||||
"log",
|
||||
"schemars 0.8.22",
|
||||
"serde",
|
||||
"tokio",
|
||||
"url",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "example-remove-rook-osd"
|
||||
version = "0.1.0"
|
||||
dependencies = [
|
||||
"harmony",
|
||||
"harmony_cli",
|
||||
"tokio",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "example-rust"
|
||||
version = "0.1.0"
|
||||
dependencies = [
|
||||
"base64 0.22.1",
|
||||
"env_logger",
|
||||
"harmony",
|
||||
"harmony_cli",
|
||||
"harmony_macros",
|
||||
"harmony_types",
|
||||
"log",
|
||||
"tokio",
|
||||
"url",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "example-tenant"
|
||||
version = "0.1.0"
|
||||
dependencies = [
|
||||
"cidr",
|
||||
"env_logger",
|
||||
"harmony",
|
||||
"harmony_cli",
|
||||
"harmony_macros",
|
||||
"harmony_types",
|
||||
"log",
|
||||
"tokio",
|
||||
"url",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "example-try-rust-webapp"
|
||||
version = "0.1.0"
|
||||
dependencies = [
|
||||
"base64 0.22.1",
|
||||
"env_logger",
|
||||
"harmony",
|
||||
"harmony_cli",
|
||||
"harmony_macros",
|
||||
"harmony_types",
|
||||
"log",
|
||||
"tokio",
|
||||
"url",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "example-tui"
|
||||
version = "0.1.0"
|
||||
dependencies = [
|
||||
"cidr",
|
||||
"env_logger",
|
||||
"harmony",
|
||||
"harmony_macros",
|
||||
"harmony_tui",
|
||||
"harmony_types",
|
||||
"log",
|
||||
"tokio",
|
||||
"url",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "example-zitadel"
|
||||
version = "0.1.0"
|
||||
dependencies = [
|
||||
"harmony",
|
||||
"harmony_cli",
|
||||
"harmony_macros",
|
||||
"harmony_types",
|
||||
"tokio",
|
||||
"url",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "example_validate_ceph_cluster_health"
|
||||
version = "0.1.0"
|
||||
dependencies = [
|
||||
"harmony",
|
||||
"harmony_cli",
|
||||
"tokio",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "eyre"
|
||||
version = "0.6.12"
|
||||
@@ -2540,6 +3054,30 @@ dependencies = [
|
||||
"tokio",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "harmony_e2e_tests"
|
||||
version = "0.1.0"
|
||||
dependencies = [
|
||||
"async-trait",
|
||||
"chrono",
|
||||
"clap",
|
||||
"env_logger",
|
||||
"harmony",
|
||||
"inventory",
|
||||
"k3d-rs",
|
||||
"k8s-openapi",
|
||||
"kube",
|
||||
"log",
|
||||
"serde",
|
||||
"serde_json",
|
||||
"sqlx",
|
||||
"tempfile",
|
||||
"thiserror 2.0.18",
|
||||
"tokio",
|
||||
"tokio-stream",
|
||||
"which",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "harmony_execution"
|
||||
version = "0.1.0"
|
||||
@@ -2569,6 +3107,19 @@ dependencies = [
|
||||
"tokio",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "harmony_inventory_builder"
|
||||
version = "0.1.0"
|
||||
dependencies = [
|
||||
"cidr",
|
||||
"harmony",
|
||||
"harmony_cli",
|
||||
"harmony_macros",
|
||||
"harmony_types",
|
||||
"tokio",
|
||||
"url",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "harmony_macros"
|
||||
version = "0.1.0"
|
||||
@@ -3333,6 +3884,15 @@ dependencies = [
|
||||
"thiserror 1.0.69",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "inventory"
|
||||
version = "0.3.22"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "009ae045c87e7082cb72dab0ccd01ae075dd00141ddc108f43a0ea150a9e7227"
|
||||
dependencies = [
|
||||
"rustversion",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "ipnet"
|
||||
version = "2.12.0"
|
||||
@@ -3732,6 +4292,15 @@ dependencies = [
|
||||
"log",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "logging"
|
||||
version = "0.1.0"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "461a8beca676e8ab1bd468c92e9b4436d6368e11e96ae038209e520cfe665e46"
|
||||
dependencies = [
|
||||
"ansi_term",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "lru"
|
||||
version = "0.12.5"
|
||||
@@ -4954,6 +5523,21 @@ dependencies = [
|
||||
"subtle",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "rhob-application-monitoring"
|
||||
version = "0.1.0"
|
||||
dependencies = [
|
||||
"base64 0.22.1",
|
||||
"env_logger",
|
||||
"harmony",
|
||||
"harmony_cli",
|
||||
"harmony_macros",
|
||||
"harmony_types",
|
||||
"log",
|
||||
"tokio",
|
||||
"url",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "ring"
|
||||
version = "0.17.14"
|
||||
@@ -5927,6 +6511,7 @@ dependencies = [
|
||||
"memchr",
|
||||
"once_cell",
|
||||
"percent-encoding",
|
||||
"rustls 0.23.37",
|
||||
"serde",
|
||||
"serde_json",
|
||||
"sha2",
|
||||
@@ -5936,6 +6521,7 @@ dependencies = [
|
||||
"tokio-stream",
|
||||
"tracing",
|
||||
"url",
|
||||
"webpki-roots 0.26.11",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
@@ -6208,6 +6794,26 @@ dependencies = [
|
||||
"syn 2.0.117",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "sttest"
|
||||
version = "0.1.0"
|
||||
dependencies = [
|
||||
"brocade",
|
||||
"cidr",
|
||||
"env_logger",
|
||||
"harmony",
|
||||
"harmony_cli",
|
||||
"harmony_macros",
|
||||
"harmony_secret",
|
||||
"harmony_secret_derive",
|
||||
"harmony_types",
|
||||
"log",
|
||||
"schemars 0.8.22",
|
||||
"serde",
|
||||
"tokio",
|
||||
"url",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "subtle"
|
||||
version = "2.6.1"
|
||||
@@ -7210,6 +7816,18 @@ dependencies = [
|
||||
"rustls-pki-types",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "which"
|
||||
version = "7.0.3"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "24d643ce3fd3e5b54854602a080f34fb10ab75e0b813ee32d00ca2b44fa74762"
|
||||
dependencies = [
|
||||
"either",
|
||||
"env_home",
|
||||
"rustix 1.1.4",
|
||||
"winsafe",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "whoami"
|
||||
version = "1.6.1"
|
||||
@@ -7585,6 +8203,12 @@ dependencies = [
|
||||
"windows-sys 0.48.0",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "winsafe"
|
||||
version = "0.0.19"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "d135d17ab770252ad95e9a872d365cf3090e3be864a34ab46f48555993efc904"
|
||||
|
||||
[[package]]
|
||||
name = "wit-bindgen"
|
||||
version = "0.51.0"
|
||||
|
||||
Cargo.toml: 10 changed lines
@@ -2,6 +2,7 @@
resolver = "2"
members = [
    "private_repos/*",
    "examples/*",
    "harmony",
    "harmony_types",
    "harmony_macros",
@@ -16,9 +17,12 @@ members = [
    "harmony_secret_derive",
    "harmony_secret",
    "adr/agent_discovery/mdns",
    "brocade",
    "harmony_agent",
    "harmony_agent/deploy", "harmony_node_readiness", "harmony-k8s",
    "brocade",
    "harmony_agent",
    "harmony_agent/deploy",
    "harmony_node_readiness",
    "harmony-k8s",
    "harmony_e2e_tests",
]

[workspace.package]
docs/coding-guide.md (new file): 299 lines
@@ -0,0 +1,299 @@
# Harmony Coding Guide

Harmony is an infrastructure automation framework. It is **code-first and code-only**: operators write Rust programs to declare and drive infrastructure, rather than YAML files or DSL configs. Good code here means a good operator experience.

### Concrete context

This guide uses the KVM module as its running example. The KVM context makes the coding style easy to follow, and the principles translate directly to other modules and contexts managed by Harmony, such as OPNsense and Kubernetes.

## Core Philosophy

### The Careful Craftsman Principle

Harmony is a powerful framework that does a lot. With that power comes responsibility. Every abstraction, every trait, every module must earn its place. Before adding anything, ask:

1. **Does this solve a real problem users have?** Not a theoretical problem, an actual one encountered in production.
2. **Is this the simplest solution that works?** Complexity is a cost that compounds over time.
3. **Will this make the next developer's life easier or harder?** Code is read far more often than written.

When in doubt, don't abstract. Wait for the pattern to emerge from real usage. A little duplication is better than the wrong abstraction.

### High-level functions over raw primitives

Callers should not need to know about underlying protocols, XML schemas, or API quirks. A function that deploys a VM should accept meaningful parameters like CPU count, memory, and network name — not XML strings.

```rust
// Bad: caller constructs XML and passes it to a thin wrapper
let xml = format!(r#"<domain type='kvm'>...</domain>"#, name, memory_kb, ...);
executor.create_vm(&xml).await?;

// Good: caller describes intent, the module handles representation
executor.define_vm(&VmConfig::builder("my-vm")
    .cpu(4)
    .memory_gb(8)
    .disk(DiskConfig::new(50))
    .network(NetworkRef::named("mylan"))
    .boot_order([BootDevice::Network, BootDevice::Disk])
    .build())
    .await?;
```

The module owns the XML, the virsh invocations, the API calls — not the caller.

### Use the right abstraction layer

Prefer native library bindings over shelling out to CLI tools. The `virt` crate provides direct libvirt bindings and should be used instead of spawning `virsh` subprocesses.

- CLI subprocess calls are fragile: stdout/stderr parsing, exit codes, quoting, PATH differences
- Native bindings give typed errors, no temp files, no shell escaping
- `virt::connect::Connect` opens a connection; `virt::domain::Domain` manages VMs; `virt::network::Network` manages virtual networks

### Keep functions small and well-named

Each function should do one thing. If a function is doing two conceptually separate things, split it. Function names should read like plain English: `ensure_network_active`, `define_vm`, `vm_is_running`.

### Prefer short modules over large files

Group related types and functions by concept. A module that handles one resource (e.g., network, domain, storage) is better than a single file for everything.

---
## Error Handling

### Use `thiserror` for all error types

Define error types with `thiserror::Error`. This removes the boilerplate of implementing `Display` and `std::error::Error` by hand, keeps error messages close to their variants, and makes types easy to extend.

```rust
// Bad: hand-rolled Display + std::error::Error
#[derive(Debug)]
pub enum KVMError {
    ConnectionError(String),
    VMNotFound(String),
}

impl std::fmt::Display for KVMError { ... }
impl std::error::Error for KVMError {}

// Good: derive Display via thiserror
#[derive(thiserror::Error, Debug)]
pub enum KVMError {
    #[error("connection failed: {0}")]
    ConnectionFailed(String),
    #[error("VM not found: {name}")]
    VmNotFound { name: String },
}
```

### Make bubbling errors easy with `?` and `From`

`?` works on any error type for which there is a `From` impl. Add `From` conversions from lower-level errors into your module's error type so callers can use `?` without boilerplate.

With `thiserror`, wrapping a foreign error is one line:

```rust
#[derive(thiserror::Error, Debug)]
pub enum KVMError {
    #[error("libvirt error: {0}")]
    Libvirt(#[from] virt::error::Error),

    #[error("IO error: {0}")]
    Io(#[from] std::io::Error),
}
```

This means a call that returns `virt::error::Error` can be `?`-propagated into a `Result<_, KVMError>` without any `.map_err(...)`.
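As a dependency-free sketch of that propagation, the snippet below hand-rolls the `Display`, `Error`, and `From` impls that `thiserror` would otherwise derive; `KvmError` and `read_domain_xml` are illustrative names, not the real module API:

```rust
use std::fmt;

// Hypothetical module error. Display/Error are written by hand here only so
// the sketch is dependency-free; with thiserror these impls are derived.
#[derive(Debug)]
enum KvmError {
    Io(std::io::Error),
}

impl fmt::Display for KvmError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            KvmError::Io(e) => write!(f, "IO error: {e}"),
        }
    }
}

impl std::error::Error for KvmError {}

// The `From` impl is what `?` relies on to convert the lower-level error.
impl From<std::io::Error> for KvmError {
    fn from(e: std::io::Error) -> Self {
        KvmError::Io(e)
    }
}

// A call returning std::io::Error propagates into Result<_, KvmError>
// with a bare `?`; no .map_err(...) needed.
fn read_domain_xml(path: &str) -> Result<String, KvmError> {
    let xml = std::fs::read_to_string(path)?; // io::Error -> KvmError via From
    Ok(xml)
}

fn main() {
    let err = read_domain_xml("/nonexistent/domain.xml").unwrap_err();
    println!("{err}");
}
```

The same shape applies to wrapping `virt::error::Error`: one variant, one `From` impl, and every libvirt call site stays a plain `?`.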
### Typed errors over stringly-typed errors

Avoid `Box<dyn Error>` or `String` as error return types in library code. Callers need to distinguish errors programmatically — `KVMError::VmAlreadyExists` is actionable, `"VM already exists: foo"` as a `String` is not.

At binary entry points (e.g., `main`) it is acceptable to convert to `String` or `anyhow::Error` for display.

---
## Logging

### Use the `log` crate macros

All log output must go through the `log` crate. Never use `println!`, `eprintln!`, or `dbg!` in library code. This makes output compatible with any logging backend (env_logger, tracing, structured logging, etc.).

```rust
// Bad
println!("Creating VM: {}", name);

// Good
use log::{info, debug, warn};
info!("Creating VM: {name}");
debug!("VM XML:\n{xml}");
warn!("Network already active, skipping creation");
```

Use the right level:

| Level   | When to use |
|---------|-------------|
| `error` | Unrecoverable failures (before returning `Err`) |
| `warn`  | Recoverable issues, skipped steps |
| `info`  | High-level progress events visible in normal operation |
| `debug` | Detailed operational info useful for debugging |
| `trace` | Very granular, per-iteration or per-call data |

Log before significant operations and after unexpected conditions. Do not log inside tight loops at `info` level.

---
## Types and Builders

### Derive `Serialize` on all public domain types

All public structs and enums that represent configuration or state should derive `serde::Serialize`. Add `Deserialize` when round-trip serialization is needed.

### Builder pattern for complex configs

When a type has more than three fields or optional fields, provide a builder. The builder pattern allows named, incremental construction without positional arguments.

```rust
let config = VmConfig::builder("bootstrap")
    .cpu(4)
    .memory_gb(8)
    .disk(DiskConfig::new(50).labeled("os"))
    .disk(DiskConfig::new(100).labeled("data"))
    .network(NetworkRef::named("harmonylan"))
    .boot_order([BootDevice::Network, BootDevice::Disk])
    .build();
```

### Avoid `pub` fields on config structs

Expose data through methods or the builder, not raw field access. This preserves the ability to validate, rename, or change representation without breaking callers.
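A minimal sketch of such a builder follows; the field names are illustrative, not the real `VmConfig`. It shows the mandatory name taken at construction, defaults for everything else, and read access through a method rather than `pub` fields:

```rust
// Illustrative config type: fields are private, read access goes through methods.
#[derive(Debug, Clone)]
pub struct VmConfig {
    name: String,
    cpu: u32,
    memory_gb: u32,
}

pub struct VmConfigBuilder {
    name: String,
    cpu: u32,
    memory_gb: u32,
}

impl VmConfig {
    // Entry point takes the only mandatory field; the rest get defaults.
    pub fn builder(name: &str) -> VmConfigBuilder {
        VmConfigBuilder { name: name.to_string(), cpu: 1, memory_gb: 2 }
    }

    pub fn cpu(&self) -> u32 { self.cpu }
}

impl VmConfigBuilder {
    // Consuming setters keep call sites chainable.
    pub fn cpu(mut self, cpu: u32) -> Self { self.cpu = cpu; self }
    pub fn memory_gb(mut self, gb: u32) -> Self { self.memory_gb = gb; self }

    pub fn build(self) -> VmConfig {
        VmConfig { name: self.name, cpu: self.cpu, memory_gb: self.memory_gb }
    }
}

fn main() {
    let config = VmConfig::builder("bootstrap").cpu(4).memory_gb(8).build();
    println!("{config:?}");
}
```

Because the fields stay private, `build()` is also the natural place to add validation later without breaking any caller.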
---
## Async

### Use `tokio` for all async runtime needs

All async code runs on tokio. Use `tokio::spawn`, `tokio::time`, etc. Use `#[async_trait]` for traits with async methods.

### No blocking in async context

Never call blocking I/O (file I/O, network, process spawn) directly in an async function. Use `tokio::fs`, `tokio::process`, or `tokio::task::spawn_blocking` as appropriate.

---
## Module Structure

### Follow the `Score` / `Interpret` pattern

Modules that represent deployable infrastructure should implement `Score<T: Topology>` and `Interpret<T>`:

- `Score` is the serializable, clonable configuration declaring *what* to deploy
- `Interpret` does the actual work when `execute()` is called

```rust
pub struct KvmScore {
    network: NetworkConfig,
    vms: Vec<VmConfig>,
}

impl<T: Topology + KvmHost> Score<T> for KvmScore {
    fn create_interpret(&self) -> Box<dyn Interpret<T>> {
        Box::new(KvmInterpret::new(self.clone()))
    }
    fn name(&self) -> String { "KvmScore".to_string() }
}
```

### Flatten the public API in `mod.rs`

Internal submodules are an implementation detail. Re-export what callers need at the module root:

```rust
// modules/kvm/mod.rs
mod connection;
mod domain;
mod network;
mod error;
mod xml;

pub use connection::KvmConnection;
pub use domain::{VmConfig, VmConfigBuilder, VmStatus, DiskConfig, BootDevice};
pub use error::KvmError;
pub use network::NetworkConfig;
```

---
## Commit Style

Follow [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/):

```
feat(kvm): add network isolation support
fix(kvm): correct memory unit conversion for libvirt
refactor(kvm): replace virsh subprocess calls with virt crate bindings
docs: add coding guide
```

Keep pull requests small and single-purpose (under ~200 lines excluding generated code). Do not mix refactoring, bug fixes, and new features in one PR.

---
## When to Add Abstractions

Harmony provides powerful abstraction mechanisms: traits, generics, the Score/Interpret pattern, and capabilities. Use them judiciously.

### Add an abstraction when:

- **You have three or more concrete implementations** doing the same thing. Two is often coincidence; three is a pattern.
- **The abstraction provides compile-time safety** that prevents real bugs (e.g., capability bounds on topologies).
- **The abstraction hides genuine complexity** that callers shouldn't need to understand (e.g., XML schema generation for libvirt).

### Don't add an abstraction when:

- **It's just to avoid a few lines of boilerplate**. Copy-paste is sometimes better than a trait hierarchy.
- **You're anticipating future flexibility** that isn't needed today. YAGNI (You Aren't Gonna Need It).
- **The abstraction makes the code harder to understand** for someone unfamiliar with the codebase.
- **You're wrapping a single implementation**. A trait with one implementation is usually over-engineering.

### Signs you've over-abstracted:

- You need to explain the type system to a competent Rust developer for them to understand how to add a simple feature.
- Adding a new concrete type requires changes in multiple trait definitions.
- The word "factory" or "manager" appears in your type names.
- You have more trait definitions than concrete implementations.

### The Rule of Three for Traits

Before creating a new trait, ensure you have:

1. A clear, real use case (not hypothetical)
2. At least one concrete implementation
3. A plan for how callers will use it

Only generalize when the pattern is proven. The monitoring module is a good example: we had multiple alert senders (OKD, KubePrometheus, RHOB) before we introduced the `AlertSender` and `AlertReceiver<S>` traits. The traits emerged from real needs, not design sessions.

---
## Documentation

### Document the "why", not the "what"

Code should be self-explanatory for the "what". Comments and documentation should explain intent, rationale, and gotchas.

```rust
// Bad: restates the code
// Returns the number of VMs
fn vm_count(&self) -> usize { self.vms.len() }

// Good: explains the why
// Returns 0 if connection is lost, rather than erroring,
// because monitoring code uses this for health checks
fn vm_count(&self) -> usize { self.vms.len() }
```

### Keep examples in the `examples/` directory

Working code beats documentation. Every major feature should have a runnable example that demonstrates real usage.
@@ -3,12 +3,10 @@ use harmony::{
    modules::monitoring::{
        alert_channel::discord_alert_channel::DiscordReceiver,
        alert_rule::{
            alerts::{
                infra::opnsense::high_http_error_rate, k8s::pvc::high_pvc_fill_rate_over_two_days,
            },
            alerts::infra::opnsense::high_http_error_rate,
            prometheus_alert_rule::AlertManagerRuleGroup,
        },
        okd::openshift_cluster_alerting_score::OpenshiftClusterAlertScore,
        cluster_alerting::ClusterAlertingScore,
        scrape_target::prometheus_node_exporter::PrometheusNodeExporter,
    },
    topology::{
@@ -21,22 +19,37 @@ use harmony_macros::{hurl, ip};

#[tokio::main]
async fn main() {
    let platform_matcher = AlertMatcher {
        label: "prometheus".to_string(),
        operator: MatchOp::Eq,
        value: "openshift-monitoring/k8s".to_string(),
    };
    let severity = AlertMatcher {
        label: "severity".to_string(),
        operator: MatchOp::Eq,
        value: "critical".to_string(),
    let critical_receiver = DiscordReceiver {
        name: "critical-alerts".to_string(),
        url: hurl!("https://discord.example.com/webhook/critical"),
        route: AlertRoute {
            matchers: vec![AlertMatcher {
                label: "severity".to_string(),
                operator: MatchOp::Eq,
                value: "critical".to_string(),
            }],
            ..AlertRoute::default("critical-alerts".to_string())
        },
    };

    let high_http_error_rate = high_http_error_rate();
    let warning_receiver = DiscordReceiver {
        name: "warning-alerts".to_string(),
        url: hurl!("https://discord.example.com/webhook/warning"),
        route: AlertRoute {
            matchers: vec![AlertMatcher {
                label: "severity".to_string(),
                operator: MatchOp::Eq,
                value: "warning".to_string(),
            }],
            repeat_interval: Some("30m".to_string()),
            ..AlertRoute::default("warning-alerts".to_string())
        },
    };

    let additional_rules = AlertManagerRuleGroup::new("test-rule", vec![high_http_error_rate]);
    let additional_rules =
        AlertManagerRuleGroup::new("infra-alerts", vec![high_http_error_rate()]);

    let scrape_target = PrometheusNodeExporter {
    let firewall_scraper = PrometheusNodeExporter {
        job_name: "firewall".to_string(),
        metrics_path: "/metrics".to_string(),
        listen_address: ip!("192.168.1.1"),
@@ -44,22 +57,16 @@ async fn main() {
        ..Default::default()
    };

    let alerting_score = ClusterAlertingScore::new()
        .critical_receiver(Box::new(critical_receiver))
        .warning_receiver(Box::new(warning_receiver))
        .additional_rule(Box::new(additional_rules))
        .scrape_target(Box::new(firewall_scraper));

    harmony_cli::run(
        Inventory::autoload(),
        K8sAnywhereTopology::from_env(),
        vec![Box::new(OpenshiftClusterAlertScore {
            receivers: vec![Box::new(DiscordReceiver {
                name: "crit-wills-discord-channel-example".to_string(),
                url: hurl!("https://test.io"),
                route: AlertRoute {
                    matchers: vec![severity],
                    ..AlertRoute::default("crit-wills-discord-channel-example".to_string())
                },
            })],
            sender: harmony::modules::monitoring::okd::OpenshiftClusterAlertSender,
            rules: vec![Box::new(additional_rules)],
            scrape_targets: Some(vec![Box::new(scrape_target)]),
        })],
        vec![Box::new(alerting_score)],
        None,
    )
    .await
@@ -46,6 +46,14 @@ impl std::fmt::Debug for K8sClient {
}

impl K8sClient {
    pub fn inner_client(&self) -> &Client {
        &self.client
    }

    pub fn inner_client_clone(&self) -> Client {
        self.client.clone()
    }

    /// Create a client, reading `DRY_RUN` from the environment.
    pub fn new(client: Client) -> Self {
        Self {
harmony/src/modules/monitoring/cluster_alerting/cluster_alerting_score.rs (new file, 194 lines)

```rust
use serde::Serialize;

use crate::{
    interpret::Interpret,
    modules::monitoring::{
        alert_rule::{
            alerts::k8s::{
                deployment::alert_deployment_unavailable, memory_usage::alert_high_cpu_usage,
                memory_usage::alert_high_memory_usage, pod::alert_container_restarting,
                pod::alert_pod_not_ready, pod::pod_failed, pvc::high_pvc_fill_rate_over_two_days,
            },
            prometheus_alert_rule::AlertManagerRuleGroup,
        },
        okd::OpenshiftClusterAlertSender,
    },
    score::Score,
    topology::{
        monitoring::{
            AlertReceiver, AlertRoute, AlertRule, AlertingInterpret, MatchOp, Observability,
            ScrapeTarget,
        },
        Topology,
    },
};

#[derive(Debug, Clone)]
pub struct ClusterAlertingScore {
    pub critical_alerts_receiver: Option<Box<dyn AlertReceiver<OpenshiftClusterAlertSender>>>,
    pub warning_alerts_receiver: Option<Box<dyn AlertReceiver<OpenshiftClusterAlertSender>>>,
    pub additional_rules: Vec<Box<dyn AlertRule<OpenshiftClusterAlertSender>>>,
    pub scrape_targets: Option<Vec<Box<dyn ScrapeTarget<OpenshiftClusterAlertSender>>>>,
    pub include_default_rules: bool,
}

impl ClusterAlertingScore {
    pub fn new() -> Self {
        Self {
            critical_alerts_receiver: None,
            warning_alerts_receiver: None,
            additional_rules: vec![],
            scrape_targets: None,
            include_default_rules: true,
        }
    }

    pub fn critical_receiver(
        mut self,
        receiver: Box<dyn AlertReceiver<OpenshiftClusterAlertSender>>,
    ) -> Self {
        self.critical_alerts_receiver = Some(receiver);
        self
    }

    pub fn warning_receiver(
        mut self,
        receiver: Box<dyn AlertReceiver<OpenshiftClusterAlertSender>>,
    ) -> Self {
        self.warning_alerts_receiver = Some(receiver);
        self
    }

    pub fn additional_rule(
        mut self,
        rule: Box<dyn AlertRule<OpenshiftClusterAlertSender>>,
    ) -> Self {
        self.additional_rules.push(rule);
        self
    }

    pub fn scrape_target(
        mut self,
        target: Box<dyn ScrapeTarget<OpenshiftClusterAlertSender>>,
    ) -> Self {
        self.scrape_targets
            .get_or_insert_with(Vec::new)
            .push(target);
        self
    }

    pub fn with_default_rules(mut self, include: bool) -> Self {
        self.include_default_rules = include;
        self
    }

    fn build_default_rules(&self) -> Vec<Box<dyn AlertRule<OpenshiftClusterAlertSender>>> {
        if !self.include_default_rules {
            return vec![];
        }

        let critical_rules =
            AlertManagerRuleGroup::new("cluster-critical-alerts", vec![pod_failed()]);

        let warning_rules = AlertManagerRuleGroup::new(
            "cluster-warning-alerts",
            vec![
                alert_deployment_unavailable(),
                alert_container_restarting(),
                alert_pod_not_ready(),
                alert_high_memory_usage(),
                alert_high_cpu_usage(),
                high_pvc_fill_rate_over_two_days(),
            ],
        );

        vec![Box::new(critical_rules), Box::new(warning_rules)]
    }

    fn build_receivers(&self) -> Vec<Box<dyn AlertReceiver<OpenshiftClusterAlertSender>>> {
        let mut receivers = vec![];

        if let Some(ref critical_receiver) = self.critical_alerts_receiver {
            receivers.push(critical_receiver.clone());
        }

        if let Some(ref warning_receiver) = self.warning_alerts_receiver {
            receivers.push(warning_receiver.clone());
        }

        receivers
    }
}

impl Default for ClusterAlertingScore {
    fn default() -> Self {
        Self::new()
    }
}

impl<T: Topology + Observability<OpenshiftClusterAlertSender>> Score<T> for ClusterAlertingScore {
    fn name(&self) -> String {
        "ClusterAlertingScore".to_string()
    }

    fn create_interpret(&self) -> Box<dyn Interpret<T>> {
        let mut all_rules = self.build_default_rules();
        all_rules.extend(self.additional_rules.clone());

        let receivers = self.build_receivers();

        Box::new(AlertingInterpret {
            sender: OpenshiftClusterAlertSender,
            receivers,
            rules: all_rules,
            scrape_targets: self.scrape_targets.clone(),
        })
    }
}

impl Serialize for ClusterAlertingScore {
    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
    where
        S: serde::Serializer,
    {
        serde_json::json!({
            "name": "ClusterAlertingScore",
            "include_default_rules": self.include_default_rules,
            "has_critical_receiver": self.critical_alerts_receiver.is_some(),
            "has_warning_receiver": self.warning_alerts_receiver.is_some(),
            "additional_rules_count": self.additional_rules.len(),
            "scrape_targets_count": self.scrape_targets.as_ref().map(|t| t.len()).unwrap_or(0),
        })
        .serialize(serializer)
    }
}

pub fn critical_route() -> AlertRoute {
    AlertRoute {
        receiver: "critical".to_string(),
        matchers: vec![crate::topology::monitoring::AlertMatcher {
            label: "severity".to_string(),
            operator: MatchOp::Eq,
            value: "critical".to_string(),
        }],
        group_by: vec![],
        repeat_interval: Some("5m".to_string()),
        continue_matching: false,
        children: vec![],
    }
}

pub fn warning_route() -> AlertRoute {
    AlertRoute {
        receiver: "warning".to_string(),
        matchers: vec![crate::topology::monitoring::AlertMatcher {
            label: "severity".to_string(),
            operator: MatchOp::Eq,
            value: "warning".to_string(),
        }],
        group_by: vec![],
        repeat_interval: Some("30m".to_string()),
        continue_matching: false,
        children: vec![],
    }
}
```
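The `scrape_target` builder method above leans on `Option::get_or_insert_with` to lazily create the backing `Vec` on first use. A minimal std-only sketch of that pattern (the `ScoreBuilder` type and string targets here are illustrative stand-ins, not part of Harmony's API):

```rust
// Sketch of the lazily-initialized builder collection used by
// ClusterAlertingScore::scrape_target. Names are illustrative only.
#[derive(Debug, Default)]
struct ScoreBuilder {
    scrape_targets: Option<Vec<String>>,
}

impl ScoreBuilder {
    fn scrape_target(mut self, target: &str) -> Self {
        // Creates the Vec on the first call, then pushes into it.
        self.scrape_targets
            .get_or_insert_with(Vec::new)
            .push(target.to_string());
        self
    }

    fn target_count(&self) -> usize {
        // Mirrors the "scrape_targets_count" logic in the Serialize impl.
        self.scrape_targets.as_ref().map(|t| t.len()).unwrap_or(0)
    }
}

fn main() {
    let score = ScoreBuilder::default()
        .scrape_target("firewall")
        .scrape_target("router");
    assert_eq!(score.target_count(), 2);
    println!("targets: {}", score.target_count());
}
```

Compared with an eagerly-allocated `Vec`, the `Option` lets the serialized score distinguish "no scrape targets configured" from "an empty list".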
harmony/src/modules/monitoring/cluster_alerting/mod.rs (new file, 3 lines)

```rust
mod cluster_alerting_score;

pub use cluster_alerting_score::{critical_route, warning_route, ClusterAlertingScore};
```
harmony/src/modules/monitoring/mod.rs

```diff
@@ -1,6 +1,7 @@
 pub mod alert_channel;
 pub mod alert_rule;
 pub mod application_monitoring;
+pub mod cluster_alerting;
 pub mod grafana;
 pub mod kube_prometheus;
 pub mod ntfy;
```
harmony_e2e_tests/Cargo.toml (new file, 32 lines)

```toml
[package]
name = "harmony_e2e_tests"
version = "0.1.0"
edition = "2021"
description = "Harmony end-to-end test runner"
license = "Apache-2.0"
repository = "https://github.com/nationtech/harmony"
rust-version = "1.75.0"

[dependencies]
clap = { version = "4.4", features = ["derive"] }
chrono = { version = "0.4", features = ["serde"] }
env_logger = "0.11"
kube = { workspace = true }
k8s-openapi = { workspace = true }
log = "0.4"
serde = { workspace = true }
serde_json = { workspace = true }
thiserror = "2.0"
tokio = { workspace = true }
which = "7.0"
inventory = "0.3"
tempfile = { workspace = true }
k3d-rs = { path = "../k3d" }
harmony = { path = "../harmony" }
sqlx = { version = "0.8", features = ["runtime-tokio", "postgres", "tls-rustls"] }
tokio-stream = "0.1"
async-trait.workspace = true

[[bin]]
name = "harmony-e2e"
path = "src/main.rs"
```
harmony_e2e_tests/src/main.rs (new file, 68 lines)

```rust
mod test_harness;
mod tests;

use clap::{Parser, Subcommand};
use test_harness::find_tests;

#[derive(Parser)]
#[command(name = "harmony-e2e")]
#[command(about = "Harmony end-to-end test runner", long_about = None)]
struct Cli {
    #[command(subcommand)]
    command: Commands,

    #[arg(short, long, default_value = "info")]
    log_level: String,
}

#[derive(Subcommand)]
enum Commands {
    List {
        #[arg(short, long)]
        filter: Option<String>,
    },
    Run {
        #[arg(short, long)]
        filter: Option<String>,
    },
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let cli = Cli::parse();

    env_logger::Builder::from_env(env_logger::Env::default().default_filter_or(&cli.log_level))
        .init();

    match cli.command {
        Commands::List { filter } => {
            let tests = find_tests(filter.as_deref());
            if tests.is_empty() {
                println!("No tests found matching filter.");
            } else {
                println!("Available tests:");
                for test in tests {
                    println!("  {} - {}", test.name(), test.description());
                }
            }
        }
        Commands::Run { filter } => {
            let tests = find_tests(filter.as_deref());
            if tests.is_empty() {
                return Err("No tests found matching filter.".into());
            }

            log::info!("Running {} test(s)...", tests.len());
            for test in tests {
                log::info!("=== Running: {} ===", test.name());
                test.run()
                    .await
                    .map_err(|e| e as Box<dyn std::error::Error>)?;
                log::info!("=== Passed: {} ===", test.name());
            }
            log::info!("All tests passed!");
        }
    }

    Ok(())
}
```
harmony_e2e_tests/src/test_harness.rs (new file, 60 lines)

```rust
use async_trait::async_trait;
use thiserror::Error;

#[derive(Error, Debug)]
pub enum HarnessError {
    #[error("IO error: {0}")]
    Io(#[from] std::io::Error),
}

pub struct TestContext {
    pub test_name: String,
    pub namespace: String,
}

impl TestContext {
    pub fn new(test_name: &str) -> Result<Self, HarnessError> {
        let namespace = format!("harmony-test-{}", test_name);

        Ok(Self {
            test_name: test_name.to_string(),
            namespace,
        })
    }
}

#[async_trait]
pub trait Test: Sync {
    fn name(&self) -> &'static str;
    fn description(&self) -> &'static str;
    async fn run(&self) -> Result<(), Box<dyn std::error::Error + Send + Sync>>;
}

pub struct TestEntry {
    pub test: &'static dyn Test,
}

inventory::collect!(TestEntry);

#[macro_export]
macro_rules! register_test {
    ($test:expr) => {
        inventory::submit! {
            $crate::test_harness::TestEntry { test: $test }
        }
    };
}

pub fn all_tests() -> impl Iterator<Item = &'static dyn Test> {
    inventory::iter::<TestEntry>().map(|entry| entry.test)
}

pub fn find_tests(filter: Option<&str>) -> Vec<&'static dyn Test> {
    let filter = filter.map(|f| f.to_lowercase());
    all_tests()
        .filter(|t| match &filter {
            Some(f) => t.name().to_lowercase().contains(f),
            None => true,
        })
        .collect()
}
```
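`find_tests` above applies a case-insensitive substring filter over the registered test names. The same filtering logic can be sketched std-only, without the `inventory` registry, to show what `--filter` matches (`filter_names` and the sample names are illustrative):

```rust
// Std-only sketch of the case-insensitive substring filter in find_tests.
fn filter_names<'a>(names: &[&'a str], filter: Option<&str>) -> Vec<&'a str> {
    let filter = filter.map(|f| f.to_lowercase());
    names
        .iter()
        .filter(|n| match &filter {
            // A name matches when it contains the lowercased filter string.
            Some(f) => n.to_lowercase().contains(f),
            // No filter means every test is selected.
            None => true,
        })
        .copied()
        .collect()
}

fn main() {
    let all = ["k3d_cluster", "cnpg_postgres", "multicluster_postgres"];
    // "postgres" selects both PostgreSQL tests.
    assert_eq!(filter_names(&all, Some("postgres")).len(), 2);
    // No filter selects all registered tests.
    assert_eq!(filter_names(&all, None).len(), 3);
    println!("matched: {:?}", filter_names(&all, Some("k3d")));
}
```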
harmony_e2e_tests/src/tests/cnpg_postgres.rs (new file, 206 lines)

```rust
use crate::register_test;
use crate::test_harness::{HarnessError, Test, TestContext};
use async_trait::async_trait;
use harmony::{
    inventory::Inventory,
    modules::postgresql::{capability::PostgreSQLConfig, PostgreSQLScore},
    score::Score,
    topology::{K8sAnywhereTopology, K8sclient, Topology},
};
use k8s_openapi::api::core::v1::Pod;
use kube::api::{Api, ListParams};
use log::{info, warn};
use std::time::Duration;
use thiserror::Error;

#[derive(Error, Debug)]
pub enum PostgresTestError {
    #[error("Failed to create test context: {0}")]
    ContextCreation(#[from] HarnessError),

    #[error("Failed to initialize topology: {0}")]
    TopologyInit(String),

    #[error("Failed to interpret postgresql score: {0}")]
    InterpretError(String),

    #[error("Failed to get k8s client: {0}")]
    K8sClient(String),

    #[error("PostgreSQL deployment timed out after {timeout_seconds}s in namespace {namespace}")]
    DeploymentTimeout {
        namespace: String,
        timeout_seconds: u64,
    },

    #[error("PostgreSQL connection verification failed: {0}")]
    ConnectionVerification(String),

    #[error("SQL query failed: {0}")]
    SqlQueryFailed(String),
}

pub struct CnpgPostgresTest;

impl CnpgPostgresTest {
    pub const INSTANCE: Self = CnpgPostgresTest;
}

#[async_trait]
impl Test for CnpgPostgresTest {
    fn name(&self) -> &'static str {
        "cnpg_postgres"
    }

    fn description(&self) -> &'static str {
        "CNPG PostgreSQL deployment using Harmony's PostgreSQL module"
    }

    async fn run(&self) -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
        run_impl()
            .await
            .map_err(|e| Box::new(e) as Box<dyn std::error::Error + Send + Sync>)
    }
}

register_test!(&CnpgPostgresTest::INSTANCE);

async fn run_impl() -> Result<(), PostgresTestError> {
    let ctx = TestContext::new("cnpg-postgres")?;

    info!("=== Test: CNPG PostgreSQL deployment ===");

    info!("Step 1: Initializing K8sAnywhereTopology...");
    let topology = K8sAnywhereTopology::from_env();

    info!("Step 2: Ensuring topology is ready...");
    topology
        .ensure_ready()
        .await
        .map_err(|e| PostgresTestError::TopologyInit(e.to_string()))?;

    info!("Step 3: Creating PostgreSQL deployment score...");
    let pg_score = PostgreSQLScore {
        config: PostgreSQLConfig {
            cluster_name: format!("{}-pg", ctx.test_name),
            namespace: ctx.namespace.clone(),
            instances: 1,
            ..Default::default()
        },
    };

    info!("Step 4: Deploying PostgreSQL using Harmony's PostgreSQL module...");
    let outcome = pg_score
        .interpret(&Inventory::empty(), &topology)
        .await
        .map_err(|e| PostgresTestError::InterpretError(e.to_string()))?;

    info!("Deployment outcome: {}", outcome.message);

    info!("Step 5: Waiting for PostgreSQL cluster to be ready...");
    let cluster_name = &pg_score.config.cluster_name;
    wait_for_postgres_ready(&topology, &ctx.namespace, cluster_name, 300).await?;

    info!("Step 6: Verifying PostgreSQL is working with a SQL query...");
    let result = verify_postgres_connection(&ctx.namespace, cluster_name).await?;

    info!("Query result: {}", result);
    assert!(result.contains("1"), "Expected query to return 1");

    info!("=== Test PASSED: CNPG PostgreSQL deployment ===\n");

    Ok(())
}

async fn wait_for_postgres_ready(
    topology: &K8sAnywhereTopology,
    namespace: &str,
    cluster_name: &str,
    timeout_seconds: u64,
) -> Result<String, PostgresTestError> {
    let client = topology
        .k8s_client()
        .await
        .map_err(PostgresTestError::K8sClient)?;

    let pods: Api<Pod> = Api::namespaced(client.inner_client_clone(), namespace);
    let label_selector = format!("postgresql.cnpg.io/cluster={}", cluster_name);

    let deadline = tokio::time::Instant::now() + Duration::from_secs(timeout_seconds);
    let mut interval = tokio::time::interval(Duration::from_secs(5));

    loop {
        interval.tick().await;

        if tokio::time::Instant::now() > deadline {
            return Err(PostgresTestError::DeploymentTimeout {
                namespace: namespace.to_string(),
                timeout_seconds,
            });
        }

        let pod_list = pods
            .list(&ListParams::default().labels(&label_selector))
            .await
            .map_err(|e| PostgresTestError::K8sClient(e.to_string()))?;

        for pod in pod_list.items {
            if let Some(status) = &pod.status {
                if status.phase.as_deref() == Some("Running") {
                    if let Some(conditions) = &status.conditions {
                        if conditions
                            .iter()
                            .any(|c| c.type_ == "Ready" && c.status == "True")
                        {
                            let pod_name = pod.metadata.name.clone().unwrap_or_default();
                            info!("PostgreSQL pod '{}' is ready", pod_name);
                            return Ok(pod_name);
                        }
                    }
                }
            }
        }

        warn!(
            "Waiting for PostgreSQL pod with label '{}' to be ready...",
            label_selector
        );
    }
}

async fn verify_postgres_connection(
    namespace: &str,
    cluster_name: &str,
) -> Result<String, PostgresTestError> {
    let pod_name = format!("{}-1", cluster_name);

    let mut cmd = tokio::process::Command::new("kubectl");
    cmd.args([
        "exec",
        "-n",
        namespace,
        &pod_name,
        "--",
        "psql",
        "-U",
        "app",
        "-d",
        "app",
        "-t",
        "-c",
        "SELECT 1 AS test;",
    ]);

    let output = cmd
        .output()
        .await
        .map_err(|e| PostgresTestError::ConnectionVerification(e.to_string()))?;

    if !output.status.success() {
        return Err(PostgresTestError::SqlQueryFailed(
            String::from_utf8_lossy(&output.stderr).to_string(),
        ));
    }

    Ok(String::from_utf8_lossy(&output.stdout).trim().to_string())
}
```
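`wait_for_postgres_ready` above is built around a deadline-plus-interval polling loop. The same control flow can be shown in a std-only, synchronous sketch, with `check` standing in for the Kubernetes pod query (`poll_until` is an illustrative helper, not a Harmony API):

```rust
use std::time::{Duration, Instant};

// Std-only sketch of the deadline/poll loop in wait_for_postgres_ready.
// `check` stands in for "list pods and look for a Ready one".
fn poll_until<F: FnMut() -> bool>(mut check: F, timeout: Duration, interval: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if check() {
            return true; // the condition was met before the deadline
        }
        if Instant::now() + interval > deadline {
            return false; // would map to PostgresTestError::DeploymentTimeout
        }
        std::thread::sleep(interval);
    }
}

fn main() {
    // Simulate a pod that becomes ready on the third poll.
    let mut attempts = 0;
    let ready = poll_until(
        || {
            attempts += 1;
            attempts >= 3
        },
        Duration::from_secs(1),
        Duration::from_millis(10),
    );
    assert!(ready);
    println!("ready after {} attempts", attempts);
}
```

The real implementation uses `tokio::time::interval` so the wait does not block the async runtime, but the deadline arithmetic is identical.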
harmony_e2e_tests/src/tests/k3d_cluster.rs (new file, 147 lines)

```rust
use crate::register_test;
use crate::test_harness::{HarnessError, Test, TestContext};
use async_trait::async_trait;
use harmony::topology::{K8sAnywhereTopology, K8sclient, Topology};
use k8s_openapi::api::core::v1::Node;
use kube::api::{Api, ListParams};
use log::info;
use thiserror::Error;

#[derive(Error, Debug)]
pub enum K3dTestError {
    #[error("Failed to create test context: {0}")]
    ContextCreation(#[from] HarnessError),

    #[error("Failed to initialize topology: {0}")]
    TopologyInit(String),

    #[error("Failed to get k8s client: {0}")]
    K8sClient(String),

    #[error("Cluster validation failed: expected {expected_nodes} nodes, found {nodes_count}")]
    ClusterValidation {
        nodes_count: usize,
        expected_nodes: usize,
    },

    #[error("Node {node_name} is not ready")]
    NodeNotReady { node_name: String },

    #[error("No nodes found in cluster")]
    NoNodesFound,
}

pub struct K3dClusterTest;

impl K3dClusterTest {
    pub const INSTANCE: Self = K3dClusterTest;
}

#[async_trait]
impl Test for K3dClusterTest {
    fn name(&self) -> &'static str {
        "k3d_cluster"
    }

    fn description(&self) -> &'static str {
        "k3d cluster creation with Harmony modules"
    }

    async fn run(&self) -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
        run_impl()
            .await
            .map_err(|e| Box::new(e) as Box<dyn std::error::Error + Send + Sync>)
    }
}

register_test!(&K3dClusterTest::INSTANCE);

async fn run_impl() -> Result<(), K3dTestError> {
    info!("=== Test: k3d cluster creation with Harmony modules ===");

    let _ctx = TestContext::new("k3d-cluster")?;

    info!("Step 1: Initializing K8sAnywhereTopology...");
    let topology = K8sAnywhereTopology::from_env();

    info!("Step 2: Ensuring topology is ready (this installs k3d if needed)...");
    topology
        .ensure_ready()
        .await
        .map_err(|e| K3dTestError::TopologyInit(e.to_string()))?;

    info!("Step 3: Validating cluster is operational...");
    validate_cluster(&topology).await?;

    info!("Step 4: Verifying all nodes are ready...");
    verify_nodes_ready(&topology).await?;

    info!("=== Test PASSED: k3d cluster creation ===");
    Ok(())
}

async fn validate_cluster(topology: &K8sAnywhereTopology) -> Result<(), K3dTestError> {
    let client = topology
        .k8s_client()
        .await
        .map_err(K3dTestError::K8sClient)?;

    let nodes: Api<Node> = Api::all(client.inner_client_clone());
    let node_list = nodes
        .list(&ListParams::default())
        .await
        .map_err(|e| K3dTestError::K8sClient(e.to_string()))?;

    let nodes_count = node_list.items.len();

    if nodes_count == 0 {
        return Err(K3dTestError::NoNodesFound);
    }

    info!("Found {} node(s) in cluster", nodes_count);

    for node in &node_list.items {
        let node_name = node.metadata.name.as_deref().unwrap_or("unknown");
        info!("  - Node: {}", node_name);
    }

    if nodes_count < 1 {
        return Err(K3dTestError::ClusterValidation {
            nodes_count,
            expected_nodes: 1,
        });
    }

    Ok(())
}

async fn verify_nodes_ready(topology: &K8sAnywhereTopology) -> Result<(), K3dTestError> {
    let client = topology
        .k8s_client()
        .await
        .map_err(K3dTestError::K8sClient)?;

    let nodes: Api<Node> = Api::all(client.inner_client_clone());
    let node_list = nodes
        .list(&ListParams::default())
        .await
        .map_err(|e| K3dTestError::K8sClient(e.to_string()))?;

    for node in node_list.items {
        let node_name = node.metadata.name.clone().unwrap_or_default();

        let conditions = node.status.and_then(|s| s.conditions).unwrap_or_default();

        let ready = conditions
            .iter()
            .any(|c| c.type_ == "Ready" && c.status == "True");

        if !ready {
            return Err(K3dTestError::NodeNotReady { node_name });
        }

        info!("Node '{}' is Ready", node_name);
    }

    Ok(())
}
```
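Both `verify_nodes_ready` above and the pod wait in the CNPG test decide readiness the same way: the object is ready when its condition list contains a condition of type `Ready` with status `True`. A std-only sketch of that check, with a local `Condition` struct standing in for the k8s-openapi type:

```rust
// Std-only sketch of the Kubernetes readiness-condition check used in
// verify_nodes_ready. `Condition` is a stand-in for the k8s-openapi type.
struct Condition {
    type_: String,
    status: String,
}

fn is_ready(conditions: &[Condition]) -> bool {
    // A node (or pod) is ready iff some condition is `Ready == True`.
    conditions
        .iter()
        .any(|c| c.type_ == "Ready" && c.status == "True")
}

fn main() {
    let conds = vec![
        Condition { type_: "MemoryPressure".into(), status: "False".into() },
        Condition { type_: "Ready".into(), status: "True".into() },
    ];
    assert!(is_ready(&conds));
    // An empty condition list (e.g. a node still registering) is not ready.
    assert!(!is_ready(&[]));
    println!("ready: {}", is_ready(&conds));
}
```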
harmony_e2e_tests/src/tests/mod.rs (new file, 3 lines)

```rust
pub mod cnpg_postgres;
pub mod k3d_cluster;
pub mod multicluster_postgres;
```
harmony_e2e_tests/src/tests/multicluster_postgres.rs (new file, 54 lines)

```rust
use crate::register_test;
use crate::test_harness::{HarnessError, Test, TestContext};
use async_trait::async_trait;
use log::info;
use thiserror::Error;

#[derive(Error, Debug)]
pub enum MulticlusterPostgresTestError {
    #[error("Failed to create test context: {0}")]
    ContextCreation(#[from] HarnessError),
}

pub struct MulticlusterPostgresTest;

impl MulticlusterPostgresTest {
    pub const INSTANCE: Self = MulticlusterPostgresTest;
}

#[async_trait]
impl Test for MulticlusterPostgresTest {
    fn name(&self) -> &'static str {
        "multicluster_postgres"
    }

    fn description(&self) -> &'static str {
        "Multi-cluster PostgreSQL with failover"
    }

    async fn run(&self) -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
        run_impl()
            .await
            .map_err(|e| Box::new(e) as Box<dyn std::error::Error + Send + Sync>)
    }
}

register_test!(&MulticlusterPostgresTest::INSTANCE);

async fn run_impl() -> Result<(), MulticlusterPostgresTestError> {
    let _ctx = TestContext::new("multicluster-postgres")?;

    info!("=== Test: Multi-cluster PostgreSQL with failover ===");
    info!("This test is not yet fully implemented.");
    info!("It will:");
    info!("  1. Create two k3d clusters (primary and replica)");
    info!("  2. Deploy CNPG operator on both clusters");
    info!("  3. Deploy primary PostgreSQL with LoadBalancer service");
    info!("  4. Extract replication certificates from primary");
    info!("  5. Deploy replica PostgreSQL configured to replicate from primary");
    info!("  6. Insert test data on primary");
    info!("  7. Verify data is replicated to replica");
    info!("=== Test SKIPPED: Multi-cluster PostgreSQL (not implemented) ===\n");

    Ok(())
}
```
infrastructure.rs (new empty file, 0 lines)