# Phase 6: E2E Tests for OKD HA Cluster on KVM

## Goal

Prove the full OKD bare-metal installation flow works end-to-end using KVM virtual machines. This is the ultimate validation of Harmony's core value proposition: declare an OKD cluster, point it at infrastructure, watch it materialize.

## Prerequisites

- Phase 5 complete (test harness exists, k3d tests passing)
- `feature/kvm-module` merged to main
- A CI runner with libvirt/KVM access and nested virtualization support

## Architecture

The KVM branch already has a `kvm_okd_ha_cluster` example that creates:

```
                        Host bridge (WAN)
                              |
                    +--------------------+
                    |  OPNsense          |  192.168.100.1
                    |  gateway + PXE     |
                    +--------+-----------+
                             |
                  harmonylan (192.168.100.0/24)
     +---------+---------+---+-----+---------+---------+
     |         |         |         |         |         |
 +---+---+ +---+---+ +---+---+ +---+---+ +---+---+ +---+---+
 | cp0   | | cp1   | | cp2   | |worker0| |worker1| |worker2|
 | .10   | | .11   | | .12   | | .20   | | .21   | | .22   |
 +-------+ +-------+ +-------+ +-------+ +-------+ +-------+
```

The test needs to orchestrate this entire setup, wait for OKD to converge, and assert the cluster is healthy.
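To make the layout concrete for assertions, the topology in the diagram can be written down as plain data. The following is a minimal sketch, not the `kvm_okd_ha_cluster` example's actual types: `Role`, `Node`, `harmonylan_topology`, and the gateway VM name `opnsense` are hypothetical names introduced here for illustration. Only the node names and addresses come from the diagram.

```rust
use std::net::Ipv4Addr;

/// Role of a VM in the test topology (hypothetical type, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    Gateway,
    ControlPlane,
    Worker,
}

/// One VM on the harmonylan network.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Node {
    pub name: String,
    pub role: Role,
    pub ip: Ipv4Addr,
}

/// Builds the seven-VM layout from the diagram on 192.168.100.0/24:
/// one gateway (.1), three control planes (.10-.12), three workers (.20-.22).
pub fn harmonylan_topology() -> Vec<Node> {
    let mut nodes = vec![Node {
        name: "opnsense".to_string(), // assumed VM name for the gateway
        role: Role::Gateway,
        ip: Ipv4Addr::new(192, 168, 100, 1),
    }];
    for i in 0..3u8 {
        nodes.push(Node {
            name: format!("cp{}", i),
            role: Role::ControlPlane,
            ip: Ipv4Addr::new(192, 168, 100, 10 + i),
        });
    }
    for i in 0..3u8 {
        nodes.push(Node {
            name: format!("worker{}", i),
            role: Role::Worker,
            ip: Ipv4Addr::new(192, 168, 100, 20 + i),
        });
    }
    nodes
}
```

A test could iterate over this list both to create the VMs and to check, after convergence, that every expected node has joined the cluster.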

## Tasks

### 6.1 Start with `example_linux_vm` — the simplest KVM test

Before tackling the full OKD stack, validate the KVM module itself with the simplest possible test:

```rust
// tests/e2e/tests/kvm_linux_vm.rs

#[tokio::test]
#[ignore] // Requires libvirt access — run with: cargo test -- --ignored
async fn test_linux_vm_boots_from_iso() {
    let executor = KvmExecutor::from_env().unwrap();

    // Create isolated network
    let network = NetworkConfig {
        name: "e2e-test-net".to_string(),
        bridge: "virbr200".to_string(),
        // ...
    };
    executor.ensure_network(&network).await.unwrap();

    // Define and start VM
    let vm_config = VmConfig::builder("e2e-linux-test")
        .vcpus(1)
        .memory_gb(1)
        .disk(5)
        .network(NetworkRef::named("e2e-test-net"))
        .cdrom("https://releases.ubuntu.com/24.04/ubuntu-24.04-live-server-amd64.iso")
        .boot_order([BootDevice::Cdrom, BootDevice::Disk])
        .build();

    executor.ensure_vm(&vm_config).await.unwrap();
    executor.start_vm("e2e-linux-test").await.unwrap();

    // Assert VM is running
    let status = executor.vm_status("e2e-linux-test").await.unwrap();
    assert_eq!(status, VmStatus::Running);

    // Cleanup (skipped if anything above panics; see 6.5 for a Drop-based fixture)
    executor.destroy_vm("e2e-linux-test").await.unwrap();
    executor.undefine_vm("e2e-linux-test").await.unwrap();
    executor.delete_network("e2e-test-net").await.unwrap();
}
```

This test validates:
- ISO download works (via `harmony_assets` if refactored, or built-in KVM module download)
- libvirt XML generation is correct
- VM lifecycle (define → start → status → destroy → undefine)
- Network creation/deletion
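As an illustration of what "XML generation is correct" could mean for the network part, the sketch below renders a minimal libvirt network definition and is small enough to unit-test without libvirt. It is not the KVM module's generator: `render_network_xml` and `escape_xml` are hypothetical helpers, and the real module presumably emits more elements (addresses, DHCP ranges).

```rust
/// Escapes the characters that are significant in XML text and attributes.
fn escape_xml(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '\'' => out.push_str("&apos;"),
            '"' => out.push_str("&quot;"),
            other => out.push(other),
        }
    }
    out
}

/// Renders a minimal libvirt network definition.
/// Leaving out a `<forward>` element keeps the network isolated from the host uplink.
pub fn render_network_xml(name: &str, bridge: &str) -> String {
    format!(
        "<network>\n  <name>{}</name>\n  <bridge name='{}'/>\n</network>\n",
        escape_xml(name),
        escape_xml(bridge)
    )
}
```

Checking the rendered string in a plain unit test keeps XML regressions out of the slow, libvirt-dependent test.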

### 6.2 OKD HA Cluster E2E test

This is the full integration test. It is long-running (30-60 minutes), so it should run only nightly or on demand.

```rust
// tests/e2e/tests/kvm_okd_ha.rs

#[tokio::test]
#[ignore] // Requires KVM + significant resources. Run nightly.
async fn test_okd_ha_cluster_on_kvm() {
    // 1. Create virtual infrastructure
    //    - OPNsense gateway VM
    //    - 3 control plane VMs
    //    - 3 worker VMs
    //    - Virtual network (harmonylan)

    // 2. Run OKD installation scores
    //    (the kvm_okd_ha_cluster example, but as a test)

    // 3. Wait for OKD API server to become reachable
    //    - Poll https://api.okd.harmonylan:6443 until it responds
    //    - Timeout: 30 minutes

    // 4. Assert cluster health
    //    - All nodes in Ready state
    //    - ClusterVersion reports Available=True
    //    - Sample workload (nginx) deploys and pod reaches Running

    // 5. Cleanup
    //    - Destroy all VMs
    //    - Delete virtual networks
    //    - Clean up disk images
}
```
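Step 3 is a poll-with-deadline loop. A minimal synchronous sketch using only the standard library is shown below; `poll_until` and `tcp_reachable` are hypothetical helpers, and the real test would more likely use an async equivalent on the tokio runtime. A successful TCP connect is also a weaker signal than an HTTPS response from the API server, so treat it as a first gate only.

```rust
use std::net::{SocketAddr, TcpStream};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Calls `probe` until it returns `Some(value)` or `timeout` elapses.
/// On timeout, returns `Err` with the number of attempts that were made.
pub fn poll_until<T, F>(timeout: Duration, interval: Duration, mut probe: F) -> Result<T, u32>
where
    F: FnMut() -> Option<T>,
{
    let deadline = Instant::now() + timeout;
    let mut attempts = 0u32;
    loop {
        attempts += 1;
        if let Some(value) = probe() {
            return Ok(value);
        }
        if Instant::now() >= deadline {
            return Err(attempts);
        }
        sleep(interval);
    }
}

/// Returns true if a TCP connection to `addr` succeeds within `connect_timeout`.
pub fn tcp_reachable(addr: &SocketAddr, connect_timeout: Duration) -> bool {
    TcpStream::connect_timeout(addr, connect_timeout).is_ok()
}
```

For the API server, the probe would resolve `api.okd.harmonylan:6443` to a `SocketAddr` and call `tcp_reachable` inside `poll_until` with a 30-minute timeout. The same helper can wrap the health checks in step 4, since nodes and cluster operators also converge gradually.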

### 6.3 CI runner requirements

The KVM E2E test needs a runner with:

- **Hardware**: 32GB+ RAM, 8+ CPU cores, 100GB+ disk
- **Software**: libvirt, QEMU/KVM, `virsh`, nested virtualization enabled
- **Network**: Outbound internet access (to download ISOs, OKD images)
- **Permissions**: User in `libvirt` group, or root access
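A preflight check can fail fast when a runner does not meet the software requirements above, instead of timing out half an hour into the OKD install. The sketch below assumes a Linux host and reads the standard `nested` parameter of the `kvm_intel` and `kvm_amd` kernel modules. The function names are hypothetical.

```rust
use std::fs;
use std::path::Path;

/// Interprets the contents of a KVM `nested` module parameter file.
/// Depending on the module and kernel version the file holds `Y`/`N` or `1`/`0`.
pub fn nested_flag_enabled(contents: &str) -> bool {
    matches!(contents.trim(), "Y" | "y" | "1")
}

/// Returns true if the KVM device node exists on this host.
pub fn kvm_device_present() -> bool {
    Path::new("/dev/kvm").exists()
}

/// Returns true if either the Intel or the AMD KVM module reports nested
/// virtualization as enabled. A missing file counts as "not enabled".
pub fn nested_virt_enabled() -> bool {
    [
        "/sys/module/kvm_intel/parameters/nested",
        "/sys/module/kvm_amd/parameters/nested",
    ]
    .iter()
    .filter_map(|path| fs::read_to_string(path).ok())
    .any(|contents| nested_flag_enabled(&contents))
}
```

Calling `kvm_device_present()` and `nested_virt_enabled()` at the start of each KVM test would turn a misconfigured runner into an immediate, readable failure.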

Options:
- **Dedicated bare-metal machine** registered as a self-hosted GitHub Actions runner
- **Cloud VM with nested virt** (e.g., GCP n2-standard-8 with `--enable-nested-virtualization`)
- **Manual trigger only** — developer runs locally, CI just tracks pass/fail

### 6.4 Nightly CI job

```yaml
# .github/workflows/e2e-kvm.yml
name: E2E KVM Tests
on:
  schedule:
    - cron: '0 2 * * *'  # 2 AM daily
  workflow_dispatch:       # Manual trigger

jobs:
  kvm-tests:
    runs-on: [self-hosted, kvm]  # Label for KVM-capable runners
    timeout-minutes: 90
    steps:
      - uses: actions/checkout@v4

      - name: Run KVM E2E tests
        run: cargo test -p harmony-e2e-tests -- --ignored --test-threads=1
        env:
          RUST_LOG: info
          HARMONY_KVM_URI: qemu:///system

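      - name: Cleanup leaked VMs and networks
        if: always()  # also runs when earlier steps fail or the job is cancelled
        env:
          LIBVIRT_DEFAULT_URI: qemu:///system  # same libvirt instance the tests use
        run: |
          virsh list --all --name | grep e2e | xargs -I {} virsh destroy {} || true
          virsh list --all --name | grep e2e | xargs -I {} virsh undefine {} --remove-all-storage || true
          virsh net-list --all --name | grep e2e | xargs -I {} virsh net-destroy {} || true
          virsh net-list --all --name | grep e2e | xargs -I {} virsh net-undefine {} || true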
```
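The same cleanup logic could also live in a small Rust helper so that developers can run it locally. This is one possible shape for a force-cleanup tool, not an existing one. `leaked_names` and `force_cleanup` are hypothetical, and the sketch assumes every test resource carries a common marker such as `e2e` in its name, as the `grep e2e` in the workflow does.

```rust
use std::process::Command;

/// Extracts resource names containing `marker` from the output of
/// `virsh list --all --name` or `virsh net-list --all --name`.
pub fn leaked_names(virsh_output: &str, marker: &str) -> Vec<String> {
    virsh_output
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && line.contains(marker))
        .map(str::to_string)
        .collect()
}

/// Runs `virsh -c <uri> <args...>` and returns stdout, or an empty string on error.
fn virsh_stdout(uri: &str, args: &[&str]) -> String {
    Command::new("virsh")
        .arg("-c")
        .arg(uri)
        .args(args)
        .output()
        .map(|out| String::from_utf8_lossy(&out.stdout).into_owned())
        .unwrap_or_default()
}

/// Best-effort removal of every VM and network whose name contains `marker`.
pub fn force_cleanup(uri: &str, marker: &str) {
    for vm in leaked_names(&virsh_stdout(uri, &["list", "--all", "--name"]), marker) {
        let _ = virsh_stdout(uri, &["destroy", vm.as_str()]);
        let _ = virsh_stdout(uri, &["undefine", vm.as_str(), "--remove-all-storage"]);
    }
    for net in leaked_names(&virsh_stdout(uri, &["net-list", "--all", "--name"]), marker) {
        let _ = virsh_stdout(uri, &["net-destroy", net.as_str()]);
        let _ = virsh_stdout(uri, &["net-undefine", net.as_str()]);
    }
}
```

Keeping the name filtering in a pure function makes the risky part, deciding what gets deleted, testable without touching libvirt.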

### 6.5 Test resource management

KVM tests create real resources that must be cleaned up even on failure. Implement a test fixture pattern:

```rust
struct KvmTestFixture {
    executor: KvmExecutor,
    vms: Vec<String>,
    networks: Vec<String>,
}

impl KvmTestFixture {
    fn track_vm(&mut self, name: &str) { self.vms.push(name.to_string()); }
    fn track_network(&mut self, name: &str) { self.networks.push(name.to_string()); }
}

impl Drop for KvmTestFixture {
    fn drop(&mut self) {
        // Target the same libvirt instance as the tests. Without `-c`, virsh run as a
        // non-root user defaults to qemu:///session and would not see these resources.
        let uri = std::env::var("HARMONY_KVM_URI")
            .unwrap_or_else(|_| "qemu:///system".to_string());
        // Best-effort: ignore failures so one missing resource does not stop the rest.
        let virsh = |args: &[&str]| {
            let _ = std::process::Command::new("virsh")
                .arg("-c").arg(&uri).args(args).output();
        };
        for vm in &self.vms {
            virsh(&["destroy", vm.as_str()]);
            virsh(&["undefine", vm.as_str(), "--remove-all-storage"]);
        }
        for net in &self.networks {
            virsh(&["net-destroy", net.as_str()]);
            virsh(&["net-undefine", net.as_str()]);
        }
    }
}
```
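The fixture relies on Rust running `Drop` while a panic unwinds, which is what happens when an `assert!` fails inside a test. The sketch below demonstrates that behavior with a stand-in type that records cleanup instead of calling `virsh`. `RecordingFixture` and `cleanup_after_panic` are hypothetical names used only for this demonstration.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::sync::{Arc, Mutex};

/// Stand-in for `KvmTestFixture`: records cleaned-up names instead of calling virsh.
struct RecordingFixture {
    vms: Vec<String>,
    cleaned: Arc<Mutex<Vec<String>>>,
}

impl Drop for RecordingFixture {
    fn drop(&mut self) {
        // Runs on normal scope exit and while a panic unwinds the stack.
        let mut cleaned = self.cleaned.lock().unwrap_or_else(|p| p.into_inner());
        cleaned.extend(self.vms.drain(..));
    }
}

/// Simulates a test body that tracks a VM and then fails an assertion.
/// Returns the names that were cleaned up despite the panic.
pub fn cleanup_after_panic() -> Vec<String> {
    let cleaned = Arc::new(Mutex::new(Vec::new()));
    let sink = Arc::clone(&cleaned);
    let result = catch_unwind(AssertUnwindSafe(move || {
        let mut fixture = RecordingFixture { vms: Vec::new(), cleaned: sink };
        fixture.vms.push("e2e-linux-test".to_string());
        panic!("simulated assertion failure");
    }));
    assert!(result.is_err(), "the simulated test body should have panicked");
    let names = cleaned.lock().unwrap_or_else(|p| p.into_inner()).clone();
    names
}
```

`Drop` does not run if the process is killed (for example when CI hits `timeout-minutes`) or if the crate is built with `panic = "abort"`, so the CI cleanup step in 6.4 remains necessary as a second line of defense.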

## Deliverables

- [ ] `test_linux_vm_boots_from_iso` — passing KVM smoke test
- [ ] `test_okd_ha_cluster_on_kvm` — full OKD installation test
- [ ] `KvmTestFixture` with resource cleanup on test failure
- [ ] Nightly CI job on KVM-capable runner
- [ ] Force-cleanup script for leaked VMs/networks
- [ ] Documentation: how to set up a KVM runner for E2E tests