Files
harmony/ROADMAP/06-e2e-tests-kvm.md

215 lines
7.2 KiB
Markdown

# Phase 6: E2E Tests for OKD HA Cluster on KVM
## Goal
Prove the full OKD bare-metal installation flow works end-to-end using KVM virtual machines. This is the ultimate validation of Harmony's core value proposition: declare an OKD cluster, point it at infrastructure, watch it materialize.
## Prerequisites
- Phase 5 complete (test harness exists, k3d tests passing)
- `feature/kvm-module` merged to main
- A CI runner with libvirt/KVM access and nested virtualization support
## Architecture
The KVM branch already has a `kvm_okd_ha_cluster` example that creates:
```
Host bridge (WAN)
|
+--------------------+
| OPNsense | 192.168.100.1
| gateway + PXE |
+--------+-----------+
|
harmonylan (192.168.100.0/24)
+---------+---------+---------+---------+
| | | | |
+----+---+ +---+---+ +---+---+ +---+---+ +--+----+
| cp0 | | cp1 | | cp2 | |worker0| |worker1|
| .10 | | .11 | | .12 | | .20 | | .21 |
+--------+ +-------+ +-------+ +-------+ +---+---+
|
+-----+----+
| worker2 |
| .22 |
+----------+
```
The test needs to orchestrate this entire setup, wait for OKD to converge, and assert the cluster is healthy.
## Tasks
### 6.1 Start with `example_linux_vm` — the simplest KVM test
Before tackling the full OKD stack, validate the KVM module itself with the simplest possible test:
```rust
// tests/e2e/tests/kvm_linux_vm.rs
#[tokio::test]
#[ignore] // Requires libvirt access — run with: cargo test -- --ignored
async fn test_linux_vm_boots_from_iso() {
let executor = KvmExecutor::from_env().unwrap();
// Create isolated network
let network = NetworkConfig {
name: "e2e-test-net".to_string(),
bridge: "virbr200".to_string(),
// ...
};
executor.ensure_network(&network).await.unwrap();
// Define and start VM
let vm_config = VmConfig::builder("e2e-linux-test")
.vcpus(1)
.memory_gb(1)
.disk(5)
.network(NetworkRef::named("e2e-test-net"))
.cdrom("https://releases.ubuntu.com/24.04/ubuntu-24.04-live-server-amd64.iso")
.boot_order([BootDevice::Cdrom, BootDevice::Disk])
.build();
executor.ensure_vm(&vm_config).await.unwrap();
executor.start_vm("e2e-linux-test").await.unwrap();
// Assert VM is running
let status = executor.vm_status("e2e-linux-test").await.unwrap();
assert_eq!(status, VmStatus::Running);
// Cleanup
executor.destroy_vm("e2e-linux-test").await.unwrap();
executor.undefine_vm("e2e-linux-test").await.unwrap();
executor.delete_network("e2e-test-net").await.unwrap();
}
```
This test validates:
- ISO download works (via `harmony_assets` if refactored, or built-in KVM module download)
- libvirt XML generation is correct
- VM lifecycle (define → start → status → destroy → undefine)
- Network creation/deletion
### 6.2 OKD HA Cluster E2E test
The full integration test. This is long-running (30-60 minutes) and should only run nightly or on-demand.
```rust
// tests/e2e/tests/kvm_okd_ha.rs
#[tokio::test]
#[ignore] // Requires KVM + significant resources. Run nightly.
async fn test_okd_ha_cluster_on_kvm() {
// 1. Create virtual infrastructure
// - OPNsense gateway VM
// - 3 control plane VMs
// - 3 worker VMs
// - Virtual network (harmonylan)
// 2. Run OKD installation scores
// (the kvm_okd_ha_cluster example, but as a test)
// 3. Wait for OKD API server to become reachable
// - Poll https://api.okd.harmonylan:6443 until it responds
// - Timeout: 30 minutes
// 4. Assert cluster health
// - All nodes in Ready state
// - ClusterVersion reports Available=True
// - Sample workload (nginx) deploys and pod reaches Running
// 5. Cleanup
// - Destroy all VMs
// - Delete virtual networks
// - Clean up disk images
}
```
### 6.3 CI runner requirements
The KVM E2E test needs a runner with:
- **Hardware**: 32GB+ RAM, 8+ CPU cores, 100GB+ disk
- **Software**: libvirt, QEMU/KVM, `virsh`, nested virtualization enabled
- **Network**: Outbound internet access (to download ISOs, OKD images)
- **Permissions**: User in `libvirt` group, or root access
Options:
- **Dedicated bare-metal machine** registered as a self-hosted GitHub Actions runner
- **Cloud VM with nested virt** (e.g., GCP n2-standard-8 with `--enable-nested-virtualization`)
- **Manual trigger only** — developer runs locally, CI just tracks pass/fail
### 6.4 Nightly CI job
```yaml
# .github/workflows/e2e-kvm.yml
name: E2E KVM Tests
on:
schedule:
- cron: '0 2 * * *' # 2 AM daily
workflow_dispatch: # Manual trigger
jobs:
kvm-tests:
runs-on: [self-hosted, kvm] # Label for KVM-capable runners
timeout-minutes: 90
steps:
- uses: actions/checkout@v4
- name: Run KVM E2E tests
run: cargo test -p harmony-e2e-tests -- --ignored --test-threads=1
env:
RUST_LOG: info
HARMONY_KVM_URI: qemu:///system
- name: Cleanup VMs on failure
if: failure()
run: |
virsh list --all --name | grep e2e | xargs -I {} virsh destroy {} || true
virsh list --all --name | grep e2e | xargs -I {} virsh undefine {} --remove-all-storage || true
```
### 6.5 Test resource management
KVM tests create real resources that must be cleaned up even on failure. Implement a test fixture pattern:
```rust
struct KvmTestFixture {
executor: KvmExecutor,
vms: Vec<String>,
networks: Vec<String>,
}
impl KvmTestFixture {
fn track_vm(&mut self, name: &str) { self.vms.push(name.to_string()); }
fn track_network(&mut self, name: &str) { self.networks.push(name.to_string()); }
}
impl Drop for KvmTestFixture {
fn drop(&mut self) {
// Best-effort cleanup of all tracked resources
for vm in &self.vms {
let _ = std::process::Command::new("virsh")
.args(["destroy", vm]).output();
let _ = std::process::Command::new("virsh")
.args(["undefine", vm, "--remove-all-storage"]).output();
}
for net in &self.networks {
let _ = std::process::Command::new("virsh")
.args(["net-destroy", net]).output();
let _ = std::process::Command::new("virsh")
.args(["net-undefine", net]).output();
}
}
}
```
## Deliverables
- [ ] `test_linux_vm_boots_from_iso` — passing KVM smoke test
- [ ] `test_okd_ha_cluster_on_kvm` — full OKD installation test
- [ ] `KvmTestFixture` with resource cleanup on test failure
- [ ] Nightly CI job on KVM-capable runner
- [ ] Force-cleanup script for leaked VMs/networks
- [ ] Documentation: how to set up a KVM runner for E2E tests