# Phase 6: E2E Tests for OKD HA Cluster on KVM

## Goal

Prove the full OKD bare-metal installation flow works end-to-end using KVM virtual machines. This is the ultimate validation of Harmony's core value proposition: declare an OKD cluster, point it at infrastructure, watch it materialize.
## Prerequisites

- Phase 5 complete (test harness exists, k3d tests passing)
- `feature/kvm-module` merged into `main`
- A CI runner with libvirt/KVM access and nested virtualization support

## Architecture

The KVM branch already has a `kvm_okd_ha_cluster` example that creates the following topology:
```
                 Host bridge (WAN)
                         |
                +--------------------+
                |   OPNsense         |  192.168.100.1
                |   gateway + PXE    |
                +--------+-----------+
                         |
           harmonylan (192.168.100.0/24)
     +---------+---------+---------+---------+
     |         |         |         |         |
+----+---+ +---+---+ +---+---+ +---+---+ +--+----+
| cp0    | | cp1   | | cp2   | |worker0| |worker1|
| .10    | | .11   | | .12   | | .20   | | .21   |
+--------+ +-------+ +-------+ +-------+ +---+---+
                                             |
                                       +-----+----+
                                       | worker2  |
                                       | .22      |
                                       +----------+
```
The test needs to orchestrate this entire setup, wait for OKD to converge, and assert the cluster is healthy.
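One way to keep the test's expectations in sync with the diagram is to describe the topology as data. The sketch below is illustrative only: `NodeRole`, `NodeSpec`, and `okd_ha_topology` are hypothetical names, not part of the KVM module, and the addresses are simply those shown in the diagram.

```rust
use std::net::Ipv4Addr;

/// Role of a node in the test cluster (hypothetical type, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NodeRole {
    ControlPlane,
    Worker,
}

/// One VM from the diagram: name, role, and static address on harmonylan.
#[derive(Debug, Clone, PartialEq, Eq)]
struct NodeSpec {
    name: String,
    role: NodeRole,
    ip: Ipv4Addr,
}

/// Builds the six-node layout from the diagram: cp0..cp2 at .10 to .12 and
/// worker0..worker2 at .20 to .22, all inside 192.168.100.0/24.
fn okd_ha_topology() -> Vec<NodeSpec> {
    let mut nodes = Vec::new();
    for i in 0..3u8 {
        nodes.push(NodeSpec {
            name: format!("cp{i}"),
            role: NodeRole::ControlPlane,
            ip: Ipv4Addr::new(192, 168, 100, 10 + i),
        });
    }
    for i in 0..3u8 {
        nodes.push(NodeSpec {
            name: format!("worker{i}"),
            role: NodeRole::Worker,
            ip: Ipv4Addr::new(192, 168, 100, 20 + i),
        });
    }
    nodes
}
```

A health assertion could then iterate over this list and check that every expected node name reports Ready, rather than hard-coding counts in several places.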
## Tasks

### 6.1 Start with `example_linux_vm`: the simplest KVM test

Before tackling the full OKD stack, validate the KVM module itself with the simplest possible test:
```rust
// tests/e2e/tests/kvm_linux_vm.rs

#[tokio::test]
#[ignore] // Requires libvirt access. Run with: cargo test -- --ignored
async fn test_linux_vm_boots_from_iso() {
    let executor = KvmExecutor::from_env().unwrap();

    // Create isolated network
    let network = NetworkConfig {
        name: "e2e-test-net".to_string(),
        bridge: "virbr200".to_string(),
        // ...
    };
    executor.ensure_network(&network).await.unwrap();

    // Define and start VM
    let vm_config = VmConfig::builder("e2e-linux-test")
        .vcpus(1)
        .memory_gb(1)
        .disk(5)
        .network(NetworkRef::named("e2e-test-net"))
        .cdrom("https://releases.ubuntu.com/24.04/ubuntu-24.04-live-server-amd64.iso")
        .boot_order([BootDevice::Cdrom, BootDevice::Disk])
        .build();

    executor.ensure_vm(&vm_config).await.unwrap();
    executor.start_vm("e2e-linux-test").await.unwrap();

    // Assert VM is running
    let status = executor.vm_status("e2e-linux-test").await.unwrap();
    assert_eq!(status, VmStatus::Running);

    // Cleanup
    executor.destroy_vm("e2e-linux-test").await.unwrap();
    executor.undefine_vm("e2e-linux-test").await.unwrap();
    executor.delete_network("e2e-test-net").await.unwrap();
}
```
This test validates:

- ISO download works (via `harmony_assets` if refactored, otherwise via the KVM module's built-in download)
- libvirt XML generation is correct
- VM lifecycle (define → start → status → destroy → undefine)
- Network creation/deletion
### 6.2 OKD HA Cluster E2E test

This is the full integration test. It is long-running (30-60 minutes), so it should run only nightly or on demand.
```rust
// tests/e2e/tests/kvm_okd_ha.rs

#[tokio::test]
#[ignore] // Requires KVM + significant resources. Run nightly.
async fn test_okd_ha_cluster_on_kvm() {
    // 1. Create virtual infrastructure
    //    - OPNsense gateway VM
    //    - 3 control plane VMs
    //    - 3 worker VMs
    //    - Virtual network (harmonylan)

    // 2. Run OKD installation scores
    //    (the kvm_okd_ha_cluster example, but as a test)

    // 3. Wait for OKD API server to become reachable
    //    - Poll https://api.okd.harmonylan:6443 until it responds
    //    - Timeout: 30 minutes

    // 4. Assert cluster health
    //    - All nodes in Ready state
    //    - ClusterVersion reports Available=True
    //    - Sample workload (nginx) deploys and pod reaches Running

    // 5. Cleanup
    //    - Destroy all VMs
    //    - Delete virtual networks
    //    - Clean up disk images
}
```
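Step 3 in the outline above is a poll-with-deadline loop. A minimal sketch of that pattern follows, using only the standard library so it stays independent of any HTTP client. `poll_until` is a hypothetical helper, not an existing Harmony function, and the probe closure stands in for whatever check the test ends up using against the API server. Inside the `#[tokio::test]` an async variant built on `tokio::time::sleep` would be the more natural fit, since `std::thread::sleep` blocks the executor thread.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Calls `probe` repeatedly until it returns `Some(value)` or `timeout` elapses.
/// On timeout, returns `Err` carrying the number of attempts that were made.
/// (Hypothetical helper for illustration; not part of Harmony.)
fn poll_until<T>(
    timeout: Duration,
    interval: Duration,
    mut probe: impl FnMut() -> Option<T>,
) -> Result<T, u32> {
    let deadline = Instant::now() + timeout;
    let mut attempts = 0;
    loop {
        attempts += 1;
        // The probe always runs at least once, even with a zero timeout.
        if let Some(value) = probe() {
            return Ok(value);
        }
        if Instant::now() >= deadline {
            return Err(attempts);
        }
        sleep(interval);
    }
}
```

For the OKD test, the call would use a 30-minute timeout and an interval of perhaps 15 to 30 seconds. Returning the attempt count on failure gives the test something concrete to print when the API server never comes up.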
### 6.3 CI runner requirements

The KVM E2E test needs a runner with:

- **Hardware**: 32 GB+ RAM, 8+ CPU cores, 100 GB+ disk
- **Software**: libvirt, QEMU/KVM, `virsh`, nested virtualization enabled
- **Network**: Outbound internet access (to download ISOs and OKD images)
- **Permissions**: User in the `libvirt` group, or root access

Options:

- **Dedicated bare-metal machine** registered as a self-hosted GitHub Actions runner
- **Cloud VM with nested virt** (e.g., GCP n2-standard-8 with `--enable-nested-virtualization`)
- **Manual trigger only**: a developer runs the tests locally, and CI just tracks pass/fail
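A preflight check can fail fast when a runner does not meet the software requirements, instead of failing deep inside a VM boot. The sketch below assumes a Linux host: it looks for `/dev/kvm` and reads the `nested` parameter that the `kvm_intel` and `kvm_amd` kernel modules expose under `/sys/module`. The function names are hypothetical, and the check is best-effort rather than a complete validation of the runner.

```rust
use std::fs;
use std::path::Path;

/// Interprets the contents of a kvm `nested` module parameter file.
/// Kernels report either `Y`/`N` or `1`/`0`, depending on version and module.
fn nested_flag_enabled(contents: &str) -> bool {
    matches!(contents.trim(), "Y" | "y" | "1")
}

/// Best-effort preflight for a KVM runner (hypothetical helper).
/// Returns a list of problems; an empty list means the host looks usable.
fn kvm_preflight() -> Vec<String> {
    let mut problems = Vec::new();
    if !Path::new("/dev/kvm").exists() {
        problems.push("/dev/kvm not found (KVM unavailable or module not loaded)".to_string());
    }
    // Either vendor module may be loaded; an unreadable file counts as "not enabled".
    let nested = ["kvm_intel", "kvm_amd"].iter().any(|module| {
        fs::read_to_string(format!("/sys/module/{module}/parameters/nested"))
            .map(|s| nested_flag_enabled(&s))
            .unwrap_or(false)
    });
    if !nested {
        problems.push("nested virtualization does not appear to be enabled".to_string());
    }
    problems
}
```

Calling `kvm_preflight()` at the start of each KVM test and panicking with the joined problem list would turn an obscure libvirt error into an actionable message.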
### 6.4 Nightly CI job

```yaml
# .github/workflows/e2e-kvm.yml
name: E2E KVM Tests
on:
  schedule:
    - cron: '0 2 * * *'  # 02:00 UTC daily
  workflow_dispatch:      # Manual trigger

jobs:
  kvm-tests:
    runs-on: [self-hosted, kvm]  # Label for KVM-capable runners
    timeout-minutes: 90
    steps:
      - uses: actions/checkout@v4

      - name: Run KVM E2E tests
        run: cargo test -p harmony-e2e-tests -- --ignored --test-threads=1
        env:
          RUST_LOG: info
          HARMONY_KVM_URI: qemu:///system

      - name: Cleanup VMs on failure
        if: failure()
        run: |
          virsh list --all --name | grep e2e | xargs -I {} virsh destroy {} || true
          virsh list --all --name | grep e2e | xargs -I {} virsh undefine {} --remove-all-storage || true
```
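The cleanup step's `grep e2e` filter could also live in a small Rust helper, which would make the force-cleanup logic reusable outside CI. This is a sketch under assumptions: `leaked_domains` and `force_cleanup` are hypothetical names, and test VMs are assumed to contain the marker `e2e` in their names, as the workflow's filter implies. The `virsh` subcommands are the same ones the workflow step runs.

```rust
use std::process::Command;

/// Given the output of `virsh list --all --name` (one domain name per line),
/// returns the names that contain `marker`, mirroring the `grep e2e` filter.
fn leaked_domains<'a>(virsh_output: &'a str, marker: &str) -> Vec<&'a str> {
    virsh_output
        .lines()
        .map(str::trim)
        .filter(|name| !name.is_empty() && name.contains(marker))
        .collect()
}

/// Stops and undefines every domain whose name contains `marker`.
/// Only a failure to list domains is reported; per-domain errors are ignored
/// because a domain may already be stopped or gone.
fn force_cleanup(marker: &str) -> std::io::Result<()> {
    let output = Command::new("virsh")
        .args(["list", "--all", "--name"])
        .output()?;
    let listing = String::from_utf8_lossy(&output.stdout);
    for name in leaked_domains(&listing, marker) {
        let _ = Command::new("virsh").args(["destroy", name]).output();
        let _ = Command::new("virsh")
            .args(["undefine", name, "--remove-all-storage"])
            .output();
    }
    Ok(())
}
```

Keeping the name filter as a pure function means it can be unit-tested without libvirt, while `force_cleanup` remains a thin wrapper around `virsh`.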
### 6.5 Test resource management

KVM tests create real resources that must be cleaned up even on failure. Implement a test fixture that tracks every VM and network it creates and removes them when it is dropped:
```rust
struct KvmTestFixture {
    executor: KvmExecutor,
    vms: Vec<String>,
    networks: Vec<String>,
}

impl KvmTestFixture {
    fn track_vm(&mut self, name: &str) { self.vms.push(name.to_string()); }
    fn track_network(&mut self, name: &str) { self.networks.push(name.to_string()); }
}

impl Drop for KvmTestFixture {
    fn drop(&mut self) {
        // Best-effort cleanup of all tracked resources; errors are ignored.
        // Note: `virsh` connects to its default URI here. Set LIBVIRT_DEFAULT_URI
        // if the executor uses a different connection (e.g. qemu:///system).
        for vm in &self.vms {
            let _ = std::process::Command::new("virsh")
                .args(["destroy", vm]).output();
            let _ = std::process::Command::new("virsh")
                .args(["undefine", vm, "--remove-all-storage"]).output();
        }
        for net in &self.networks {
            let _ = std::process::Command::new("virsh")
                .args(["net-destroy", net]).output();
            let _ = std::process::Command::new("virsh")
                .args(["net-undefine", net]).output();
        }
    }
}
```
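The fixture relies on Rust running `Drop` while a panic unwinds, which is what happens when an `assert!` fails inside a test. The sketch below demonstrates that behavior without libvirt: `RecordingFixture` is a hypothetical stand-in that records the names it would clean up instead of calling `virsh`, and `catch_unwind` plays the role of the test harness.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::sync::{Arc, Mutex};

/// Stand-in for `KvmTestFixture`: records which VMs it would clean up,
/// so the drop-on-panic behavior can be checked without libvirt.
struct RecordingFixture {
    vms: Vec<String>,
    cleaned: Arc<Mutex<Vec<String>>>,
}

impl Drop for RecordingFixture {
    fn drop(&mut self) {
        // Tolerate a poisoned lock: cleanup should still proceed.
        let mut log = self.cleaned.lock().unwrap_or_else(|e| e.into_inner());
        for vm in &self.vms {
            log.push(vm.clone());
        }
    }
}

/// Runs a "test body" that panics after tracking a VM, then returns the cleanup log.
fn cleanup_log_after_panic() -> Vec<String> {
    let cleaned = Arc::new(Mutex::new(Vec::new()));
    let log = Arc::clone(&cleaned);
    let result: Result<(), _> = catch_unwind(AssertUnwindSafe(move || {
        let mut fixture = RecordingFixture { vms: Vec::new(), cleaned: log };
        fixture.vms.push("e2e-linux-test".to_string());
        panic!("simulated assertion failure");
    }));
    assert!(result.is_err());
    let guard = cleaned.lock().unwrap_or_else(|e| e.into_inner());
    guard.clone()
}
```

`Drop` does not run if the crate is built with `panic = "abort"` or if the process is killed, for example when the CI job hits its timeout. That is why the workflow-level cleanup and a force-cleanup script are still needed alongside the fixture.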
## Deliverables

- [ ] `test_linux_vm_boots_from_iso`: passing KVM smoke test
- [ ] `test_okd_ha_cluster_on_kvm`: full OKD installation test
- [ ] `KvmTestFixture` with resource cleanup on test failure
- [ ] Nightly CI job on KVM-capable runner
- [ ] Force-cleanup script for leaked VMs/networks
- [ ] Documentation: how to set up a KVM runner for E2E tests