harmony/ROADMAP/iot_platform/arm_vm_plan.md


aarch64 VM support — plan

Why

The v0 walking skeleton's whole point is validating the IoT agent against the actual distribution, architecture, and package set that the end-customer's Pi 5 devices run (ROADMAP §1). Everything green so far runs the agent against an x86_64 Ubuntu cloud image with an x86_64 Rust binary. That proves the code path works, but not that the ARM target works: every passing smoke-a3 run today is evidence that the wrong thing works.

This plan adds arm64 emulation on x86_64 hosts (no hardware needed for CI) so:

  • the VM runs the same Ubuntu 24.04 arm64 cloud image customers will eventually flash onto a Pi;
  • the iot-agent shipped to it is a real aarch64 binary produced by our existing cross-compile toolchain;
  • apt/systemd/podman on the VM are the actual arm64 packages; and
  • smoke-a3 exercises all of it end-to-end.

Acceptable cost: emulated boot is 5-15× slower than KVM-accelerated boot. That's the price of the target-arch validation.

Shape of the change

Additive, type-safe, default-preserving. Existing callers of VirtualMachineSpec keep working unchanged; arm64 is opt-in via a new field.

1. Architecture enum on the VM spec

Introduce VmArchitecture in harmony/src/domain/topology/virtualization.rs:

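```rust
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize, Default)]
pub enum VmArchitecture {
    /// Default, so every existing spec keeps today's behaviour.
    #[default]
    X86_64,
    /// Emulated (TCG) when the host is x86_64.
    Aarch64,
}
```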

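Add pub architecture: VmArchitecture to VirtualMachineSpec. With #[derive(Default)] on the spec and VmArchitecture::X86_64 as the #[default] variant, every existing call site that builds the struct via ..Default::default() (or through a constructor) continues to compile and gets X86_64; a struct literal that names every field explicitly would need the new field added. New constructor: VirtualMachineSpec::new_aarch64(name) for clarity.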

Same treatment on VmConfig in modules/kvm/types.rs — add a pub architecture: VmArchitecture field with Default impl.
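A minimal sketch of the default-preserving shape, assuming a trimmed-down VirtualMachineSpec whose fields name, vcpus and memory_mib are hypothetical stand-ins for the real ones (serde derives are dropped so it compiles with std alone). A call site written with ..Default::default() picks up X86_64 without edits, while new_aarch64 opts in explicitly:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum VmArchitecture {
    #[default]
    X86_64,
    Aarch64,
}

/// Hypothetical, trimmed-down spec: the real VirtualMachineSpec has more fields.
#[derive(Debug, Clone, Default)]
pub struct VirtualMachineSpec {
    pub name: String,
    pub vcpus: u32,
    pub memory_mib: u64,
    /// New field; Default yields X86_64, so existing specs are unchanged.
    pub architecture: VmArchitecture,
}

impl VirtualMachineSpec {
    /// Opt-in constructor for emulated arm64 guests; other fields use Default.
    pub fn new_aarch64(name: impl Into<String>) -> Self {
        Self {
            name: name.into(),
            architecture: VmArchitecture::Aarch64,
            ..Default::default()
        }
    }
}
```

The same pattern would apply to VmConfig: a Default-backed field plus a builder setter, so only callers that care about arm64 ever mention it.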

2. Libvirt XML parameterization

Rewrite modules/kvm/xml.rs::domain_xml to branch on arch. The per-arch differences below map the reference QEMU flags directly onto libvirt XML:

| QEMU flag | libvirt XML | x86_64 | aarch64 |
|---|---|---|---|
| `-accel kvm` vs `-accel tcg` | `<domain type='…'>` | `kvm` | `qemu` |
| `-M virt` / `-M q35` | `<os><type machine='…'>` | `q35` | `virt` |
| arch | `<os><type arch='…'>` | `x86_64` | `aarch64` |
| emulator binary | `<emulator>…</emulator>` | `/usr/bin/qemu-system-x86_64` | `/usr/bin/qemu-system-aarch64` |
| `-cpu max,pauth-impdef=on` | `<cpu mode='custom'><model>max</model><feature …/></cpu>` | `host-model` (current) | `max` + `pauth-impdef` |
| `-bios QEMU_EFI.fd` | `<os><loader readonly='yes' type='pflash'>…</loader><nvram>…</nvram></os>` | n/a (BIOS) | AAVMF CODE + VARS pflash pair |
| `-accel tcg,thread=multi` | none; MTTCG is default-on when `type='qemu'` and QEMU ≥ 9.1 | n/a | implicit |

Type safety: introduce a DomainXmlParams struct that captures the arch-specific knobs (domain_type, arch, machine, emulator path, cpu mode, firmware) and is resolved from VmArchitecture. The top-level domain_xml then consumes a fully resolved DomainXmlParams rather than branching on if arch == X86_64 and splicing strings.
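One possible shape for DomainXmlParams, sketched under assumptions: the CpuMode variant names, the uefi flag, and the os_xml helper are hypothetical, and firmware paths are passed in at render time because they come from runtime discovery rather than from the enum. The values are taken straight from the table above; VmArchitecture is repeated without serde so the block compiles alone.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum VmArchitecture {
    #[default]
    X86_64,
    Aarch64,
}

/// How the <cpu> element is rendered (hypothetical variant names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CpuMode {
    /// <cpu mode='host-model'/>: today's x86_64 behaviour.
    HostModel,
    /// <cpu mode='custom'><model>max</model>… plus pauth-impdef for cheaper TCG.
    MaxPauthImpdef,
}

/// Every arch-dependent knob of the domain XML, resolved once up front.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DomainXmlParams {
    pub domain_type: &'static str,
    pub arch: &'static str,
    pub machine: &'static str,
    pub emulator: &'static str,
    pub cpu: CpuMode,
    /// True when the guest boots from a UEFI pflash CODE + NVRAM pair.
    pub uefi: bool,
}

impl From<VmArchitecture> for DomainXmlParams {
    fn from(arch: VmArchitecture) -> Self {
        match arch {
            VmArchitecture::X86_64 => Self {
                domain_type: "kvm",
                arch: "x86_64",
                machine: "q35",
                emulator: "/usr/bin/qemu-system-x86_64",
                cpu: CpuMode::HostModel,
                uefi: false,
            },
            VmArchitecture::Aarch64 => Self {
                domain_type: "qemu", // TCG emulation on x86_64 hosts
                arch: "aarch64",
                machine: "virt",
                emulator: "/usr/bin/qemu-system-aarch64",
                cpu: CpuMode::MaxPauthImpdef,
                uefi: true,
            },
        }
    }
}

impl DomainXmlParams {
    /// Renders the <os> element. `firmware` is (CODE path, per-VM NVRAM path);
    /// it is only emitted for arches that boot via UEFI.
    pub fn os_xml(&self, firmware: Option<(&str, &str)>) -> String {
        let mut xml = format!(
            "<os>\n  <type arch='{}' machine='{}'>hvm</type>\n",
            self.arch, self.machine
        );
        if let (true, Some((code, nvram))) = (self.uefi, firmware) {
            xml.push_str(&format!(
                "  <loader readonly='yes' type='pflash'>{code}</loader>\n  <nvram>{nvram}</nvram>\n"
            ));
        }
        xml.push_str("</os>");
        xml
    }
}
```

Because the match is exhaustive, adding a third arch later fails to compile until every knob is filled in, which is the main payoff over string branching inside domain_xml.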

3. UEFI firmware discovery

aarch64 guests boot via UEFI, not BIOS. libvirt needs two files:

  • AAVMF_CODE.fd — the firmware code (read-only, shared)
  • AAVMF_VARS.fd — per-VM NVRAM (writable, per-domain copy)

Common paths across distros:

| Distro | CODE | VARS (template) |
|---|---|---|
| Arch | `/usr/share/edk2/aarch64/QEMU_CODE.fd` | `/usr/share/edk2/aarch64/QEMU_VARS.fd` |
| Debian/Ubuntu | `/usr/share/AAVMF/AAVMF_CODE.fd` | `/usr/share/AAVMF/AAVMF_VARS.fd` |
| Fedora | `/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw` | `/usr/share/edk2/aarch64/vars-template-pflash.raw` |

New module harmony/src/modules/kvm/firmware.rs:

  • pub fn discover_aarch64_firmware() -> Result<AarchFirmware, KvmError> walks a small list of known paths and returns the first viable pair as a typed AarchFirmware { code: PathBuf, vars_template: PathBuf }.
  • Per-VM NVRAM copy is handled in KvmVirtualMachineHost: at ensure_vm time, copy vars_template into $pool/<vm_name>-VARS.fd and reference it in the domain XML.
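A std-only sketch of both steps, under stated assumptions: the KvmError variants, the root-parameterised discover_aarch64_firmware_in (added so the probe can be tested against a temp dir), and ensure_vm_nvram are hypothetical names. The known-paths list mirrors the distro table; its probe order is arbitrary. Leaving an existing NVRAM file in place on re-runs is a design choice assumed here, so a VM keeps its UEFI boot entries across repeated ensure_vm calls.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AarchFirmware {
    pub code: PathBuf,
    pub vars_template: PathBuf,
}

/// Stand-in for the real KvmError (hypothetical variants).
#[derive(Debug)]
pub enum KvmError {
    FirmwareNotFound { searched: Vec<PathBuf> },
    Io(io::Error),
}

/// (CODE, VARS template) pairs from the distro table.
const KNOWN_AARCH64_FIRMWARE: &[(&str, &str)] = &[
    ("/usr/share/AAVMF/AAVMF_CODE.fd", "/usr/share/AAVMF/AAVMF_VARS.fd"),
    ("/usr/share/edk2/aarch64/QEMU_CODE.fd", "/usr/share/edk2/aarch64/QEMU_VARS.fd"),
    (
        "/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw",
        "/usr/share/edk2/aarch64/vars-template-pflash.raw",
    ),
];

/// Returns the first pair whose CODE and VARS files both exist under `root`.
pub fn discover_aarch64_firmware_in(root: &Path) -> Result<AarchFirmware, KvmError> {
    let mut searched = Vec::new();
    for (code, vars) in KNOWN_AARCH64_FIRMWARE {
        // Re-root the absolute path: "/" in production, a temp dir in tests.
        let code = root.join(code.trim_start_matches('/'));
        let vars = root.join(vars.trim_start_matches('/'));
        if code.is_file() && vars.is_file() {
            return Ok(AarchFirmware { code, vars_template: vars });
        }
        searched.push(code);
    }
    Err(KvmError::FirmwareNotFound { searched })
}

pub fn discover_aarch64_firmware() -> Result<AarchFirmware, KvmError> {
    discover_aarch64_firmware_in(Path::new("/"))
}

/// Copies the VARS template to `$pool/<vm_name>-VARS.fd` if it is not already there
/// and returns the path to reference from <nvram> in the domain XML.
pub fn ensure_vm_nvram(fw: &AarchFirmware, pool: &Path, vm_name: &str) -> Result<PathBuf, KvmError> {
    let nvram = pool.join(format!("{vm_name}-VARS.fd"));
    if !nvram.exists() {
        fs::copy(&fw.vars_template, &nvram).map_err(KvmError::Io)?;
    }
    Ok(nvram)
}
```

The CODE file is referenced in place (read-only and shared), while each VM writes only to its own NVRAM copy, so two guests never contend for the same VARS store.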

4. Cloud image for arm64

Add to modules/iot/assets.rs:

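```rust
use std::path::PathBuf;
// VmArchitecture and ExecutorError come from the harmony crate.

pub const UBUNTU_2404_CLOUDIMG_ARM64_URL: &str =
    "https://cloud-images.ubuntu.com/releases/24.04/release/ubuntu-24.04-server-cloudimg-arm64.img";
pub const UBUNTU_2404_CLOUDIMG_ARM64_SHA256: &str = "<pinned>";

/// Returns the local path of the SHA256-verified Ubuntu 24.04 cloud image for `arch`.
pub async fn ensure_ubuntu_2404_cloud_image_for_arch(
    arch: VmArchitecture,
) -> Result<PathBuf, ExecutorError> {
    todo!() // body elided in this plan
}
```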

The existing ensure_ubuntu_2404_cloud_image() becomes a thin wrapper that calls the arch-aware fn with X86_64, preserving all callers. SHA256 gets pinned against the live Ubuntu arm64 image at commit time.
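The arch-to-asset mapping at the heart of the arch-aware function could look like the sketch below. CloudImageSource and file_name are hypothetical helpers, the amd64 URL is an assumption about the existing x86_64 asset, and the SHA256 values stay "<pinned>" placeholders exactly as in the plan. The download itself is left out so the block stays synchronous and std-only.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum VmArchitecture {
    #[default]
    X86_64,
    Aarch64,
}

/// Download source for one arch's image (hypothetical helper type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CloudImageSource {
    pub url: &'static str,
    pub sha256: &'static str,
}

impl CloudImageSource {
    pub fn for_arch(arch: VmArchitecture) -> Self {
        match arch {
            // Assumed URL for the existing amd64 asset; the real constant may differ.
            VmArchitecture::X86_64 => Self {
                url: "https://cloud-images.ubuntu.com/releases/24.04/release/ubuntu-24.04-server-cloudimg-amd64.img",
                sha256: "<pinned>",
            },
            VmArchitecture::Aarch64 => Self {
                url: "https://cloud-images.ubuntu.com/releases/24.04/release/ubuntu-24.04-server-cloudimg-arm64.img",
                sha256: "<pinned>",
            },
        }
    }

    /// Cache file name = last URL segment, so amd64 and arm64 images can share one
    /// cache directory without overwriting each other.
    pub fn file_name(&self) -> &'static str {
        let url: &'static str = self.url;
        url.rsplit('/').next().unwrap_or(url)
    }
}
```

The thin x86_64 wrapper then reduces to resolving CloudImageSource::for_arch(VmArchitecture::default()) and running the existing download-and-verify path on it.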

5. Preflight additions

In modules/iot/preflight.rs, when the caller asks for arm64 VMs (new check_iot_smoke_preflight_for_arch(VmArchitecture) wrapper):

  • verify qemu-system-aarch64 is on PATH;
  • verify the aarch64 firmware pair exists (reuse the discovery fn);
  • verify QEMU version ≥ 9.1 (MTTCG is a real perf multiplier — a warning, not a hard block, if the host is older).
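A sketch of the version gate, assuming the preflight shells out to qemu-system-aarch64 --version and reads a first line of the form "QEMU emulator version X.Y.Z (…)". QemuVersionCheck, MIN_MTTCG_QEMU, and the function names are hypothetical; the 9.1 threshold is the plan's. A missing binary surfaces as an io::Error of kind NotFound, which doubles as the PATH check.

```rust
use std::io;
use std::process::Command;

/// Oldest QEMU the plan treats as having MTTCG on by default.
pub const MIN_MTTCG_QEMU: (u32, u32) = (9, 1);

#[derive(Debug, PartialEq, Eq)]
pub enum QemuVersionCheck {
    Ok,
    /// Older than MIN_MTTCG_QEMU: preflight warns but does not block.
    WarnOld { found: (u32, u32) },
    /// Output had no recognisable "version X.Y" token.
    Unparseable,
}

/// Extracts (major, minor) from `--version` output,
/// e.g. "QEMU emulator version 8.2.2 (Debian ...)" gives (8, 2).
pub fn parse_qemu_version(output: &str) -> Option<(u32, u32)> {
    let after = output.split("version ").nth(1)?;
    let token = after.split_whitespace().next()?;
    let mut parts = token.split('.');
    let major = parts.next()?.parse().ok()?;
    let minor = parts.next()?.parse().ok()?;
    Some((major, minor))
}

pub fn check_qemu_version(output: &str) -> QemuVersionCheck {
    match parse_qemu_version(output) {
        // Tuples compare lexicographically: major first, then minor.
        Some(found) if found >= MIN_MTTCG_QEMU => QemuVersionCheck::Ok,
        Some(found) => QemuVersionCheck::WarnOld { found },
        None => QemuVersionCheck::Unparseable,
    }
}

/// Runs the emulator binary found on PATH and classifies its version.
pub fn probe_qemu_aarch64() -> io::Result<QemuVersionCheck> {
    let out = Command::new("qemu-system-aarch64").arg("--version").output()?;
    Ok(check_qemu_version(&String::from_utf8_lossy(&out.stdout)))
}
```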

6. Cross-compiled agent

smoke-a3.sh phase 2 currently does native cargo build --release -p iot-agent-v0. When arch=aarch64:

  • cargo build --release --target aarch64-unknown-linux-gnu -p iot-agent-v0
  • AGENT_BINARY points at target/aarch64-unknown-linux-gnu/release/iot-agent-v0

Opt-in via --arch aarch64 CLI flag on both example_iot_vm_setup and smoke-a3.sh. Default stays x86_64.
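How the --arch value could drive both the cargo invocation and AGENT_BINARY, sketched in Rust for the example binary's side. The accepted spellings, the helper names agent_build_args and agent_binary, and the assumption that x86_64 means a native build on an x86_64 host are all hypothetical; the triple, package name, and output path come from the bullets above.

```rust
use std::path::{Path, PathBuf};
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum VmArchitecture {
    #[default]
    X86_64,
    Aarch64,
}

// Parses the --arch flag value.
impl FromStr for VmArchitecture {
    type Err = String;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "x86_64" => Ok(Self::X86_64),
            "aarch64" => Ok(Self::Aarch64),
            other => Err(format!("unsupported --arch {other:?}; expected x86_64 or aarch64")),
        }
    }
}

impl VmArchitecture {
    /// Target triple to cross-compile for; None means a native build on the x86_64 host.
    pub fn cargo_target(self) -> Option<&'static str> {
        match self {
            Self::X86_64 => None,
            Self::Aarch64 => Some("aarch64-unknown-linux-gnu"),
        }
    }
}

/// Arguments passed to `cargo` for the agent build.
pub fn agent_build_args(arch: VmArchitecture) -> Vec<&'static str> {
    let mut args = vec!["build", "--release"];
    if let Some(triple) = arch.cargo_target() {
        args.extend(["--target", triple]);
    }
    args.extend(["-p", "iot-agent-v0"]);
    args
}

/// Where cargo leaves the binary, i.e. what AGENT_BINARY should point at.
pub fn agent_binary(target_dir: &Path, arch: VmArchitecture) -> PathBuf {
    let profile_dir = match arch.cargo_target() {
        Some(triple) => target_dir.join(triple).join("release"),
        None => target_dir.join("release"),
    };
    profile_dir.join("iot-agent-v0")
}
```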

7. Timeout bumps

First-boot cloud-init on emulated aarch64 takes 3-6× longer than KVM-accel x86_64. Bump wait_for_ip timeout from 300s → 900s when arch=aarch64. Smoke-a3's phase 5 reboot gate also lengthens.
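One way to keep the timeout bumps in one place, sketched as a per-arch multiplier. The method names are hypothetical, and the factor 3 only reproduces the plan's 300s → 900s wait_for_ip bump. The plan gives no number for the phase 5 reboot gate, so applying the same factor there is an assumption.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum VmArchitecture {
    #[default]
    X86_64,
    Aarch64,
}

impl VmArchitecture {
    /// Multiplier for boot-related waits; 3 matches the 300 s -> 900 s bump.
    pub fn boot_timeout_factor(self) -> u32 {
        match self {
            Self::X86_64 => 1,
            Self::Aarch64 => 3,
        }
    }

    /// Scales a timeout tuned for KVM-accelerated x86_64.
    pub fn scale_boot_timeout(self, x86_64_baseline: Duration) -> Duration {
        x86_64_baseline * self.boot_timeout_factor()
    }

    /// wait_for_ip budget: 300 s baseline, scaled per arch.
    pub fn wait_for_ip_timeout(self) -> Duration {
        self.scale_boot_timeout(Duration::from_secs(300))
    }
}
```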

Files to touch

| File | Change |
|---|---|
| `harmony/src/domain/topology/virtualization.rs` | Add VmArchitecture, field on VirtualMachineSpec, constructor helper. |
| `harmony/src/modules/kvm/types.rs` | Add architecture field on VmConfig, VmConfigBuilder setter. |
| `harmony/src/modules/kvm/xml.rs` | Rewrite domain_xml to take DomainXmlParams resolved from arch. |
| `harmony/src/modules/kvm/firmware.rs` (new) | Discovery of AAVMF code+vars paths; AarchFirmware struct. |
| `harmony/src/modules/kvm/topology.rs` | Copy per-VM NVRAM template on ensure_vm; thread arch through to XML. |
| `harmony/src/modules/iot/assets.rs` | ensure_ubuntu_2404_cloud_image_for_arch(arch); pin arm64 URL+sha256. |
| `harmony/src/modules/iot/preflight.rs` | Arch-aware preflight; qemu-system-aarch64 + firmware + qemu-version. |
| `examples/iot_vm_setup/src/main.rs` | `--arch` flag (`x86_64` by default, or `aarch64`). |
| `iot/scripts/smoke-a3.sh` | Arch flag plumbing; cross-compile; extended timeouts; preflight. |
| `iot/scripts/smoke-a3-arm.sh` (new) | Dedicated arm smoke wrapper used as the CI hook: `ARCH=aarch64 ./smoke-a3.sh`. |

Out of scope

  • Migrating OPNsense + other KVM examples to VirtualMachineHost / ProvisionVmScore — real inconsistency in the codebase but a separate refactor, orthogonal to the ARM work. Filing as follow-up.
  • KVM-accelerated aarch64-on-aarch64 (e.g. running on an Ampere runner). Emulation covers the x86 CI story; native aarch64 runners would use <domain type='kvm'> and no MTTCG flags, which the arch enum and the existing x86_64 XML path already model, so this is effectively free when we get there.
  • Supporting multiple simultaneous guest arches on one host in the same smoke run. Single-arch-per-run keeps everything simple.
  • Pinning AAVMF firmware like we pin the cloud image. Firmware is distro-package-managed; pin when we hit a regression.

Commit plan (in order)

  1. VmArchitecture domain type + VirtualMachineSpec.architecture field: tiny, just the enum and struct field; no behaviour change (all callers get X86_64 via Default).

  2. XML parameterization via DomainXmlParams: rewrite domain_xml to be arch-driven. Tests under harmony/src/modules/kvm/xml.rs get an arm64 variant.

  3. AAVMF firmware discovery + per-VM NVRAM copy: firmware.rs + the copy in topology.rs::ensure_vm.

  4. arm64 cloud image asset + preflight: ensure_ubuntu_2404_cloud_image_for_arch(arch) plus preflight extensions. SHA256 pinned at commit time via a one-off curl | sha256sum.

  5. Example + smoke script plumbing: --arch flag, cross-compile, timeout bumps, smoke-a3-arm.sh wrapper.

  6. End-to-end verification: run smoke-a3-arm.sh from a fresh $HARMONY_DATA_DIR/iot/ and confirm the aarch64 agent boots, joins NATS, and survives a power-cycle. Document timing in the commit message.

Verification

  • cargo check --all-targets --features kvm: clean.
  • cargo clippy --no-deps -- -D warnings on touched files: clean.
  • cargo fmt --check: clean.
  • aarch64 cross-compile of harmony + iot crates: still green.
  • Fresh-cache arm64 smoke-a3: PASS, timing documented.
  • Existing x86_64 smoke-a3: still PASS (regression guard).