harmony/ROADMAP/fleet_platform/arm_vm_plan.md
Jean-Gabriel Gill-Couture 7c1fedb303
refactor: rebrand iot → fleet, operator/agent crates → harmony-fleet-*
The IoT vocabulary was anchoring the codebase to one customer's
domain. The reconciler pattern is generic — operator in k8s, NATS
KV as desired-state bus, agents reconciling podman / OKD / KVM /
anything that can register. "Fleet" captures that neutrally; IoT
stays acknowledged in docs as the first customer use case.

Done now, while nothing is deployed. After a partner fleet lands,
changing the CRD group alone is a multi-quarter migration.

Scope (nothing left over):

Paths + crates
- iot/ → fleet/
- iot/iot-operator-v0 → fleet/harmony-fleet-operator
- iot/iot-agent-v0 → fleet/harmony-fleet-agent
- harmony/src/modules/iot → harmony/src/modules/fleet
- ROADMAP/iot_platform → ROADMAP/fleet_platform
- examples/iot_{vm_setup, load_test, nats_install} → examples/fleet_*
- -v0 suffix dropped on the operator + agent crates (semver in
  Cargo.toml already tracks version)

Rust identifiers
- enum IotScore (podman score payload) → ReconcileScore
- struct IotDeviceSetupScore/Config → FleetDeviceSetupScore/Config
- InterpretName::IotDeviceSetup → InterpretName::FleetDeviceSetup
- HarmonyIotPool → HarmonyFleetPool (libvirt pool)
- HARMONY_IOT_POOL_NAME (default "harmony-iot") → HARMONY_FLEET_POOL_NAME ("harmony-fleet")
- IotSshKeypair → FleetSshKeypair
- ensure_iot_ssh_keypair / ensure_harmony_iot_pool /
  check_iot_smoke_preflight_for_arch → fleet-prefixed variants

Wire / config surfaces
- CRD group `iot.nationtech.io` → `fleet.nationtech.io`
- Finalizer `iot.nationtech.io/finalizer` → `fleet.nationtech.io/finalizer`
- Shortnames iotdep/iotdevice → fleetdep/fleetdev
- Env var IOT_AGENT_CONFIG → FLEET_AGENT_CONFIG
- Env var IOT_VM_ADMIN_PASSWORD → FLEET_VM_ADMIN_PASSWORD
- Binary /usr/local/bin/iot-agent → /usr/local/bin/fleet-agent
- Systemd user `iot-agent` → `fleet-agent`
- VM admin user `iot-admin` → `fleet-admin`

Defaults
- Namespaces iot-system/iot-demo/iot-load → fleet-system/fleet-demo/fleet-load
- Helm release iot-nats → fleet-nats
- Helm release iot-operator-v0 → harmony-fleet-operator
- Container image localhost/iot-operator-v0:latest →
  localhost/harmony-fleet-operator:latest
- On-disk cache $HARMONY_DATA_DIR/iot/ → $HARMONY_DATA_DIR/fleet/
  (cloud-images, ssh keypairs, libvirt pool)

What stayed
- harmony-reconciler-contracts — already neutrally named
- Wire types (DeviceInfo, DeploymentState, HeartbeatPayload,
  DeploymentName) — already neutral
- KV buckets (device-info, device-state, device-heartbeat,
  desired-state) — already neutral
- CRD kind names (Deployment, Device) — already neutral
- NatsBasicScore / NatsHelmChartScore / HelmChart / etc. —
  framework-scope, unchanged

Verification
- cargo check --workspace --all-targets: clean
- All harmony lib tests (114), fleet-operator (6), fleet-agent
  (7), harmony-reconciler-contracts (13): green
- End-to-end load-test (20 devices / 3 CRs / 20s under
  fleet/scripts/load-test.sh): PASS. Image built as
  localhost/harmony-fleet-operator:latest, chart installed as
  release harmony-fleet-operator in namespace fleet-system,
  all CR aggregates correct.

Zero stragglers: grep across the tree for \biot\b / IOT_ /
\bIot[A-Z] returns empty (excluding docs explicitly talking about
IoT as the first customer's domain).
2026-04-23 11:10:10 -04:00


aarch64 VM support — plan

Why

The v0 walking skeleton's whole point is validating the IoT agent against the actual distribution, arch, and package set the end-customer's Pi 5 devices run on (ROADMAP §1). Everything green so far runs the agent against an x86_64 Ubuntu cloud image with an x86_64 Rust binary. That proves the code path works, but not that the ARM target works. Every passing smoke-a3 run today is evidence that the wrong thing works.

This plan adds arm64 emulation on x86_64 hosts (no hardware needed for CI) so:

  • the VM runs the same Ubuntu 24.04 arm64 cloud image customers will eventually flash onto a Pi;
  • the fleet-agent shipped to it is a real aarch64 binary produced by our existing cross-compile toolchain;
  • apt/systemd/podman on the VM are the actual arm64 packages; and
  • smoke-a3 exercises all of it end-to-end.

Acceptable cost: emulated boot is 5-15× slower than KVM-accelerated boot. That's the price of the target-arch validation.

Shape of the change

Additive, type-safe, default-preserving. Existing callers of VirtualMachineSpec keep working unchanged; arm64 is opt-in via a new field.

1. Architecture enum on the VM spec

Introduce VmArchitecture in harmony/src/domain/topology/virtualization.rs:

#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize, Default)]
pub enum VmArchitecture {
    #[default]
    X86_64,
    Aarch64,
}

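Add pub architecture: VmArchitecture to VirtualMachineSpec. With #[derive(Default)] on the spec and VmArchitecture::X86_64 as the enum default, every existing call site that fills the remaining fields with ..Default::default() continues to compile unchanged; a call site that lists every field explicitly would need the new field added. New constructor for clarity: VirtualMachineSpec::new_aarch64(name).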

Same treatment on VmConfig in modules/kvm/types.rs — add a pub architecture: VmArchitecture field with Default impl.
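A minimal sketch of how the default-preserving field could look. The serde derives from the plan are left out to keep the example std-only, and every VirtualMachineSpec field other than architecture (name, cpus, memory_mib) is a hypothetical stand-in, since the real field list is not shown here.

```rust
/// Guest architecture; X86_64 is the default so existing callers are unaffected.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum VmArchitecture {
    #[default]
    X86_64,
    Aarch64,
}

/// Hypothetical subset of the real spec: only `architecture` comes from the plan.
#[derive(Debug, Clone, Default)]
pub struct VirtualMachineSpec {
    pub name: String,
    pub cpus: u32,
    pub memory_mib: u64,
    pub architecture: VmArchitecture,
}

impl VirtualMachineSpec {
    /// Convenience constructor for arm64 guests; other fields keep their defaults.
    pub fn new_aarch64(name: impl Into<String>) -> Self {
        Self {
            name: name.into(),
            architecture: VmArchitecture::Aarch64,
            ..Default::default()
        }
    }
}

/// An existing-style call site: it never mentions `architecture`
/// and still compiles because of the struct-update syntax.
pub fn legacy_spec() -> VirtualMachineSpec {
    VirtualMachineSpec {
        name: "legacy".to_string(),
        cpus: 2,
        ..Default::default()
    }
}
```

The legacy_spec call site shows the property the plan relies on: code that never names the new field still resolves to X86_64.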

2. Libvirt XML parameterization

Rewrite modules/kvm/xml.rs::domain_xml to branch on arch. What changes per-arch (the QEMU flags you gave as reference map directly to libvirt XML):

| QEMU flag | libvirt XML | x86_64 | aarch64 |
|---|---|---|---|
| -accel kvm vs -accel tcg | <domain type='…'> | kvm | qemu |
| -M virt / -M q35 | <os><type machine='…'> | q35 | virt |
| arch | <os><type arch='…'> | x86_64 | aarch64 |
| emulator binary | <emulator>…</emulator> | /usr/bin/qemu-system-x86_64 | /usr/bin/qemu-system-aarch64 |
| -cpu max,pauth-impdef=on | <cpu mode='custom'><model>max</model><feature …/></cpu> | host-model (current) | max + pauth-impdef |
| -bios QEMU_EFI.fd | <os><loader readonly='yes' type='pflash'>…</loader><nvram>…</nvram></os> | n/a (BIOS) | AAVMF CODE + VARS pflash pair |
| -accel tcg,thread=multi | MTTCG is default-on when type='qemu' + QEMU ≥ 9.1 | n/a | implicit |

Type safety: introduce a DomainXmlParams struct that captures the arch-specific knobs (domain_type, arch, machine, emulator path, cpu mode, firmware) and derives from VmArchitecture. The top-level domain_xml then consumes a fully-resolved DomainXmlParams rather than branching with if arch == X86_64 strings.
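One way the resolved-params idea could be sketched: a From<VmArchitecture> impl holds the only per-arch match, and the renderer just formats fields. The Firmware enum, the field types, and render_domain_head are assumptions for illustration; only the per-arch values (kvm/qemu, q35/virt, emulator paths, host-model vs custom) come from the table above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VmArchitecture {
    X86_64,
    Aarch64,
}

/// Hypothetical firmware selector; the plan only says "firmware" is one of the knobs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Firmware {
    Bios,
    UefiPflash,
}

/// Fully resolved, arch-specific values consumed by the XML renderer.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DomainXmlParams {
    pub domain_type: &'static str,
    pub arch: &'static str,
    pub machine: &'static str,
    pub emulator: &'static str,
    pub cpu_mode: &'static str,
    pub firmware: Firmware,
}

impl From<VmArchitecture> for DomainXmlParams {
    fn from(arch: VmArchitecture) -> Self {
        // The single place where the architecture is matched on.
        match arch {
            VmArchitecture::X86_64 => Self {
                domain_type: "kvm",
                arch: "x86_64",
                machine: "q35",
                emulator: "/usr/bin/qemu-system-x86_64",
                cpu_mode: "host-model",
                firmware: Firmware::Bios,
            },
            VmArchitecture::Aarch64 => Self {
                domain_type: "qemu",
                arch: "aarch64",
                machine: "virt",
                emulator: "/usr/bin/qemu-system-aarch64",
                cpu_mode: "custom",
                firmware: Firmware::UefiPflash,
            },
        }
    }
}

/// Renders only the arch-dependent skeleton of a domain; disks, NICs, CPU and
/// loader elements are omitted from this sketch.
pub fn render_domain_head(name: &str, p: &DomainXmlParams) -> String {
    format!(
        "<domain type='{}'>\n  <name>{}</name>\n  <os>\n    <type arch='{}' machine='{}'>hvm</type>\n  </os>\n  <devices>\n    <emulator>{}</emulator>\n  </devices>\n</domain>",
        p.domain_type, name, p.arch, p.machine, p.emulator
    )
}
```

Because the renderer never sees VmArchitecture, adding a third architecture later would touch only the From impl.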

3. UEFI firmware discovery

aarch64 guests boot via UEFI, not BIOS. libvirt needs two files:

  • AAVMF_CODE.fd — the firmware code (read-only, shared)
  • AAVMF_VARS.fd — per-VM NVRAM (writable, per-domain copy)

Common paths across distros:

| Distro | CODE | VARS (template) |
|---|---|---|
| Arch | /usr/share/edk2/aarch64/QEMU_CODE.fd | /usr/share/edk2/aarch64/QEMU_VARS.fd |
| Debian/Ubuntu | /usr/share/AAVMF/AAVMF_CODE.fd | /usr/share/AAVMF/AAVMF_VARS.fd |
| Fedora | /usr/share/edk2/aarch64/QEMU_EFI-pflash.raw | /usr/share/edk2/aarch64/vars-template-pflash.raw |

New module harmony/src/modules/kvm/firmware.rs:

  • pub fn discover_aarch64_firmware() -> Result<AarchFirmware, KvmError> walks a small known-paths list and returns the first viable pair. Returns a typed AarchFirmware { code: PathBuf, vars_template: PathBuf }.
  • Per-VM NVRAM copy is handled in KvmVirtualMachineHost: at ensure_vm time, copy vars_template into $pool/<vm_name>-VARS.fd and reference it in the domain XML.
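The discovery walk and the per-VM NVRAM path could be sketched as below. This version returns Option rather than Result<AarchFirmware, KvmError> because KvmError is not shown in this plan. The discover_with and nvram_path helpers and the probe order are assumptions; the candidate paths are the ones from the distro table.

```rust
use std::path::{Path, PathBuf};

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AarchFirmware {
    pub code: PathBuf,
    pub vars_template: PathBuf,
}

/// Candidate (CODE, VARS template) pairs in probe order (order is an assumption).
pub const KNOWN_FIRMWARE_PAIRS: &[(&str, &str)] = &[
    ("/usr/share/AAVMF/AAVMF_CODE.fd", "/usr/share/AAVMF/AAVMF_VARS.fd"),
    ("/usr/share/edk2/aarch64/QEMU_CODE.fd", "/usr/share/edk2/aarch64/QEMU_VARS.fd"),
    (
        "/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw",
        "/usr/share/edk2/aarch64/vars-template-pflash.raw",
    ),
];

/// Returns the first pair for which BOTH files exist according to `exists`.
/// The predicate is injected so the walk can be tested without touching the disk.
pub fn discover_with(
    candidates: &[(&str, &str)],
    exists: impl Fn(&Path) -> bool,
) -> Option<AarchFirmware> {
    candidates.iter().find_map(|(code, vars)| {
        let (code, vars) = (Path::new(code), Path::new(vars));
        (exists(code) && exists(vars)).then(|| AarchFirmware {
            code: code.to_path_buf(),
            vars_template: vars.to_path_buf(),
        })
    })
}

/// Real-filesystem variant of the walk.
pub fn discover_aarch64_firmware() -> Option<AarchFirmware> {
    discover_with(KNOWN_FIRMWARE_PAIRS, |p| p.is_file())
}

/// Destination of the writable per-VM NVRAM copy: $pool/<vm_name>-VARS.fd.
pub fn nvram_path(pool_dir: &Path, vm_name: &str) -> PathBuf {
    pool_dir.join(format!("{vm_name}-VARS.fd"))
}
```

At ensure_vm time the caller would then copy vars_template to nvram_path(pool, name), for example with std::fs::copy, before rendering the domain XML.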

4. Cloud image for arm64

Add to modules/fleet/assets.rs:

pub const UBUNTU_2404_CLOUDIMG_ARM64_URL: &str =
    "https://cloud-images.ubuntu.com/releases/24.04/release/ubuntu-24.04-server-cloudimg-arm64.img";
pub const UBUNTU_2404_CLOUDIMG_ARM64_SHA256: &str = "<pinned>";

pub async fn ensure_ubuntu_2404_cloud_image_for_arch(
    arch: VmArchitecture,
) -> Result<PathBuf, ExecutorError>;

The existing ensure_ubuntu_2404_cloud_image() becomes a thin wrapper that calls the arch-aware fn with X86_64, preserving all callers. SHA256 gets pinned against the live Ubuntu arm64 image at commit time.
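The thin-wrapper pattern can be illustrated without the download logic. The sketch below is synchronous and only resolves the URL and cache path per arch; the amd64 URL constant, the function names, and the cache layout are assumptions (the plan quotes only the arm64 URL).

```rust
use std::path::{Path, PathBuf};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VmArchitecture {
    X86_64,
    Aarch64,
}

pub const UBUNTU_2404_CLOUDIMG_ARM64_URL: &str =
    "https://cloud-images.ubuntu.com/releases/24.04/release/ubuntu-24.04-server-cloudimg-arm64.img";
// Assumption: the amd64 image follows the same naming scheme as the arm64 one.
pub const UBUNTU_2404_CLOUDIMG_AMD64_URL: &str =
    "https://cloud-images.ubuntu.com/releases/24.04/release/ubuntu-24.04-server-cloudimg-amd64.img";

/// Picks the image URL for the requested guest architecture.
pub fn cloud_image_url(arch: VmArchitecture) -> &'static str {
    match arch {
        VmArchitecture::X86_64 => UBUNTU_2404_CLOUDIMG_AMD64_URL,
        VmArchitecture::Aarch64 => UBUNTU_2404_CLOUDIMG_ARM64_URL,
    }
}

/// Arch-aware variant: the cache file name is the last URL segment,
/// so the two architectures never overwrite each other's image.
pub fn cached_image_path_for_arch(cache_dir: &Path, arch: VmArchitecture) -> PathBuf {
    let url = cloud_image_url(arch);
    let file_name = url.rsplit('/').next().unwrap_or(url);
    cache_dir.join(file_name)
}

/// Pre-existing entry point kept as a thin wrapper, so current callers
/// keep their signature and still get the x86_64 image.
pub fn cached_image_path(cache_dir: &Path) -> PathBuf {
    cached_image_path_for_arch(cache_dir, VmArchitecture::X86_64)
}
```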

5. Preflight additions

In modules/fleet/preflight.rs, when the caller asks for arm64 VMs (new check_fleet_smoke_preflight_for_arch(VmArchitecture) wrapper):

  • verify qemu-system-aarch64 is on PATH;
  • verify the aarch64 firmware pair exists (reuse the discovery fn);
  • verify QEMU version ≥ 9.1 (MTTCG is a real perf multiplier — a warning, not a hard block, if the host is older).
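The version check, with its warn-rather-than-fail behaviour, might look like the sketch below. It assumes the first line of qemu-system-aarch64 --version has the usual "QEMU emulator version X.Y.Z" shape. VersionCheck, parse_qemu_version, and check_mttcg_version are hypothetical names, and running the binary itself is left out.

```rust
/// Outcome of the version gate: an old or unparsable QEMU is a warning, never a hard block.
#[derive(Debug, PartialEq, Eq)]
pub enum VersionCheck {
    Ok,
    Warn(String),
}

/// Extracts (major, minor) from text like "QEMU emulator version 9.1.0 (...)".
pub fn parse_qemu_version(output: &str) -> Option<(u32, u32)> {
    let first_line = output.lines().next()?;
    let after_marker = first_line.split("version ").nth(1)?;
    let token = after_marker.split_whitespace().next()?;
    let mut parts = token.split('.');
    let major = parts.next()?.parse::<u32>().ok()?;
    let minor = parts.next()?.parse::<u32>().ok()?;
    Some((major, minor))
}

/// Applies the plan's threshold: QEMU >= 9.1 is fine, anything else only warns.
pub fn check_mttcg_version(output: &str) -> VersionCheck {
    match parse_qemu_version(output) {
        // Tuples compare lexicographically, so (10, 0) >= (9, 1) holds.
        Some(v) if v >= (9, 1) => VersionCheck::Ok,
        Some((major, minor)) => VersionCheck::Warn(format!(
            "QEMU {major}.{minor} is older than 9.1; emulated boot may be slower"
        )),
        None => VersionCheck::Warn("could not parse QEMU version output".to_string()),
    }
}
```

Keeping the parser separate from the policy means the hard checks (binary on PATH, firmware pair present) can fail the preflight while this one only logs.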

6. Cross-compiled agent

smoke-a3.sh phase 2 currently does a native cargo build --release -p harmony-fleet-agent. When arch=aarch64:

  • cargo build --release --target aarch64-unknown-linux-gnu -p harmony-fleet-agent
  • AGENT_BINARY points at target/aarch64-unknown-linux-gnu/release/harmony-fleet-agent

Opt-in via a --arch aarch64 CLI flag on both example_fleet_vm_setup and smoke-a3.sh. Default stays x86_64.
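The arch-to-build mapping is small enough to sketch as a pure function. The package name is passed in rather than hard-coded, and the sketch assumes the produced binary has the same name as the package (true unless Cargo.toml overrides the bin name). AgentBuild and agent_build are hypothetical names.

```rust
use std::path::PathBuf;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VmArchitecture {
    X86_64,
    Aarch64,
}

/// What the smoke script needs to know: how to build, and where the artifact lands.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AgentBuild {
    pub cargo_args: Vec<String>,
    pub binary: PathBuf,
}

/// Resolves cargo arguments and the expected AGENT_BINARY path for one architecture.
pub fn agent_build(arch: VmArchitecture, package: &str) -> AgentBuild {
    let mut cargo_args = vec!["build".to_string(), "--release".to_string()];
    let mut binary = PathBuf::from("target");
    if arch == VmArchitecture::Aarch64 {
        // Cross builds land under target/<triple>/release instead of target/release.
        let triple = "aarch64-unknown-linux-gnu";
        cargo_args.push("--target".to_string());
        cargo_args.push(triple.to_string());
        binary.push(triple);
    }
    cargo_args.push("-p".to_string());
    cargo_args.push(package.to_string());
    binary.push("release");
    binary.push(package);
    AgentBuild { cargo_args, binary }
}
```

The x86_64 branch adds nothing to the native invocation, which matches the default-preserving intent of the flag.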

7. Timeout bumps

First-boot cloud-init on emulated aarch64 takes 3-6× longer than KVM-accel x86_64. Bump wait_for_ip timeout from 300s → 900s when arch=aarch64. Smoke-a3's phase 5 reboot gate also lengthens.
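A sketch of how the arch-dependent timeout could feed a polling wait. The 300s and 900s values come from the paragraph above; poll_until and its signature are assumptions, not the existing wait_for_ip implementation.

```rust
use std::thread;
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VmArchitecture {
    X86_64,
    Aarch64,
}

/// Timeout budget for the first-boot IP wait, per guest architecture.
pub fn wait_for_ip_timeout(arch: VmArchitecture) -> Duration {
    match arch {
        VmArchitecture::X86_64 => Duration::from_secs(300),
        // Emulated (TCG) boot: give cloud-init three times the budget.
        VmArchitecture::Aarch64 => Duration::from_secs(900),
    }
}

/// Calls `probe` until it yields a value or the timeout elapses.
/// The probe always runs at least once, even with a zero timeout.
pub fn poll_until<T>(
    timeout: Duration,
    interval: Duration,
    mut probe: impl FnMut() -> Option<T>,
) -> Option<T> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Some(value) = probe() {
            return Some(value);
        }
        if Instant::now() >= deadline {
            return None;
        }
        thread::sleep(interval);
    }
}
```

A caller would pass wait_for_ip_timeout(arch) as the first argument, so the only arch-specific decision stays in one function.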

Files to touch

| File | Change |
|---|---|
| harmony/src/domain/topology/virtualization.rs | Add VmArchitecture, field on VirtualMachineSpec, constructor helper. |
| harmony/src/modules/kvm/types.rs | Add architecture field on VmConfig, VmConfigBuilder setter. |
| harmony/src/modules/kvm/xml.rs | Rewrite domain_xml to take DomainXmlParams resolved from arch. |
| harmony/src/modules/kvm/firmware.rs (new) | Discovery of AAVMF code+vars paths; AarchFirmware struct. |
| harmony/src/modules/kvm/topology.rs | Copy per-VM NVRAM template on ensure_vm; thread arch through to XML. |
| harmony/src/modules/fleet/assets.rs | ensure_ubuntu_2404_cloud_image_for_arch(arch); pin arm64 URL+sha256. |
| harmony/src/modules/fleet/preflight.rs | Arch-aware preflight; qemu-system-aarch64 + firmware + qemu-version. |
| examples/fleet_vm_setup/src/main.rs | `--arch x86_64 |
| fleet/scripts/smoke-a3.sh | Arch flag plumbing; cross-compile; extended timeouts; preflight. |
| fleet/scripts/smoke-a3-arm.sh (new) | Dedicated arm smoke as the CI hook: ARCH=aarch64 ./smoke-a3.sh. |

Out of scope

  • Migrating OPNsense + other KVM examples to VirtualMachineHost / ProvisionVmScore — real inconsistency in the codebase but a separate refactor, orthogonal to the ARM work. Filing as follow-up.
  • KVM-accelerated aarch64-on-aarch64 (e.g. running on an ampere runner). Emulation covers the x86 CI story; native aarch64 runners would use <domain type='kvm'> and no MTTCG flags, which the arch enum + existing x86_64 XML path already model — so this is effectively free when we get there.
  • Supporting multiple simultaneous guest arches on one host in the same smoke run. Single-arch-per-run keeps everything simple.
  • Pinning AAVMF firmware like we pin the cloud image. Firmware is distro-package-managed; pin when we hit a regression.

Commit plan (in order)

  1. VmArchitecture domain type + VirtualMachineSpec.architecture field: tiny, just the enum and struct field; no behaviour change (all callers get X86_64 via Default).

  2. XML parameterization via DomainXmlParams: rewrite domain_xml to be arch-driven. Tests under harmony/src/modules/kvm/xml.rs get an arm64 variant.

  3. AAVMF firmware discovery + per-VM NVRAM copy: firmware.rs + the copy in topology.rs::ensure_vm.

  4. arm64 cloud image asset + preflight: ensure_ubuntu_2404_cloud_image_for_arch(arch) plus preflight extensions. SHA256 pinned at commit time via a one-off curl | sha256sum.

  5. Example + smoke script plumbing: --arch flag, cross-compile, timeout bumps, smoke-a3-arm.sh wrapper.

  6. End-to-end verification: run smoke-a3-arm.sh from a fresh $HARMONY_DATA_DIR/fleet/ and confirm the aarch64 agent boots, joins NATS, and survives a power-cycle. Document timing in the commit message.

Verification

  • cargo check --all-targets --features kvm: clean.
  • cargo clippy --no-deps -- -D warnings on touched files: clean.
  • cargo fmt --check: clean.
  • aarch64 cross-compile of harmony + fleet crates: still green.
  • Fresh-cache arm64 smoke-a3: PASS, timing documented.
  • Existing x86_64 smoke-a3: still PASS (regression guard).