harmony/fleet/scripts/smoke-a3.sh
Jean-Gabriel Gill-Couture 7c1fedb303
refactor: rebrand iot → fleet, operator/agent crates → harmony-fleet-*
The IoT vocabulary was anchoring the codebase to one customer's
domain. The reconciler pattern is generic — operator in k8s, NATS
KV as desired-state bus, agents reconciling podman / OKD / KVM /
anything that can register. "Fleet" captures that neutrally; IoT
stays acknowledged in docs as the first customer use case.

Done now, while nothing is deployed. After a partner fleet lands,
changing the CRD group alone is a multi-quarter migration.

Scope (nothing left over):

Paths + crates
- iot/ → fleet/
- iot/iot-operator-v0 → fleet/harmony-fleet-operator
- iot/iot-agent-v0 → fleet/harmony-fleet-agent
- harmony/src/modules/iot → harmony/src/modules/fleet
- ROADMAP/iot_platform → ROADMAP/fleet_platform
- examples/iot_{vm_setup, load_test, nats_install} → examples/fleet_*
- -v0 suffix dropped on the operator + agent crates (semver in
  Cargo.toml already tracks version)

Rust identifiers
- enum IotScore (podman score payload) → ReconcileScore
- struct IotDeviceSetupScore/Config → FleetDeviceSetupScore/Config
- InterpretName::IotDeviceSetup → InterpretName::FleetDeviceSetup
- HarmonyIotPool → HarmonyFleetPool (libvirt pool)
- HARMONY_IOT_POOL_NAME (default "harmony-iot") → HARMONY_FLEET_POOL_NAME ("harmony-fleet")
- IotSshKeypair → FleetSshKeypair
- ensure_iot_ssh_keypair / ensure_harmony_iot_pool /
  check_iot_smoke_preflight_for_arch → fleet-prefixed variants

Wire / config surfaces
- CRD group `iot.nationtech.io` → `fleet.nationtech.io`
- Finalizer `iot.nationtech.io/finalizer` → `fleet.nationtech.io/finalizer`
- Shortnames iotdep/iotdevice → fleetdep/fleetdev
- Env var IOT_AGENT_CONFIG → FLEET_AGENT_CONFIG
- Env var IOT_VM_ADMIN_PASSWORD → FLEET_VM_ADMIN_PASSWORD
- Binary /usr/local/bin/iot-agent → /usr/local/bin/fleet-agent
- Systemd user `iot-agent` → `fleet-agent`
- VM admin user `iot-admin` → `fleet-admin`

Defaults
- Namespaces iot-system/iot-demo/iot-load → fleet-system/fleet-demo/fleet-load
- Helm release iot-nats → fleet-nats
- Helm release iot-operator-v0 → harmony-fleet-operator
- Container image localhost/iot-operator-v0:latest →
  localhost/harmony-fleet-operator:latest
- On-disk cache $HARMONY_DATA_DIR/iot/ → $HARMONY_DATA_DIR/fleet/
  (cloud-images, ssh keypairs, libvirt pool)

What stayed
- harmony-reconciler-contracts — already neutrally named
- Wire types (DeviceInfo, DeploymentState, HeartbeatPayload,
  DeploymentName) — already neutral
- KV buckets (device-info, device-state, device-heartbeat,
  desired-state) — already neutral
- CRD kind names (Deployment, Device) — already neutral
- NatsBasicScore / NatsHelmChartScore / HelmChart / etc. —
  framework-scope, unchanged

Verification
- cargo check --workspace --all-targets: clean
- All harmony lib tests (114), fleet-operator (6), fleet-agent
  (7), harmony-reconciler-contracts (13): green
- End-to-end load-test (20 devices / 3 CRs / 20s under
  fleet/scripts/load-test.sh): PASS. Image built as
  localhost/harmony-fleet-operator:latest, chart installed as
  release harmony-fleet-operator in namespace fleet-system,
  all CR aggregates correct.

Zero stragglers: grep across the tree for \biot\b / IOT_ /
\bIot[A-Z] returns empty (excluding docs explicitly talking about
IoT as the first customer's domain).
2026-04-23 11:10:10 -04:00
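
The "zero stragglers" check in the commit message above could be scripted roughly as follows. This is a sketch, not the command the author ran: the helper name `find_iot_stragglers` and the exclusions (`.git`, `target`, and Markdown docs that may discuss IoT on purpose) are assumptions; only the three patterns come from the commit message. It needs GNU grep for `-P`.

```bash
# Sketch of a straggler scan. Hypothetical helper; the exclusions below are
# assumptions, only the regex alternatives are quoted from the commit message.
find_iot_stragglers() {
    local root="${1:-.}" rc=0
    # -r recurse, -n line numbers, -I skip binary files, -P Perl regex (\b).
    # The three patterns are joined into one alternation.
    grep -rnIP --exclude-dir=.git --exclude-dir=target --exclude='*.md' \
        '\biot\b|IOT_|\bIot[A-Z]' "$root" || rc=$?
    # grep exits 1 for "no match", which is the desired outcome here;
    # only exit codes above 1 are real errors and get propagated.
    (( rc <= 1 )) || return "$rc"
}
```

The function prints any hits and returns 0 whether or not there are hits, so a caller would test the output, for example `[[ -z "$(find_iot_stragglers .)" ]] && echo "zero stragglers"`.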


#!/usr/bin/env bash
# End-to-end smoke test for the VM-as-device flow.
#
#   [libvirt qemu:///system] ──KvmVmScore──▶ VM (Ubuntu 24.04, cloud-init'd)
#                                            │
#                           ssh+Ansible ◀────┘
#                                │
#                                ▼
#                      FleetDeviceSetupScore ──▶ podman + fleet-agent on VM
#                                                         │
#                                                         ▼
#                         existing operator ──NATS────────┘ (agent joins fleet, reconciles CR)
#                                                         │
#                                                         ▼ [phase 5]
#                                       virsh destroy + start → agent reconnects
#
# Prerequisites on the runner host — all one-time, all generic:
#   1. libvirt + qemu + xorriso + python3 + podman + cargo + kubectl
#      (Arch:          pacman -S libvirt qemu-full libisoburn python podman
#       Debian/Ubuntu: apt install libvirt-daemon-system qemu-kvm
#                      xorriso python3 python3-venv podman)
#   2. Be in the `libvirt` group (`sudo usermod -aG libvirt $USER`)
#   3. `sudo virsh net-start default && sudo virsh net-autostart default`
#
# Harmony handles *everything else*: cloud image download, SSH key
# generation, libvirt pool creation, ansible install, agent build.
# First run costs ~2 min to populate caches; subsequent runs hit the
# cache in <1 s.
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
VM_NAME="${VM_NAME:-fleet-smoke-vm}"
DEVICE_ID="${DEVICE_ID:-$VM_NAME}"
GROUP="${GROUP:-group-a}"
LIBVIRT_URI="${LIBVIRT_URI:-qemu:///system}"
# Guest architecture. `x86-64` runs native KVM; `aarch64` runs under
# qemu-system-aarch64 TCG on x86 hosts (3-5× slower but exercises the
# real Pi target). Changes: cloud-image URL, qemu binary, agent build
# target, phase 4 timeout.
ARCH="${ARCH:-x86-64}"
NATS_CONTAINER="${NATS_CONTAINER:-fleet-smoke-nats-a3}"
NATS_NET_NAME="${NATS_NET_NAME:-fleet-smoke-net-a3}"
NATS_IMAGE="${NATS_IMAGE:-docker.io/library/nats:2.10-alpine}"
NATS_PORT="${NATS_PORT:-4222}"
log() { printf '\033[1;34m[smoke-a3]\033[0m %s\n' "$*"; }
fail() { printf '\033[1;31m[smoke-a3 FAIL]\033[0m %s\n' "$*" >&2; exit 1; }
case "$ARCH" in
    x86-64|x86_64) EXAMPLE_ARCH=x86-64; AGENT_TARGET= ;;
    aarch64|arm64) EXAMPLE_ARCH=aarch64; AGENT_TARGET=aarch64-unknown-linux-gnu ;;
    *) fail "unsupported ARCH=$ARCH (expected: x86-64 | aarch64)" ;;
esac
cleanup() {
    local rc=$?
    log "cleanup…"
    if [[ "${KEEP:-0}" != "1" ]]; then
        virsh --connect "$LIBVIRT_URI" destroy "$VM_NAME" 2>/dev/null || true
        # `--nvram` is required for aarch64 domains (which have a
        # per-VM NVRAM file); harmless on x86_64 where no NVRAM is
        # registered. Without it, `undefine` refuses and the next
        # run sees a stale domain with whatever XML the previous
        # run defined — masking XML changes until manually cleaned.
        virsh --connect "$LIBVIRT_URI" undefine --nvram \
            --remove-all-storage "$VM_NAME" 2>/dev/null || true
        podman rm -f "$NATS_CONTAINER" >/dev/null 2>&1 || true
        podman network rm "$NATS_NET_NAME" >/dev/null 2>&1 || true
    else
        log "KEEP=1 — leaving VM '$VM_NAME' and NATS '$NATS_CONTAINER' running"
    fi
    exit $rc
}
trap cleanup EXIT INT TERM
require() { command -v "$1" >/dev/null 2>&1 || fail "missing required tool: $1"; }
require podman
require cargo
require virsh
# ---------------------------- phase 1: NATS ----------------------------
log "phase 1: start NATS container on host"
podman network exists "$NATS_NET_NAME" || podman network create "$NATS_NET_NAME" >/dev/null
podman rm -f "$NATS_CONTAINER" >/dev/null 2>&1 || true
podman run -d \
    --name "$NATS_CONTAINER" \
    --network "$NATS_NET_NAME" \
    -p "$NATS_PORT:4222" \
    "$NATS_IMAGE" -js >/dev/null
NAT_GW="$(virsh --connect "$LIBVIRT_URI" net-dumpxml default \
    | grep -oP "ip address='\K[^']+" | head -1)"
[[ -n "$NAT_GW" ]] || fail "couldn't determine libvirt 'default' gateway IP"
log "libvirt network gateway = $NAT_GW (VM will dial NATS at nats://$NAT_GW:$NATS_PORT)"
# ---------------------------- phase 2: build ---------------------------
log "phase 2: build harmony-fleet-agent for guest arch=$ARCH (release — debug binary fills cloud rootfs)"
(
    cd "$REPO_ROOT"
    if [[ -n "$AGENT_TARGET" ]]; then
        rustup target add "$AGENT_TARGET" >/dev/null
        cargo build -q --release --target "$AGENT_TARGET" -p harmony-fleet-agent
    else
        cargo build -q --release -p harmony-fleet-agent
    fi
)
if [[ -n "$AGENT_TARGET" ]]; then
    AGENT_BINARY="$REPO_ROOT/target/$AGENT_TARGET/release/harmony-fleet-agent"
else
    AGENT_BINARY="$REPO_ROOT/target/release/harmony-fleet-agent"
fi
[[ -f "$AGENT_BINARY" ]] || fail "agent binary missing after build: $AGENT_BINARY"
# ---------------------------- phase 3: bootstrap + provision + setup ----------------------------
log "phase 3: bootstrap assets + provision VM + onboard device (arch=$EXAMPLE_ARCH)"
(
    cd "$REPO_ROOT"
    cargo run -q --release -p example_fleet_vm_setup -- \
        --arch "$EXAMPLE_ARCH" \
        --vm-name "$VM_NAME" \
        --device-id "$DEVICE_ID" \
        --group "$GROUP" \
        --agent-binary "$AGENT_BINARY" \
        --nats-url "nats://$NAT_GW:$NATS_PORT"
)
# ---------------------------- phase 4: initial status ----------------------------
# TCG emulation slows agent boot + first NATS publish significantly.
# 60s is fine for native KVM but too tight for aarch64-on-x86.
case "$ARCH" in
    aarch64|arm64) STATUS_TIMEOUT=300 ;;
    *) STATUS_TIMEOUT=60 ;;
esac
log "phase 4: wait for agent to report heartbeat to NATS (timeout=${STATUS_TIMEOUT}s)"
wait_for_status() {
    local timeout=$1
    for _ in $(seq 1 "$timeout"); do
        if podman run --rm --network "$NATS_NET_NAME" \
            docker.io/natsio/nats-box:latest \
            nats --server "nats://$NATS_CONTAINER:4222" kv get device-heartbeat \
            "heartbeat.$DEVICE_ID" --raw >/dev/null 2>&1; then
            return 0
        fi
        sleep 1
    done
    return 1
}
wait_for_status "$STATUS_TIMEOUT" || fail "device-heartbeat never appeared for $DEVICE_ID"
log "agent heartbeat present on NATS"
# ---------------------------- phase 5: hard power-cycle, expect recovery ----------------------------
log "phase 5: power-cycle VM (virsh destroy + start) → agent must reconnect to NATS"
nats_status_timestamp() {
    # Prints the "at" field of the heartbeat.<device> entry, or "".
    # Never errors (for `set -e` safety).
    podman run --rm --network "$NATS_NET_NAME" \
        docker.io/natsio/nats-box:latest \
        nats --server "nats://$NATS_CONTAINER:4222" kv get device-heartbeat \
        "heartbeat.$DEVICE_ID" --raw 2>/dev/null \
        | grep -oE '"at":"[^"]+"' \
        | head -1 | cut -d'"' -f4 || true
}
virsh --connect "$LIBVIRT_URI" destroy "$VM_NAME" >/dev/null
# `virsh destroy` returns before the qemu process is fully torn down;
# wait a couple seconds to be sure the agent is dead and can't flush a
# final status update after our gate.
sleep 3
REBOOT_GATE="$(date -u +%Y-%m-%dT%H:%M:%S+00:00)"
log "reboot gate = $REBOOT_GATE (any agent timestamp > this is post-reboot)"
virsh --connect "$LIBVIRT_URI" start "$VM_NAME" >/dev/null
case "$ARCH" in
    aarch64|arm64) REBOOT_STEPS=900 ;; # ~30 min under TCG
    *) REBOOT_STEPS=120 ;;             # ~4 min on native KVM
esac
log "waiting for agent to re-report status (post-reboot, up to $((REBOOT_STEPS*2))s)…"
TS_AFTER=""
for _ in $(seq 1 "$REBOOT_STEPS"); do
    sleep 2
    ts="$(nats_status_timestamp)"
    # ISO-8601 timestamps compare correctly lexicographically when the
    # format is identical. Both the agent and the `date -u` format
    # string used for REBOOT_GATE produce RFC 3339 UTC strings with a
    # +00:00 offset, so string > works.
    if [[ -n "$ts" && "$ts" > "$REBOOT_GATE" ]]; then
        TS_AFTER="$ts"
        break
    fi
done
if [[ -z "$TS_AFTER" ]]; then
    fail "agent did not write a post-reboot status within ~$((REBOOT_STEPS*2))s (gate: $REBOOT_GATE)"
fi
log "post-reboot status seen at $TS_AFTER"
log "PASS — VM $VM_NAME power-cycled and re-onboarded (group=$GROUP)"
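
The phase 5 gate relies on same-format RFC 3339 UTC strings sorting lexicographically in time order. The comparison can be sketched in isolation as below. The helper name `is_after_gate` and the sample timestamps are illustrative and not part of the script, and the sketch assumes whole-second UTC timestamps with a `+00:00` offset on both sides, as the script's comment states for the agent.

```bash
# Returns 0 when timestamp $1 is strictly later than gate $2.
# Only meaningful if both strings share one fixed-width UTC format.
# Hypothetical helper for illustration; the script inlines this test.
is_after_gate() {
    local ts="$1" gate="$2"
    [[ -n "$ts" && "$ts" > "$gate" ]]
}

gate="2026-04-23T15:10:10+00:00"
# A heartbeat written one second before the gate is treated as pre-reboot.
is_after_gate "2026-04-23T15:10:09+00:00" "$gate" || echo "stale heartbeat"
# A later heartbeat passes the gate.
is_after_gate "2026-04-23T15:12:41+00:00" "$gate" && echo "fresh heartbeat"
# An empty read (key missing, NATS unreachable) never passes.
is_after_gate "" "$gate" || echo "no heartbeat yet"
```

If the agent ever emitted a different suffix such as `Z`, a same-second comparison would be decided by the suffix characters rather than by time, so the gate and the agent format need to stay in step.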