Epic: Fully Orchestrate OKD Installation with Automated Node Discovery and Network Provisioning #113

Open
opened 2025-08-19 00:57:37 +00:00 by johnride · 0 comments

Summary

This epic covers the work required to create a fully automated, end-to-end orchestration flow for deploying an OKD 4.19 cluster using Harmony. The primary goal is to eliminate manual, error-prone steps such as hardware inventory collection and PXE configuration. The process will leverage a two-phase approach: an initial discovery phase for unknown hardware and a provisioning phase that configures the node and network for its specific role in the cluster.

This implementation will solve the critical challenge of PXE booting nodes on switch ports that are destined to be part of an LACP bond.


Background & Problem Statement

The current process for setting up a new OKD cluster involves several manual and time-consuming tasks:

  1. Manual Inventory: Each server must be booted with a keyboard and display to manually collect hardware details (MAC addresses, NIC names, disk info, etc.). This is slow and prone to transcription errors.
  2. PXE/LACP Conflict: Pre-configuring switch ports for an LACP bond prevents the server's UEFI/BIOS from completing a DHCP request, which is a prerequisite for PXE booting. The firmware's PXE client uses a single NIC and does not negotiate LACP, so the switch typically will not pass its traffic on a port that expects a bond partner. This creates a chicken-and-egg problem: the host cannot get the boot instructions needed to configure the very bond that would allow it to communicate.
  3. Manual Configuration: iPXE files and Ignition configurations have to be managed manually for each node role.

This epic aims to solve these problems by making the entire process a declarative, orchestrated workflow within Harmony.


Proposed Architecture

The solution is a two-phase process orchestrated by a new OKDInstallationScore.

Phase 1: Automated Node Discovery (Access Mode)

  1. Network State: All relevant switch ports are configured as simple access ports. They are not in a bond. This guarantees that any device can successfully perform a DHCP request and PXE boot.
  2. Default PXE Boot: A server with an unknown MAC address boots and receives the default.ipxe configuration from the PXE server.
  3. Live Inventory Environment: The iPXE script boots a standard CentOS Stream 9 live environment entirely in RAM. The boot is automated by a Kickstart (inventory.ks) file.
  4. Inventory Agent: The Kickstart file's %post script downloads and starts a purpose-built, statically-compiled Rust binary: the harmony-inventory-agent. This agent collects detailed hardware information and exposes it via a JSON HTTP endpoint.
  5. Harmony Discovery: The Harmony control plane scans the provisioning network, discovers the agent's endpoint, scrapes the hardware inventory, and displays the new, unassigned node in the Harmony TUI.
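
The network scan in step 5 could start from something like the following sketch. It uses only the standard library, and several details are assumptions rather than part of this epic: the agent port (`AGENT_PORT`), the helper names `host_addresses` and `probe`, and the idea that a successful TCP connection is enough to mark a candidate before fetching its JSON inventory.

```rust
use std::net::{Ipv4Addr, SocketAddr, TcpStream};
use std::time::Duration;

/// Hypothetical port for the inventory agent; the epic does not specify one.
const AGENT_PORT: u16 = 8080;

/// Enumerate the usable host addresses of an IPv4 network (base address + prefix length).
/// Network and broadcast addresses are skipped for prefixes shorter than /31.
fn host_addresses(base: Ipv4Addr, prefix: u8) -> Vec<Ipv4Addr> {
    // Refuse very large networks so a typo cannot trigger a huge scan.
    assert!((16..=32).contains(&prefix), "prefix must be between 16 and 32");
    let mask: u32 = u32::MAX << (32 - u32::from(prefix));
    let network = u32::from(base) & mask;
    let broadcast = network | !mask;
    if prefix >= 31 {
        return (network..=broadcast).map(Ipv4Addr::from).collect();
    }
    ((network + 1)..broadcast).map(Ipv4Addr::from).collect()
}

/// Return the candidates that accept a TCP connection on `port` within `timeout`.
/// This probes sequentially; a real scan would likely run probes concurrently.
fn probe(candidates: &[Ipv4Addr], port: u16, timeout: Duration) -> Vec<Ipv4Addr> {
    candidates
        .iter()
        .copied()
        .filter(|ip| TcpStream::connect_timeout(&SocketAddr::from((*ip, port)), timeout).is_ok())
        .collect()
}
```

A caller would pass the provisioning subnet, for example `probe(&host_addresses(Ipv4Addr::new(192, 168, 33, 0), 24), AGENT_PORT, Duration::from_millis(200))`, and then request the inventory from each address that answered.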

Phase 2: Role-Based OKD Provisioning (LACP Mode)

  1. User Action: The user assigns a role (e.g., bootstrap, master-0, worker-0) to the discovered node in the Harmony inventory.
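  2. Bootstrap the cluster: Harmony creates the DHCP entry (for hostname resolution), the PXE files for the MAC address of the chosen host, and the load balancer entry for the bootstrap node, then reboots the node to start the installation.
     • Harmony generates a MAC-specific iPXE file ({MAC}.ipxe) that points to the SCOS (CentOS Stream CoreOS) kernel and an Ignition file tailored for the assigned role.
  3. Initialize the control plane: Once the bootstrap process is done, Harmony generates PXE and DHCP entries for the MAC addresses of the control plane nodes from the inventory and reboots them.
     • For each host from this point on (control plane, workers, etc.), once it has booted, the bond must be configured on both the host and the switch for performance, redundancy, and stability. This should be done through the OKD-recommended method; see the official docs on network customization of cluster nodes: https://docs.okd.io/latest/installing/installing_bare_metal/ipi/ipi-install-installation-workflow.html#configuring-host-dual-network-interfaces-in-the-install-config-yaml-file_ipi-install-installation-workflow
  4. Delete the bootstrap node: Once the control plane is up and healthy, Harmony removes the bootstrap node from the configuration and wipes it so it can be reused.
  5. Join the worker nodes: With the cluster up and running, add the workers and taint the control plane nodes so they don't run workloads.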

This architecture treats the network configuration as code and ensures that the host and network are always in a compatible state at every stage of the boot process.
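
As a small illustration of the MAC-specific file mentioned in Phase 2, the sketch below derives a `{MAC}.ipxe` file name from a MAC address. The naming convention is an assumption: lowercase hex octets joined by hyphens, which should correspond to what iPXE's `${mac:hexhyp}` expansion produces, so the default script could chain to the per-host file. The function name `ipxe_file_name` is hypothetical.

```rust
/// Turn a MAC address written with ':' or '-' separators into the name of its
/// per-host iPXE script. Assumed convention: lowercase, hyphen-separated octets.
fn ipxe_file_name(mac: &str) -> Result<String, String> {
    let octets: Vec<&str> = mac.split(|c| c == ':' || c == '-').collect();
    if octets.len() != 6 {
        return Err(format!("expected 6 octets, got {}", octets.len()));
    }
    let mut parts = Vec::with_capacity(6);
    for octet in octets {
        // Each octet must be exactly two hexadecimal digits.
        if octet.len() != 2 || !octet.chars().all(|c| c.is_ascii_hexdigit()) {
            return Err(format!("invalid octet {octet:?}"));
        }
        parts.push(octet.to_ascii_lowercase());
    }
    Ok(format!("{}.ipxe", parts.join("-")))
}
```

Normalizing in one place means the inventory, the DHCP entries, and the PXE file names cannot drift apart because of letter case or separator differences.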


To-Do List

1. Initial Setup & Prerequisites

  • Download CentOS Stream 9 boot.iso for the inventory environment.
  • Extract vmlinuz and initrd.img from the ISO and place them on the Caddy HTTP server under /os/centos-stream-9/.
  • Create a local mirror of the CentOS Stream 9 repository or ensure the inventory environment has internet access.
  • Download the required SCOS 4.19 release assets (rhcos-live-kernel-x86_64, rhcos-live-initramfs.x86_64.img, etc.) and place them on the Caddy server under /os/scos-4.19/.

2. Discovery Phase Implementation

  • Create a new Rust crate within the Harmony workspace named harmony-inventory-agent. (done)
  • Implement hardware detection logic in the agent to gather:
    • CPU Info (model, sockets, cores, threads)
    • Memory Info (total, DIMM details)
    • Storage Info (disks, models, sizes, types)
    • Network Info (interface names, MAC addresses, link speeds)
  • Add a lightweight HTTP server (e.g., axum or actix-web) to the agent to serve the collected inventory as a JSON object on /. (done)
  • Configure the agent's Cargo.toml for a static x86_64-unknown-linux-musl build.
  • Create the iPXE template for the default inventory boot (default.ipxe.tpl).
  • Create the Kickstart template (inventory.ks.tpl) that enables SSH and downloads/runs the harmony-inventory-agent binary.
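
For the network part of the hardware detection, one possible starting point is reading the kernel's sysfs tree. The sketch below is not the agent's actual implementation: the `Nic` struct and `read_nics` function are hypothetical, and it assumes the standard Linux layout where each interface directory under `/sys/class/net` contains `address` and `speed` files. The root path is a parameter so the logic can be exercised against a temporary directory.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// One network interface as reported by sysfs (hypothetical shape).
#[derive(Debug, PartialEq)]
struct Nic {
    name: String,
    mac: String,
    /// Link speed in Mb/s; None when the kernel reports no usable value (e.g. link down).
    speed_mbps: Option<u32>,
}

/// Collect interfaces from a sysfs-style directory such as /sys/class/net.
/// The loopback interface and entries without an `address` file are skipped.
fn read_nics(sys_class_net: &Path) -> io::Result<Vec<Nic>> {
    let mut nics = Vec::new();
    for entry in fs::read_dir(sys_class_net)? {
        let entry = entry?;
        let name = entry.file_name().to_string_lossy().into_owned();
        if name == "lo" {
            continue;
        }
        let dir = entry.path();
        let mac = match fs::read_to_string(dir.join("address")) {
            Ok(s) => s.trim().to_lowercase(),
            Err(_) => continue,
        };
        // `speed` may be unreadable or negative when there is no link.
        let speed_mbps = fs::read_to_string(dir.join("speed"))
            .ok()
            .and_then(|s| s.trim().parse::<i64>().ok())
            .filter(|v| *v > 0)
            .map(|v| v as u32);
        nics.push(Nic { name, mac, speed_mbps });
    }
    // Sort by name so the output is stable between runs.
    nics.sort_by(|a, b| a.name.cmp(&b.name));
    Ok(nics)
}
```

On a live inventory host this would be called as `read_nics(Path::new("/sys/class/net"))`; serializing the result to JSON for the HTTP endpoint is left out here.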

3. Network Automation

  • Design a Switch trait in Harmony to abstract switch automation tasks (get_port_status, set_ports_access_mode, set_ports_lacp_mode). (I think it already exists somewhere in the domain data types)
  • Implement this Switch trait for the specific switch hardware in use (e.g., OPNSenseSwitch, ArubaSwitch, CiscoSwitch).
  • Create a SwitchAutomationScore in Harmony that uses this trait to perform the required port mode changes.
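
A rough shape for the Switch abstraction is sketched below. Only the three method names come from this epic; the signatures, the `PortMode` enum, and the in-memory `FakeSwitch` are assumptions made for illustration, and the existing Harmony trait (if there is one) may look different.

```rust
use std::collections::BTreeMap;

/// Port modes relevant to the two phases (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PortMode {
    Access,
    Lacp,
}

/// Method names follow the epic; the signatures are assumptions.
trait Switch {
    fn get_port_status(&self, port: &str) -> Option<PortMode>;
    fn set_ports_access_mode(&mut self, ports: &[&str]) -> Result<(), String>;
    fn set_ports_lacp_mode(&mut self, ports: &[&str]) -> Result<(), String>;
}

/// In-memory stand-in for real hardware, useful for testing orchestration logic.
struct FakeSwitch {
    ports: BTreeMap<String, PortMode>,
}

impl FakeSwitch {
    /// All ports start in access mode, matching the discovery phase.
    fn new(names: &[&str]) -> Self {
        Self { ports: names.iter().map(|n| (n.to_string(), PortMode::Access)).collect() }
    }

    fn set_mode(&mut self, ports: &[&str], mode: PortMode) -> Result<(), String> {
        // Validate first so a bad port name leaves the switch untouched.
        if let Some(unknown) = ports.iter().find(|p| !self.ports.contains_key(**p)) {
            return Err(format!("unknown port {unknown}"));
        }
        for p in ports {
            self.ports.insert(p.to_string(), mode);
        }
        Ok(())
    }
}

impl Switch for FakeSwitch {
    fn get_port_status(&self, port: &str) -> Option<PortMode> {
        self.ports.get(port).copied()
    }
    fn set_ports_access_mode(&mut self, ports: &[&str]) -> Result<(), String> {
        self.set_mode(ports, PortMode::Access)
    }
    fn set_ports_lacp_mode(&mut self, ports: &[&str]) -> Result<(), String> {
        self.set_mode(ports, PortMode::Lacp)
    }
}
```

Keeping a fake implementation next to the trait lets the SwitchAutomationScore be tested without touching a physical switch.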

4. Provisioning Phase Implementation

  • Create the iPXE template for the role-based SCOS boot (scos-boot.ipxe.tpl).
  • Implement logic to generate the bond= and rd.bond= kernel arguments for this template based on the inventoried NICs.
  • Implement logic to generate Ignition files for each role (bootstrap, master, worker).
  • Add NMState or MachineConfig file definitions to the Ignition templates to configure the network bond persistently. (see official docs)
  • Implement the remote reboot functionality using SSH.
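
The template and kernel-argument items above might combine roughly as follows. This is a sketch under several assumptions: the `bond=<name>:<members>:<options>` and `ip=<name>:dhcp` forms follow the dracut command-line syntax, the `coreos.*` arguments follow the OKD bare-metal PXE documentation, the rootfs file name is assumed by analogy with the kernel and initramfs names listed in the prerequisites, and `bond_kargs` and `render_scos_ipxe` are hypothetical helper names. The `rd.bond=` argument is not covered here.

```rust
/// Build dracut-style kernel arguments for an LACP (802.3ad) bond over `nics`.
/// Returns an empty list when fewer than two NICs are given, since a bond is pointless then.
fn bond_kargs(bond: &str, nics: &[&str]) -> Vec<String> {
    if nics.len() < 2 {
        return Vec::new();
    }
    vec![
        format!("bond={bond}:{}:mode=802.3ad", nics.join(",")),
        format!("ip={bond}:dhcp"),
    ]
}

/// Render a per-host iPXE script that boots the live SCOS image and installs to disk.
/// `asset_base` is the HTTP directory holding the release assets (e.g. .../os/scos-4.19).
fn render_scos_ipxe(
    asset_base: &str,
    ignition_url: &str,
    install_dev: &str,
    extra_kargs: &[String],
) -> String {
    let mut kargs = vec![
        "initrd=main".to_string(),
        // Assumed file name; only the kernel and initramfs names appear in the epic.
        format!("coreos.live.rootfs_url={asset_base}/rhcos-live-rootfs.x86_64.img"),
        format!("coreos.inst.install_dev={install_dev}"),
        format!("coreos.inst.ignition_url={ignition_url}"),
    ];
    kargs.extend(extra_kargs.iter().cloned());
    format!(
        "#!ipxe\nkernel {base}/rhcos-live-kernel-x86_64 {args}\ninitrd --name main {base}/rhcos-live-initramfs.x86_64.img\nboot\n",
        base = asset_base,
        args = kargs.join(" "),
    )
}
```

Whether the bond arguments should be present at install time depends on the switch port mode at that moment; passing an empty `extra_kargs` slice produces a script with no bond configuration.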

5. Harmony Core Orchestration

  • Design and implement the OKDInstallationScore hierarchy:
    • OKDSetup01Inventory: Manages the discovery phase. Scans the network for new agents.
    • OKDSetup02Bootstrap: Manages the transition and installation of the bootstrap node.
    • OKDSetup03ControlPlane: Manages the installation of the control plane nodes.
    • OKDSetup04Workers: Manages the installation of the worker nodes.
  • Add a new view to the Harmony TUI to display discovered, unassigned nodes.
  • (Optional, likely out of scope for now) Implement the TUI logic to allow a user to select a node and assign it a cluster role.
  • Integrate the SwitchAutomationScore into the main OKDInstallationScore to trigger network changes at the correct time.
  • Wire the entire flow together, from discovery to the final reboot for provisioning.
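
The four setup scores suggest a linear progression with a gate between each step. The sketch below models that as a small state machine. It is only an illustration: the `Stage` and `ClusterState` types, the gate conditions, and the requirement of three control plane nodes are assumptions, not decisions recorded in this epic.

```rust
/// Stages mirroring OKDSetup01Inventory .. OKDSetup04Workers, plus a terminal state.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Stage {
    Inventory,
    Bootstrap,
    ControlPlane,
    Workers,
    Done,
}

/// Assumed minimum number of control plane nodes before bootstrapping starts.
const REQUIRED_CONTROL_PLANE: usize = 3;

/// Facts the orchestrator would observe (hypothetical fields).
#[derive(Debug, Default)]
struct ClusterState {
    bootstrap_assigned: bool,
    control_plane_assigned: usize,
    bootstrap_complete: bool,
    control_plane_healthy: bool,
    workers_joined: bool,
}

impl Stage {
    /// Move to the next stage only when its gate condition holds; otherwise stay put.
    fn advance(self, s: &ClusterState) -> Stage {
        match self {
            Stage::Inventory
                if s.bootstrap_assigned && s.control_plane_assigned >= REQUIRED_CONTROL_PLANE =>
            {
                Stage::Bootstrap
            }
            Stage::Bootstrap if s.bootstrap_complete => Stage::ControlPlane,
            Stage::ControlPlane if s.control_plane_healthy => Stage::Workers,
            Stage::Workers if s.workers_joined => Stage::Done,
            other => other,
        }
    }
}
```

Because `advance` is a pure function of the current stage and observed state, the orchestrator can call it repeatedly in a polling loop, and switch changes can be attached to specific transitions.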
johnride self-assigned this 2025-08-19 19:35:45 +00:00
Reference: NationTech/harmony#113