WIP: configure-switch #159

Closed
johnride wants to merge 18 commits from configure-switch into master
Owner
No description provided.
johnride added 3 commits 2025-09-16 18:06:36 +00:00
johnride reviewed 2025-09-16 18:11:18 +00:00
johnride left a comment
Author
Owner

LGTM

LGTM
@ -175,0 +193,4 @@
pub struct SwitchPort {
pub mac_address: MacAddress,
pub port_name: String,
// FIXME: Should we add speed as well? And other params
Author
Owner

speed yes, maybe we will need something else but nothing comes to mind right now.

speed yes, maybe we will need something else but nothing comes to mind right now.
letian marked this conversation as resolved
johnride reviewed 2025-09-17 13:31:31 +00:00
@ -77,6 +77,8 @@ harmony_secret = { path = "../harmony_secret" }
askama.workspace = true
sqlx.workspace = true
inquire.workspace = true
brocade = { version = "0.1.0", path = "../brocade" }
Author
Owner

Je ne met pas la version dans les dependances locales habituellement

Je ne met pas la version dans les dependances locales habituellement
letian marked this conversation as resolved
@ -90,2 +103,4 @@
}
fn find_master_switch(&self) -> Option<LogicalHost> {
self.switch.first().cloned() // FIXME: Should we be smarter to find the master switch?
Author
Owner

I think it is fine for now, but that is vendor specific logic. So it should probably be entirely hidden inside the brocade crate's logic.

We might have to interact with something higher level than a "switch" for a more generic API but we're not there yet.

I think it is fine for now, but that is vendor specific logic. So it should probably be entirely hidden inside the brocade crate's logic. We might have to interact with something higher level than a "switch" for a more generic API but we're not there yet.
Owner

It was easy to do so, so I pushed this logic inside the Brocade crate.

We might have to interact with something higher level than a "switch" for a more generic API but we're not there yet.

In a way, that's what we already have with the SwitchClient. Maybe something is missing indeed, but for now it does the job well (the core logic has no idea it's brocade behind the scene).

It was easy to do so, so I pushed this logic inside the Brocade crate. > We might have to interact with something higher level than a "switch" for a more generic API but we're not there yet. In a way, that's what we already have with the `SwitchClient`. Maybe something is missing indeed, but for now it does the job well (the core logic has no idea it's brocade behind the scene).
letian marked this conversation as resolved
johnride reviewed 2025-09-29 13:35:44 +00:00
johnride left a comment
Author
Owner

I did not review very deeply but still a couple comments.

I focused on the tests though and it looks good so far but I would like to see also tests for :

  • Host with multiple nics and some are disconnected
  • Multiple hosts with multiple connected nics and some disconnected (create the bonds properly)
  • I think that may not be possible at this layer but what about mlag? This belongs in the brocade specific code which I might have read the brocade code too fast.

In general I feel like this may deserve some more logging at info level. At the very least installing a cluster operator must be told to the user at the info level. I understand it is an implementation detail in the feature, but this has impact on the entire cluster administration as it deploys more resources which may cause contention on already tight clusters.

Good work so far! As you said we see the light at then end of the tunnel already.

I did not review very deeply but still a couple comments. I focused on the tests though and it looks good so far but I would like to see also tests for : - Host with multiple nics and some are disconnected - Multiple hosts with multiple connected nics and some disconnected (create the bonds properly) - I think that may not be possible at this layer but what about mlag? This belongs in the brocade specific code which I might have read the brocade code too fast. In general I feel like this may deserve some more logging at info level. At the very least installing a cluster operator must be told to the user at the info level. I understand it is an implementation detail in the feature, but this has impact on the entire cluster administration as it deploys more resources which may cause contention on already tight clusters. Good work so far! As you said we see the light at then end of the tunnel already.
@ -14,3 +14,2 @@
pub category: HostCategory,
pub network: Vec<NetworkInterface>,
pub storage: Vec<StorageDrive>,
pub network: Vec<NetworkInterface>, // FIXME: Don't use harmony_inventory_agent::NetworkInterface
Author
Owner

I understand we don't want to use the "inventory agent" data structures, but I think it's OK to have one base data structure that we try to reuse across crates for basic components like NetworkInterface and these few others.

IMO, the FIXME should be about refactoring the type into a shared types crate or something like that.

Why would that not be correct?

I understand we don't want to use the "inventory agent" data structures, but I think it's OK to have one base data structure that we try to reuse across crates for basic components like NetworkInterface and these few others. IMO, the FIXME should be about refactoring the type into a shared types crate or something like that. Why would that not be correct?
Owner

that's one of the options, yes 😉 the fixme is just stating the problem: we should not use directly this type from harmony_inventory_agent

possible solutions are:

  1. (what you suggested) move it to harmony_types and both harmony_inventory_agent and harmony uses it
  2. keep the harmony_inventory_agent::NetworkInterface as is and have a different structure in harmony and map from one to the other

As of now I'm personally 50/50: I have a feeling that the needs from harmony_inventory_agent might evolve differently from harmony thus the NetworkInterface might evolve differently. So going with option 1. wouldn't be adequate. But on the other hand for now they are indeed very similar so why maintain 2 structs? 🤷

Anyway for now it's not really a big issue and won't be hard to fix. So it's more like a note for the future to be aware of.

that's one of the options, yes 😉 the `fixme` is just stating the problem: we should not use directly this type from `harmony_inventory_agent` possible solutions are: 1. (what you suggested) move it to `harmony_types` and both `harmony_inventory_agent` and `harmony` uses it 2. keep the `harmony_inventory_agent::NetworkInterface` as is and have a different structure in `harmony` and map from one to the other As of now I'm personally 50/50: I have a feeling that the needs from `harmony_inventory_agent` might evolve differently from `harmony` thus the `NetworkInterface` might evolve differently. So going with option 1. wouldn't be adequate. But on the other hand for now they are indeed very similar so why maintain 2 structs? 🤷 Anyway for now it's not really a big issue and won't be hard to fix. So it's more like a note for the future to be aware of.
@ -92,0 +156,4 @@
channel: Some("stable".to_string()),
install_plan_approval: Some(InstallPlanApproval::Automatic),
name: "kubernetes-nmstate-operator".to_string(),
source: "redhat-operators".to_string(),
Author
Owner

redhat operators are not always available, it depends on how the OKD cluster was brought up. This method may not be as portable as we need.

I think we can move forward for now but this deserves a comment here in the code to help debug potential issues.

redhat operators are not always available, it depends on how the OKD cluster was brought up. This method may not be as portable as we need. I think we can move forward for now but this deserves a comment here in the code to help debug potential issues.
@ -92,0 +183,4 @@
}
fn get_next_bond_id(&self) -> u8 {
42 // FIXME: Find a better way to declare the bond id
Author
Owner

There may be a better way but no better answer than 42.

There may be a better way but no better answer than 42.
@ -216,0 +215,4 @@
) -> Result<(), InterpretError> {
info!("[ControlPlane] Ensuring persistent bonding");
let score = HostNetworkConfigurationScore {
hosts: hosts.clone(), // FIXME: Avoid clone if possible
Author
Owner

The clone has been bugging me too. We could use an Rc or Arc too as a balance between usability and performance.

The clone has been bugging me too. We could use an Rc or Arc too as a balance between usability and performance.
@ -0,0 +236,4 @@
#[tokio::test]
async fn port_not_found_for_mac_address_should_not_configure_interface() {
// FIXME: Should it still configure an empty bond/port channel?
Author
Owner

If I understand correctly this has the correct behavior. It is perfectly normal for a host to have mac addresses that are not connected to the switch. Most of them do have disconnected 1gb interfaces.

It is also normal to find connected interfaces in the switch that are not part of the cluster. If an admin connects to the switch for debugging, or the switch is used for other purposes than only this cluster.

If I understand correctly this has the correct behavior. It is perfectly normal for a host to have mac addresses that are not connected to the switch. Most of them do have disconnected 1gb interfaces. It is also normal to find connected interfaces in the switch that are not part of the cluster. If an admin connects to the switch for debugging, or the switch is used for other purposes than only this cluster.
Owner

in that case what should we do if we realize that not a single mac addresses were found on the switch? should we simply ignore it (current solution) or should we clear existing configurations to a clean slate?

and actually the question will be similar when we'll approach the "update config" scenario: what to do with existing bonds? and port-channels? should we just clear everything before and just apply the new config? or a smart way to update existing bonds/port-channels?

in that case what should we do if we realize that not a single mac addresses were found on the switch? should we simply ignore it (current solution) or should we clear existing configurations to a clean slate? and actually the question will be similar when we'll approach the "update config" scenario: what to do with existing bonds? and port-channels? should we just clear everything before and just apply the new config? or a smart way to update existing bonds/port-channels?
Author
Owner

Ok, I think I figured it out and it's not as complicated as it seems.

The problem is not only specific replacement logic, it is also idempotency. Does the current code currently support detecting that the mac address is connected to a port that is already in a port channel with the correct configuration (correct mac + ports assigned to it) ?

I think the correct behavior here is :

  • You have a list of mac addresses that can be bonded together (the entire list for a single host)
  • Find all the port numbers in the switch stack that match the host
  • Check if any of those are already in a port channel
  • If more than one port channel found, crash with a clear error message that the status of the switch configuration is not supported with the actual, wrong config clearly outlined.
  • Ensure the port channel ends up with the correct config (all found ports and only found ports, no foreign port allowed)

This behavior should be quite easily testable too 😁

Ok, I think I figured it out and it's not as complicated as it seems. The problem is not only specific replacement logic, it is also idempotency. Does the current code currently support detecting that the mac address is connected to a port that is already in a port channel with the correct configuration (correct mac + ports assigned to it) ? I think the correct behavior here is : - You have a list of mac addresses that can be bonded together (the entire list for a single host) - Find all the port numbers in the switch stack that match the host - Check if any of those are already in a port channel - If more than one port channel found, crash with a clear error message that the status of the switch configuration is not supported with the actual, wrong config clearly outlined. - Ensure the port channel ends up with the correct config (all found ports and only found ports, no foreign port allowed) This behavior should be quite easily testable too 😁
letian added 1 commit 2025-10-08 01:27:54 +00:00
refactor brocade to support different shell versions (e.g. FastIron vs NOS)
Some checks failed
Run Check Script / check (pull_request) Failing after 1m17s
ad61be277b
letian force-pushed configure-switch from ad61be277b to 77e09436a9 2025-10-08 12:14:44 +00:00 Compare
letian added 1 commit 2025-10-08 14:36:09 +00:00
improve wait for shell ready, restore dry run mode
All checks were successful
Run Check Script / check (pull_request) Successful in 1m12s
073cccde2f
letian added 1 commit 2025-10-08 21:05:30 +00:00
get the stack topology (inter-switch links) and all interfaces
All checks were successful
Run Check Script / check (pull_request) Successful in 1m13s
1265cebfa7
letian added 1 commit 2025-10-14 21:21:19 +00:00
implement switch brocade setup by configuring interfaces with operating mode access
All checks were successful
Run Check Script / check (pull_request) Successful in 1m6s
da5be17cb6
letian added 1 commit 2025-10-15 16:13:15 +00:00
implement port channel functions for Brocade::NetworkOperatingSystem
All checks were successful
Run Check Script / check (pull_request) Successful in 1m5s
2edd24753a
letian added 1 commit 2025-10-15 18:50:24 +00:00
Merge branch 'master' into configure-switch
All checks were successful
Run Check Script / check (pull_request) Successful in 1m6s
ea7322f38c
letian closed this pull request 2025-10-15 20:01:01 +00:00
All checks were successful
Run Check Script / check (pull_request) Successful in 1m6s

Pull request closed

Sign in to join this conversation.
No reviewers
No Milestone
No project
No Assignees
2 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: NationTech/harmony#159
No description provided.