Compare commits


1 Commits

Author SHA1 Message Date
adb0e7014d adr: Staless based failover mechanism ADR proposed
All checks were successful
Run Check Script / check (pull_request) Successful in 1m6s
2026-01-08 23:58:30 -05:00
13 changed files with 232 additions and 593 deletions

View File

@@ -1,189 +0,0 @@
### 1. ADR 017-1: NATS Cluster Interconnection & Trust Topology
# Architecture Decision Record: NATS Cluster Interconnection & Trust Topology
**Status:** Proposed
**Date:** 2026-01-12
**Precedes:** [017-Staleness-Detection-for-Failover.md]
## Context
In ADR 017, we defined the failover mechanisms for the Harmony mesh. However, for a Primary (Site A) and a Replica (Site B) to communicate securely—or for the Global Mesh to function across disparate locations—we must establish a robust Transport Layer Security (TLS) strategy.
Our primary deployment platform is OKD (Kubernetes). While OKD provides an internal `service-ca`, it is designed primarily for intra-cluster service-to-service communication. It lacks the flexibility required for:
1. **Public/External Gateway Identities:** NATS Gateways need to identify themselves via public DNS names or external IPs, not just internal `.svc` cluster domains.
2. **Cross-Cluster Trust:** We need a mechanism to allow Cluster A to trust Cluster B without sharing a single private root key.
## Decision
We will implement an **"Islands of Trust"** topology using **cert-manager** on OKD.
### 1. Per-Cluster Certificate Authorities (CA)
* We explicitly **reject** the use of a single "Supercluster CA" shared across all sites.
* Instead, every Harmony Cluster (Site A, Site B, etc.) will generate its own unique Self-Signed Root CA managed by `cert-manager` inside that cluster.
* **Lifecycle:** Root CAs will have a long duration (e.g., 10 years) to minimize rotation friction, while Leaf Certificates (NATS servers) will remain short-lived (e.g., 90 days) and rotate automatically.
> Note: It is not yet decided whether each Harmony deployment should use a single CA for all the workloads it manages, or a separate CA for each service that requires interconnection. This ADR leans towards one CA per service, which allows maximum flexibility, but the direction might change. The alternative of giving each cluster/Harmony deployment a single identity could make mTLS between tenants very simple.
### 2. Trust Federation via Bundle Exchange
To enable secure communication (mTLS) between clusters (e.g., for NATS Gateways or Leaf Nodes):
* **No Private Keys are shared.**
* We will aggregate the **Public CA Certificates** of all trusted clusters into a shared `ca-bundle.pem`.
* This bundle is distributed to the NATS configuration of every node.
* **Verification Logic:** When Site A connects to Site B, Site A verifies Site B's certificate against the bundle. Since Site B's CA public key is in the bundle, the connection is accepted.
### 3. Tooling
* We will use **cert-manager** (deployed via Operator on OKD) rather than OKD's built-in `service-ca`. This provides us with standard CRDs (`Issuer`, `Certificate`) to manage the lifecycle, rotation, and complex SANs (Subject Alternative Names) required for external connectivity.
* Harmony will manage installation, configuration, and bundle creation across all sites.
## Rationale
**Security Blast Radius (The "Key Leak" Scenario)**
If we used a single global CA and the private key for Site A was compromised (e.g., physical theft of a server from a basement), the attacker could impersonate *any* site in the global mesh.
By using Per-Cluster CAs:
* If Site A is compromised, only Site A's identity is stolen.
* We can "evict" Site A from the mesh simply by removing Site A's Public CA from the `ca-bundle.pem` on the remaining healthy clusters and reloading. The attacker can no longer authenticate.
**Decentralized Autonomy**
This aligns with the "Humane Computing" vision. A local cluster owns its identity. It does not depend on a central authority to issue its certificates. It can function in isolation (offline) indefinitely without needing to "phone home" to renew credentials.
## Consequences
**Positive**
* **High Security:** Compromise of one node does not compromise the global mesh.
* **Flexibility:** Easier to integrate with third-party clusters or partners by simply adding their public CA to the bundle.
* **Standardization:** `cert-manager` is the industry standard, making the configuration portable to non-OKD K8s clusters if needed.
**Negative**
* **Configuration Complexity:** We must manage a mechanism to distribute the `ca-bundle.pem` containing public keys to all sites. This should be automated (e.g., via a Harmony Agent) to ensure timely updates and revocation.
* **Revocation Latency:** Revoking a compromised cluster requires updating and reloading the bundle on all other clusters. This is slower than OCSP/CRL but acceptable for infrastructure-level trust if automation is in place.
---
# 2. Concrete overview of the process, how it can be implemented manually across multiple OKD clusters
All of this will be automated via Harmony, but to understand the process correctly, it is outlined in detail here:
## 1. Deploying and Configuring cert-manager on OKD
While OKD has a built-in `service-ca` controller, it is "opinionated" and primarily signs certs for internal services (like `my-svc.my-namespace.svc`). It is **not suitable** for the Harmony Global Mesh because you cannot easily control the Subject Alternative Names (SANs) for external routes (e.g., `nats.site-a.nationtech.io`), nor can you easily export its CA to other clusters.
**The Solution:** Use the **cert-manager Operator for Red Hat OpenShift**.
### Step 1: Install the Operator
1. Log in to the OKD Web Console.
2. Navigate to **Operators** -> **OperatorHub**.
3. Search for **"cert-manager"**.
4. Choose the **"cert-manager Operator for Red Hat OpenShift"** (Red Hat provided) or the community version.
5. Click **Install**. Use the default settings (Namespace: `cert-manager-operator`).
### Step 2: Create the "Island" CA (The Issuer)
Once installed, you define your cluster's unique identity. Apply this YAML to your NATS namespace.
```yaml
# filepath: k8s/01-issuer.yaml
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: harmony-selfsigned-issuer
  namespace: harmony-nats
spec:
  selfSigned: {}
---
# This generates the unique Root CA for THIS specific cluster
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: harmony-root-ca
  namespace: harmony-nats
spec:
  isCA: true
  commonName: "harmony-site-a-ca" # CHANGE THIS per cluster (e.g., site-b-ca)
  duration: 87600h # 10 years
  renewBefore: 2160h # 3 months before expiry
  secretName: harmony-root-ca-secret
  privateKey:
    algorithm: ECDSA
    size: 256
  issuerRef:
    name: harmony-selfsigned-issuer
    kind: Issuer
    group: cert-manager.io
---
# This Issuer uses the Root CA generated above to sign NATS certs
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: harmony-ca-issuer
  namespace: harmony-nats
spec:
  ca:
    secretName: harmony-root-ca-secret
```
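To confirm that the per-cluster Root CA has been issued, the resources defined above can be checked with standard `oc` commands:
```bash
# Ready=True means cert-manager has generated and signed the Root CA
oc get certificate harmony-root-ca -n harmony-nats

# The secret holds ca.crt, tls.crt and tls.key for this cluster's Root CA
oc get secret harmony-root-ca-secret -n harmony-nats
```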
### Step 3: Generate the NATS Server Certificate
This certificate will be used by the NATS server. It includes both internal DNS names (for local clients) and external DNS names (for the global mesh).
```yaml
# filepath: k8s/02-nats-cert.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: nats-server-cert
  namespace: harmony-nats
spec:
  secretName: nats-server-tls
  duration: 2160h # 90 days
  renewBefore: 360h # 15 days
  issuerRef:
    name: harmony-ca-issuer
    kind: Issuer
  # CRITICAL: Define all names this server can be reached by
  dnsNames:
    - "nats"
    - "nats.harmony-nats.svc"
    - "nats.harmony-nats.svc.cluster.local"
    - "*.nats.harmony-nats.svc.cluster.local"
    - "nats-gateway.site-a.nationtech.io" # External Route for Mesh
```
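Once the certificate is issued, its SANs can be inspected from the resulting secret to confirm both internal and external names are present (a manual verification sketch, not part of the automated flow):
```bash
oc get secret nats-server-tls -n harmony-nats -o jsonpath='{.data.tls\.crt}' \
  | base64 -d \
  | openssl x509 -noout -text \
  | grep -A1 "Subject Alternative Name"
```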
## 2. Implementing the "Islands of Trust" (Trust Bundle)
To make Site A and Site B talk, you need to exchange **Public Keys**.
1. **Extract Public CA from Site A:**
```bash
oc get secret harmony-root-ca-secret -n harmony-nats -o jsonpath='{.data.ca\.crt}' | base64 -d > site-a.crt
```
2. **Extract Public CA from Site B:**
```bash
oc get secret harmony-root-ca-secret -n harmony-nats -o jsonpath='{.data.ca\.crt}' | base64 -d > site-b.crt
```
3. **Create the Bundle:**
Combine them into one file.
```bash
cat site-a.crt site-b.crt > ca-bundle.crt
```
4. **Upload Bundle to Both Clusters:**
Create a ConfigMap or Secret in *both* clusters containing this combined bundle.
```bash
oc create configmap nats-trust-bundle --from-file=ca.crt=ca-bundle.crt -n harmony-nats
```
5. **Configure NATS:**
Mount this ConfigMap and point NATS to it.
```conf
# nats.conf snippet
tls {
  cert_file: "/etc/nats-certs/tls.crt"
  key_file: "/etc/nats-certs/tls.key"
  # Point to the bundle containing BOTH Site A and Site B public CAs
  ca_file: "/etc/nats-trust/ca.crt"
}
```
This setup ensures that Site A can verify Site B's certificate (signed by `harmony-site-b-ca`) because Site B's CA is in Site A's trust store, and vice versa, without ever sharing the private keys that generated them.
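How the secret and trust bundle reach the NATS pods depends on the Helm chart values Harmony generates; as a plain pod-spec fragment (illustrative only, not tied to a specific chart schema), the mounts matching the paths above would look like:
```yaml
# Illustrative pod spec fragment: mount the server cert secret and the trust
# bundle at the paths referenced by nats.conf above.
volumes:
  - name: nats-server-tls
    secret:
      secretName: nats-server-tls
  - name: nats-trust
    configMap:
      name: nats-trust-bundle
containers:
  - name: nats
    volumeMounts:
      - name: nats-server-tls
        mountPath: /etc/nats-certs
        readOnly: true
      - name: nats-trust
        mountPath: /etc/nats-trust
        readOnly: true
```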

View File

@@ -0,0 +1,95 @@
# Architecture Decision Record: Staleness-Based Failover Mechanism & Observability
**Status:** Proposed
**Date:** 2026-01-09
**Precedes:** [016-Harmony-Agent-And-Global-Mesh-For-Decentralized-Workload-Management.md](https://git.nationtech.io/NationTech/harmony/raw/branch/master/adr/016-Harmony-Agent-And-Global-Mesh-For-Decentralized-Workload-Management.md)
## Context
In ADR 016, we established the **Harmony Agent** and the **Global Orchestration Mesh** (powered by NATS JetStream) as the foundation for our decentralized infrastructure. We defined the high-level need for a `FailoverStrategy` that can support both financial consistency (CP) and AI availability (AP).
However, a specific implementation challenge remains: **How do we reliably detect node failure without losing the ability to debug the event later?**
Standard distributed systems often use "Key Expiration" (TTL) for heartbeats. If a key disappears, the node is presumed dead. While simple, this approach is catastrophic for post-mortem analysis. When the key expires, the evidence of *when* and *how* the failure occurred evaporates.
For NationTech's vision of **Humane Computing**—where micro datacenters might be heating a family home or running a local business—reliability and diagnosability are paramount. If a cluster fails over, we owe it to the user to provide a clear, historical log of exactly what happened. We cannot build a "wonderful future for computers" on ephemeral, untraceable errors.
## Decision
We will implement a **Staleness Detection** mechanism rather than a Key Expiration mechanism. We will leverage NATS JetStream Key-Value (KV) stores with **History Enabled** to create an immutable audit trail of cluster health.
### 1. The "Black Box" Flight Recorder (NATS Configuration)
We will utilize a persistent NATS KV bucket named `harmony_failover`.
* **Storage:** File (Persistent).
* **History:** Set to `64` (or higher). This allows us to query the last 64 heartbeat entries to visualize the exact degradation of the primary node before failure.
* **TTL:** None. Data never disappears; it only becomes "stale."
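With the NATS CLI, a bucket with these properties can be created roughly as follows (the replica count is an assumption, not part of this ADR):
```bash
# Persistent "black box": file storage, 64 revisions of history, no TTL
nats kv add harmony_failover --history=64 --storage=file --replicas=3

# Confirm the bucket configuration
nats kv info harmony_failover
```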
### 2. Data Structures
We will define two primary schemas to manage the state.
**A. The Rules of Engagement (`cluster_config`)**
This persistent key defines the behavior of the mesh. It allows us to tune failover sensitivity dynamically without redeploying the Agent binary.
```json
{
  "primary_site_id": "site-a-basement",
  "replica_site_id": "site-b-cloud",
  "failover_timeout_ms": 5000,   // Time before Replica takes over
  "heartbeat_interval_ms": 1000  // Frequency of Primary updates
}
```
> **Note:** The location of this configuration data structure is TBD. See https://git.nationtech.io/NationTech/harmony/issues/206
**B. The Heartbeat (`primary_heartbeat`)**
The Primary writes this; the Replica watches it.
```json
{
  "site_id": "site-a-basement",
  "status": "HEALTHY",
  "counter": 10452,
  "timestamp": 1704661549000
}
```
### 3. The Failover Algorithm
**The Primary (Site A) Logic:**
The Primary's ability to write to the mesh is its "License to Operate."
1. **Write Loop:** Attempts to write `primary_heartbeat` every `heartbeat_interval_ms`.
2. **Self-Preservation (Fencing):** If the write fails (NATS Ack timeout or NATS unreachable), the Primary **immediately self-demotes**. It assumes it is network-isolated. This prevents Split Brain scenarios where a partitioned Primary continues to accept writes while the Replica promotes itself.
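A minimal sketch of this loop, assuming the `async-nats` JetStream KV API; the interval/timeout constants and the `self_demote()` hook are placeholders for Harmony-specific wiring that this ADR does not define:
```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};
use tokio::time::{interval, timeout};

// Sketch values only: in Harmony these would come from `cluster_config`.
const HEARTBEAT_INTERVAL_MS: u64 = 1_000;
const WRITE_ACK_TIMEOUT_MS: u64 = 2_000;

async fn primary_loop(kv: async_nats::jetstream::kv::Store) {
    let mut ticker = interval(Duration::from_millis(HEARTBEAT_INTERVAL_MS));
    let mut counter: u64 = 0;

    loop {
        ticker.tick().await;
        counter += 1;

        let now_ms = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .expect("clock before epoch")
            .as_millis();
        let heartbeat = format!(
            r#"{{"site_id":"site-a-basement","status":"HEALTHY","counter":{counter},"timestamp":{now_ms}}}"#
        );

        // The acknowledged write is the Primary's "License to Operate": if NATS does
        // not ack in time (or is unreachable), assume we are partitioned and fence.
        let write = kv.put("primary_heartbeat", heartbeat.into());
        match timeout(Duration::from_millis(WRITE_ACK_TIMEOUT_MS), write).await {
            Ok(Ok(_revision)) => { /* heartbeat accepted by the mesh */ }
            Ok(Err(_)) | Err(_) => {
                self_demote().await; // hypothetical hook: stop accepting writes
                return;
            }
        }
    }
}

async fn self_demote() { /* hypothetical: flip local role to Replica, reject writes */ }
```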
**The Replica (Site B) Logic:**
The Replica acts as the watchdog.
1. **Watch:** Subscribes to updates on `primary_heartbeat`.
2. **Staleness Check:** Maintains a local timer. Every time a heartbeat arrives, the timer resets.
3. **Promotion:** If the timer exceeds `failover_timeout_ms`, the Replica declares the Primary dead and promotes itself to Leader.
4. **Yielding:** If the Replica is Leader, but suddenly receives a valid, new heartbeat from the configured `primary_site_id` (indicating the Primary has recovered), the Replica will voluntarily **demote** itself to restore the preferred topology.
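A matching sketch of the Replica watchdog under the same assumptions (`async-nats` KV watch; `promote_to_leader()` and `demote_if_leader()` are hypothetical hooks):
```rust
use futures::StreamExt;
use std::time::Duration;
use tokio::time::timeout;

const FAILOVER_TIMEOUT_MS: u64 = 5_000; // would be read from `cluster_config`

async fn replica_watchdog(kv: async_nats::jetstream::kv::Store) {
    let mut heartbeats = kv
        .watch("primary_heartbeat")
        .await
        .expect("failed to watch primary_heartbeat");

    loop {
        // Staleness check: the timer resets on every heartbeat; promote if it elapses.
        match timeout(Duration::from_millis(FAILOVER_TIMEOUT_MS), heartbeats.next()).await {
            Ok(Some(Ok(_entry))) => {
                // Fresh heartbeat from the configured primary: yield if we had promoted.
                demote_if_leader().await;
            }
            Ok(Some(Err(_))) => { /* transient watch error: keep waiting */ }
            Ok(None) | Err(_) => {
                // Watch closed or timer elapsed: declare the Primary dead and take over.
                promote_to_leader().await;
            }
        }
    }
}

async fn promote_to_leader() { /* hypothetical promotion hook */ }
async fn demote_if_leader() { /* hypothetical yield-back hook */ }
```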
## Rationale
**Observability as a First-Class Citizen**
By keeping the last 64 heartbeats, we can run `nats kv history` to see the exact timeline. Did the Primary stop suddenly (crash)? Or did the heartbeats become erratic and slow before stopping (network congestion)? This data is critical for optimizing the "Micro Data Centers" described in our vision, where internet connections in residential areas may vary in quality.
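For example, the recent timeline can be pulled straight from the bucket with the NATS CLI:
```bash
# Revision history of the Primary's heartbeat, for post-mortem analysis
nats kv history harmony_failover primary_heartbeat
```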
**Energy Efficiency & Resource Optimization**
NationTech aims to "maximize the value of our energy." A "flapping" cluster (constantly failing over and back) wastes immense energy in data re-synchronization and startup costs. By making the `failover_timeout_ms` configurable via `cluster_config`, we can tune a cluster heating a greenhouse to be less sensitive (slower failover is fine) compared to a cluster running a payment gateway.
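Because the configuration lives in the same KV store, retuning a site is a single key update; for instance (values illustrative, and subject to the open question above about where `cluster_config` ultimately lives):
```bash
# Relax failover sensitivity for a non-critical site (e.g. greenhouse heating)
nats kv put harmony_failover cluster_config \
  '{"primary_site_id":"site-a-basement","replica_site_id":"site-b-cloud","failover_timeout_ms":15000,"heartbeat_interval_ms":1000}'
```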
**Decentralized Trust**
This architecture relies on NATS as the consensus engine. If the Primary is part of the NATS majority, it lives. If it isn't, it dies. This removes ambiguity and allows us to scale to thousands of independent sites without a central "God mode" controller managing every single failover.
## Consequences
**Positive**
* **Auditability:** Every failover event leaves a permanent trace in the KV history.
* **Safety:** The "Write Ack" check on the Primary provides a strong guarantee against Split Brain in `AbsoluteConsistency` mode.
* **Dynamic Tuning:** We can adjust timeouts for specific environments (e.g., high-latency satellite links) by updating a JSON key, requiring no downtime.
**Negative**
* **Storage Overhead:** Keeping history requires marginally more disk space on the NATS servers, though for 64 small JSON payloads, this is negligible.
* **Clock Skew:** While we rely on NATS server-side timestamps for ordering, extreme clock skew on the client side could confuse the debug logs (though not the failover logic itself).
## Alignment with Vision
This architecture supports the NationTech goal of a **"Beautifully Integrated Design."** It takes the complex, high-stakes problem of distributed consensus and wraps it in a mechanism that is robust enough for enterprise banking yet flexible enough to manage a basement server heating a swimming pool. It bridges the gap between the reliability of Web2 clouds and the decentralized nature of Web3 infrastructure.

View File

@@ -3,58 +3,15 @@ use std::str::FromStr;
use harmony::{
inventory::Inventory,
modules::helm::chart::{HelmChartScore, HelmRepository, NonBlankString},
topology::{HelmCommand, K8sAnywhereConfig, K8sAnywhereTopology, TlsRouter, Topology},
topology::K8sAnywhereTopology,
};
use harmony_macros::hurl;
use log::info;
#[tokio::main]
async fn main() {
let site1_topo = K8sAnywhereTopology::with_config(K8sAnywhereConfig::remote_k8s_from_env_var(
"HARMONY_NATS_SITE_1",
));
let site2_topo = K8sAnywhereTopology::with_config(K8sAnywhereConfig::remote_k8s_from_env_var(
"HARMONY_NATS_SITE_2",
));
let site1_domain = site1_topo.get_internal_domain().await.unwrap().unwrap();
let site2_domain = site2_topo.get_internal_domain().await.unwrap().unwrap();
let site1_gateway = format!("nats-gateway.{}", site1_domain);
let site2_gateway = format!("nats-gateway.{}", site2_domain);
tokio::join!(
deploy_nats(
site1_topo,
"site-1",
vec![("site-2".to_string(), site2_gateway)]
),
deploy_nats(
site2_topo,
"site-2",
vec![("site-1".to_string(), site1_gateway)]
),
);
}
async fn deploy_nats<T: Topology + HelmCommand + TlsRouter + 'static>(
topology: T,
cluster_name: &str,
remote_gateways: Vec<(String, String)>,
) {
topology.ensure_ready().await.unwrap();
let mut gateway_gateways = String::new();
for (name, url) in remote_gateways {
gateway_gateways.push_str(&format!(
r#"
- name: {name}
urls:
- nats://{url}:7222"#
));
}
let values_yaml = Some(format!(
// env_logger::init();
let values_yaml = Some(
r#"config:
cluster:
enabled: true
@@ -68,31 +25,16 @@ async fn deploy_nats<T: Topology + HelmCommand + TlsRouter + 'static>(
leafnodes:
enabled: false
# port: 7422
websocket:
enabled: true
ingress:
enabled: true
className: openshift-default
pathType: Prefix
hosts:
- nats-ws.{}
gateway:
enabled: true
name: {}
port: 7222
gateways: {}
service:
ports:
gateway:
enabled: true
enabled: false
# name: my-gateway
# port: 7522
natsBox:
container:
image:
tag: nonroot"#,
topology.get_internal_domain().await.unwrap().unwrap(),
cluster_name,
gateway_gateways,
));
tag: nonroot"#
.to_string(),
);
let namespace = "nats";
let nats = HelmChartScore {
namespace: Some(NonBlankString::from_str(namespace).unwrap()),
@@ -110,9 +52,14 @@ natsBox:
)),
};
harmony_cli::run(Inventory::autoload(), topology, vec![Box::new(nats)], None)
.await
.unwrap();
harmony_cli::run(
Inventory::autoload(),
K8sAnywhereTopology::from_env(),
vec![Box::new(nats)],
None,
)
.await
.unwrap();
info!(
"Enjoy! You can test your nats cluster by running : `kubectl exec -n {namespace} -it deployment/nats-box -- nats pub test hi`"

View File

@@ -1,5 +1 @@
use std::process::Command;
pub trait HelmCommand {
fn get_helm_command(&self) -> Command;
}
pub trait HelmCommand {}

View File

@@ -35,7 +35,6 @@ use crate::{
service_monitor::ServiceMonitor,
},
},
okd::crd::ingresses_config::Ingress as IngressResource,
okd::route::OKDTlsPassthroughScore,
prometheus::{
k8s_prometheus_alerting_score::K8sPrometheusCRDAlertingScore,
@@ -108,32 +107,8 @@ impl K8sclient for K8sAnywhereTopology {
#[async_trait]
impl TlsRouter for K8sAnywhereTopology {
async fn get_internal_domain(&self) -> Result<Option<String>, String> {
match self.get_k8s_distribution().await.map_err(|e| {
format!(
"Could not get internal domain, error getting k8s distribution : {}",
e.to_string()
)
})? {
KubernetesDistribution::OpenshiftFamily => {
let client = self.k8s_client().await?;
if let Some(ingress_config) = client
.get_resource::<IngressResource>("cluster", None)
.await
.map_err(|e| {
format!("Error attempting to get ingress config : {}", e.to_string())
})?
{
debug!("Found ingress config {:?}", ingress_config.spec);
Ok(ingress_config.spec.domain.clone())
} else {
warn!("Could not find a domain configured in this cluster");
Ok(None)
}
}
KubernetesDistribution::K3sFamily => todo!(),
KubernetesDistribution::Default => todo!(),
}
async fn get_wildcard_domain(&self) -> Result<Option<String>, String> {
todo!()
}
/// Returns the port that this router exposes externally.
@@ -1112,21 +1087,7 @@ impl MultiTargetTopology for K8sAnywhereTopology {
}
}
impl HelmCommand for K8sAnywhereTopology {
fn get_helm_command(&self) -> Command {
let mut cmd = Command::new("helm");
if let Some(k) = &self.config.kubeconfig {
cmd.args(["--kubeconfig", k]);
}
if let Some(c) = &self.config.k8s_context {
cmd.args(["--kube-context", c]);
}
info!("Using helm command {cmd:?}");
cmd
}
}
impl HelmCommand for K8sAnywhereTopology {}
#[async_trait]
impl TenantManager for K8sAnywhereTopology {
@@ -1147,7 +1108,7 @@ impl TenantManager for K8sAnywhereTopology {
#[async_trait]
impl Ingress for K8sAnywhereTopology {
async fn get_domain(&self, service: &str) -> Result<String, PreparationError> {
use log::{trace, warn};
use log::{debug, trace, warn};
let client = self.k8s_client().await?;

View File

@@ -2,7 +2,7 @@ use async_trait::async_trait;
use derive_new::new;
use serde::{Deserialize, Serialize};
use super::{PreparationError, PreparationOutcome, Topology};
use super::{HelmCommand, PreparationError, PreparationOutcome, Topology};
#[derive(new, Clone, Debug, Serialize, Deserialize)]
pub struct LocalhostTopology;
@@ -19,3 +19,6 @@ impl Topology for LocalhostTopology {
})
}
}
// TODO: Delete this, temp for test
impl HelmCommand for LocalhostTopology {}

View File

@@ -112,13 +112,12 @@ pub trait TlsRouter: Send + Sync {
/// HAProxy frontend→backend "postgres-upstream".
async fn install_route(&self, config: TlsRoute) -> Result<(), String>;
/// Gets the base domain of this cluster. On openshift family clusters, this is the domain
/// used by default for all components, including the default ingress controller that
/// transforms ingress to routes.
/// Gets the base domain that can be used to deploy applications that will be automatically
/// routed to this cluster.
///
/// For example, get_internal_domain on a cluster that has `console-openshift-console.apps.mycluster.something`
/// will return `apps.mycluster.something`
async fn get_internal_domain(&self) -> Result<Option<String>, String>;
/// For example, if we have *.apps.nationtech.io pointing to a public load balancer, then this
/// function would install route apps.nationtech.io
async fn get_wildcard_domain(&self) -> Result<Option<String>, String>;
/// Returns the port that this router exposes externally.
async fn get_router_port(&self) -> u16;

View File

@@ -6,11 +6,15 @@ use crate::topology::{HelmCommand, Topology};
use async_trait::async_trait;
use harmony_types::id::Id;
use harmony_types::net::Url;
use helm_wrapper_rs;
use helm_wrapper_rs::blocking::{DefaultHelmExecutor, HelmExecutor};
use log::{debug, info, warn};
pub use non_blank_string_rs::NonBlankString;
use serde::Serialize;
use std::collections::HashMap;
use std::process::{Output, Stdio};
use std::path::Path;
use std::process::{Command, Output, Stdio};
use std::str::FromStr;
use temp_file::TempFile;
#[derive(Debug, Clone, Serialize)]
@@ -61,7 +65,7 @@ pub struct HelmChartInterpret {
pub score: HelmChartScore,
}
impl HelmChartInterpret {
fn add_repo<T: HelmCommand>(&self, topology: &T) -> Result<(), InterpretError> {
fn add_repo(&self) -> Result<(), InterpretError> {
let repo = match &self.score.repository {
Some(repo) => repo,
None => {
@@ -80,7 +84,7 @@ impl HelmChartInterpret {
add_args.push("--force-update");
}
let add_output = run_helm_command(topology, &add_args)?;
let add_output = run_helm_command(&add_args)?;
let full_output = format!(
"{}\n{}",
String::from_utf8_lossy(&add_output.stdout),
@@ -96,19 +100,23 @@ impl HelmChartInterpret {
}
}
fn run_helm_command<T: HelmCommand>(topology: &T, args: &[&str]) -> Result<Output, InterpretError> {
let mut helm_cmd = topology.get_helm_command();
helm_cmd.args(args);
fn run_helm_command(args: &[&str]) -> Result<Output, InterpretError> {
let command_str = format!("helm {}", args.join(" "));
debug!(
"Got KUBECONFIG: `{}`",
std::env::var("KUBECONFIG").unwrap_or("".to_string())
);
debug!("Running Helm command: `{}`", command_str);
debug!("Running Helm command: `{:?}`", helm_cmd);
let output = helm_cmd
let output = Command::new("helm")
.args(args)
.stdout(Stdio::piped())
.stderr(Stdio::piped())
.output()
.map_err(|e| {
InterpretError::new(format!(
"Failed to execute helm command '{helm_cmd:?}': {e}. Is helm installed and in PATH?",
"Failed to execute helm command '{}': {}. Is helm installed and in PATH?",
command_str, e
))
})?;
@@ -116,13 +124,13 @@ fn run_helm_command<T: HelmCommand>(topology: &T, args: &[&str]) -> Result<Outpu
let stdout = String::from_utf8_lossy(&output.stdout);
let stderr = String::from_utf8_lossy(&output.stderr);
warn!(
"Helm command `{helm_cmd:?}` failed with status: {}\nStdout:\n{stdout}\nStderr:\n{stderr}",
output.status
"Helm command `{}` failed with status: {}\nStdout:\n{}\nStderr:\n{}",
command_str, output.status, stdout, stderr
);
} else {
debug!(
"Helm command `{helm_cmd:?}` finished successfully. Status: {}",
output.status
"Helm command `{}` finished successfully. Status: {}",
command_str, output.status
);
}
@@ -134,7 +142,7 @@ impl<T: Topology + HelmCommand> Interpret<T> for HelmChartInterpret {
async fn execute(
&self,
_inventory: &Inventory,
topology: &T,
_topology: &T,
) -> Result<Outcome, InterpretError> {
let ns = self
.score
@@ -142,62 +150,98 @@ impl<T: Topology + HelmCommand> Interpret<T> for HelmChartInterpret {
.as_ref()
.unwrap_or_else(|| todo!("Get namespace from active kubernetes cluster"));
self.add_repo(topology)?;
let mut args = if self.score.install_only {
vec!["install"]
} else {
vec!["upgrade", "--install"]
let tf: TempFile;
let yaml_path: Option<&Path> = match self.score.values_yaml.as_ref() {
Some(yaml_str) => {
tf = temp_file::with_contents(yaml_str.as_bytes());
debug!(
"values yaml string for chart {} :\n {yaml_str}",
self.score.chart_name
);
Some(tf.path())
}
None => None,
};
args.extend(vec![
&self.score.release_name,
&self.score.chart_name,
"--namespace",
&ns,
]);
self.add_repo()?;
let helm_executor = DefaultHelmExecutor::new_with_opts(
&NonBlankString::from_str("helm").unwrap(),
None,
900,
false,
false,
);
let mut helm_options = Vec::new();
if self.score.create_namespace {
args.push("--create-namespace");
helm_options.push(NonBlankString::from_str("--create-namespace").unwrap());
}
if let Some(version) = &self.score.chart_version {
args.push("--version");
args.push(&version);
}
if self.score.install_only {
let chart_list = match helm_executor.list(Some(ns)) {
Ok(charts) => charts,
Err(e) => {
return Err(InterpretError::new(format!(
"Failed to list scores in namespace {:?} because of error : {}",
self.score.namespace, e
)));
}
};
let tf: TempFile;
if let Some(yaml_str) = &self.score.values_yaml {
tf = temp_file::with_contents(yaml_str.as_bytes());
args.push("--values");
args.push(tf.path().to_str().unwrap());
}
let overrides_strings: Vec<String>;
if let Some(overrides) = &self.score.values_overrides {
overrides_strings = overrides
if chart_list
.iter()
.map(|(key, value)| format!("{key}={value}"))
.collect();
for o in overrides_strings.iter() {
args.push("--set");
args.push(&o);
.any(|item| item.name == self.score.release_name.to_string())
{
info!(
"Release '{}' already exists in namespace '{}'. Skipping installation as install_only is true.",
self.score.release_name, ns
);
return Ok(Outcome::success(format!(
"Helm Chart '{}' already installed to namespace {ns} and install_only=true",
self.score.release_name
)));
} else {
info!(
"Release '{}' not found in namespace '{}'. Proceeding with installation.",
self.score.release_name, ns
);
}
}
let output = run_helm_command(topology, &args)?;
let res = helm_executor.install_or_upgrade(
ns,
&self.score.release_name,
&self.score.chart_name,
self.score.chart_version.as_ref(),
self.score.values_overrides.as_ref(),
yaml_path,
Some(&helm_options),
);
if output.status.success() {
Ok(Outcome::success(format!(
let status = match res {
Ok(status) => status,
Err(err) => return Err(InterpretError::new(err.to_string())),
};
match status {
helm_wrapper_rs::HelmDeployStatus::Deployed => Ok(Outcome::success(format!(
"Helm Chart {} deployed",
self.score.release_name
)))
} else {
Err(InterpretError::new(format!(
"Helm Chart {} installation failed: {}",
self.score.release_name,
String::from_utf8_lossy(&output.stderr)
)))
))),
helm_wrapper_rs::HelmDeployStatus::PendingInstall => Ok(Outcome::running(format!(
"Helm Chart {} pending install...",
self.score.release_name
))),
helm_wrapper_rs::HelmDeployStatus::PendingUpgrade => Ok(Outcome::running(format!(
"Helm Chart {} pending upgrade...",
self.score.release_name
))),
helm_wrapper_rs::HelmDeployStatus::Failed => Err(InterpretError::new(format!(
"Helm Chart {} installation failed",
self.score.release_name
))),
}
}

View File

@@ -5,7 +5,7 @@ use crate::topology::{FailoverTopology, TlsRoute, TlsRouter};
#[async_trait]
impl<T: TlsRouter> TlsRouter for FailoverTopology<T> {
async fn get_internal_domain(&self) -> Result<Option<String>, String> {
async fn get_wildcard_domain(&self) -> Result<Option<String>, String> {
todo!()
}

View File

@@ -1,214 +0,0 @@
use k8s_openapi::apimachinery::pkg::apis::meta::v1::{ListMeta, ObjectMeta};
use k8s_openapi::{ClusterResourceScope, Resource};
use serde::{Deserialize, Serialize};
#[derive(Deserialize, Serialize, Clone, Debug)]
#[serde(rename_all = "camelCase")]
pub struct Ingress {
#[serde(skip_serializing_if = "Option::is_none")]
pub api_version: Option<String>,
#[serde(skip_serializing_if = "Option::is_none")]
pub kind: Option<String>,
pub metadata: ObjectMeta,
pub spec: IngressSpec,
#[serde(skip_serializing_if = "Option::is_none")]
pub status: Option<IngressStatus>,
}
impl Resource for Ingress {
const API_VERSION: &'static str = "config.openshift.io/v1";
const GROUP: &'static str = "config.openshift.io";
const VERSION: &'static str = "v1";
const KIND: &'static str = "Ingress";
const URL_PATH_SEGMENT: &'static str = "ingresses";
type Scope = ClusterResourceScope;
}
impl k8s_openapi::Metadata for Ingress {
type Ty = ObjectMeta;
fn metadata(&self) -> &Self::Ty {
&self.metadata
}
fn metadata_mut(&mut self) -> &mut Self::Ty {
&mut self.metadata
}
}
impl Default for Ingress {
fn default() -> Self {
Ingress {
api_version: Some("config.openshift.io/v1".to_string()),
kind: Some("Ingress".to_string()),
metadata: ObjectMeta::default(),
spec: IngressSpec::default(),
status: None,
}
}
}
#[derive(Deserialize, Serialize, Clone, Debug)]
#[serde(rename_all = "camelCase")]
pub struct IngressList {
pub metadata: ListMeta,
pub items: Vec<Ingress>,
}
impl Default for IngressList {
fn default() -> Self {
Self {
metadata: ListMeta::default(),
items: Vec::new(),
}
}
}
impl Resource for IngressList {
const API_VERSION: &'static str = "config.openshift.io/v1";
const GROUP: &'static str = "config.openshift.io";
const VERSION: &'static str = "v1";
const KIND: &'static str = "IngressList";
const URL_PATH_SEGMENT: &'static str = "ingresses";
type Scope = ClusterResourceScope;
}
impl k8s_openapi::Metadata for IngressList {
type Ty = ListMeta;
fn metadata(&self) -> &Self::Ty {
&self.metadata
}
fn metadata_mut(&mut self) -> &mut Self::Ty {
&mut self.metadata
}
}
#[derive(Deserialize, Serialize, Clone, Debug, Default)]
#[serde(rename_all = "camelCase")]
pub struct IngressSpec {
#[serde(skip_serializing_if = "Option::is_none")]
pub apps_domain: Option<String>,
#[serde(skip_serializing_if = "Option::is_none")]
pub component_routes: Option<Vec<ComponentRouteSpec>>,
#[serde(skip_serializing_if = "Option::is_none")]
pub domain: Option<String>,
#[serde(skip_serializing_if = "Option::is_none")]
pub load_balancer: Option<LoadBalancer>,
#[serde(skip_serializing_if = "Option::is_none")]
pub required_hsts_policies: Option<Vec<RequiredHSTSPolicy>>,
}
#[derive(Deserialize, Serialize, Clone, Debug)]
#[serde(rename_all = "camelCase")]
pub struct ComponentRouteSpec {
pub hostname: String,
pub name: String,
pub namespace: String,
#[serde(skip_serializing_if = "Option::is_none")]
pub serving_cert_key_pair_secret: Option<SecretNameReference>,
}
#[derive(Deserialize, Serialize, Clone, Debug)]
#[serde(rename_all = "camelCase")]
pub struct SecretNameReference {
pub name: String,
}
#[derive(Deserialize, Serialize, Clone, Debug, Default)]
#[serde(rename_all = "camelCase")]
pub struct LoadBalancer {
#[serde(skip_serializing_if = "Option::is_none")]
pub platform: Option<IngressPlatform>,
}
#[derive(Deserialize, Serialize, Clone, Debug, Default)]
#[serde(rename_all = "camelCase")]
pub struct IngressPlatform {
#[serde(skip_serializing_if = "Option::is_none")]
pub aws: Option<AWSPlatformLoadBalancer>,
#[serde(skip_serializing_if = "Option::is_none")]
pub r#type: Option<String>,
}
#[derive(Deserialize, Serialize, Clone, Debug)]
#[serde(rename_all = "camelCase")]
pub struct AWSPlatformLoadBalancer {
pub r#type: String,
}
#[derive(Deserialize, Serialize, Clone, Debug)]
#[serde(rename_all = "camelCase")]
pub struct RequiredHSTSPolicy {
pub domain_patterns: Vec<String>,
#[serde(skip_serializing_if = "Option::is_none")]
pub include_sub_domains_policy: Option<String>,
#[serde(skip_serializing_if = "Option::is_none")]
pub max_age: Option<MaxAgePolicy>,
#[serde(skip_serializing_if = "Option::is_none")]
pub namespace_selector: Option<k8s_openapi::apimachinery::pkg::apis::meta::v1::LabelSelector>,
#[serde(skip_serializing_if = "Option::is_none")]
pub preload_policy: Option<String>,
}
#[derive(Deserialize, Serialize, Clone, Debug, Default)]
#[serde(rename_all = "camelCase")]
pub struct MaxAgePolicy {
#[serde(skip_serializing_if = "Option::is_none")]
pub largest_max_age: Option<i32>,
#[serde(skip_serializing_if = "Option::is_none")]
pub smallest_max_age: Option<i32>,
}
#[derive(Deserialize, Serialize, Clone, Debug, Default)]
#[serde(rename_all = "camelCase")]
pub struct IngressStatus {
#[serde(skip_serializing_if = "Option::is_none")]
pub component_routes: Option<Vec<ComponentRouteStatus>>,
#[serde(skip_serializing_if = "Option::is_none")]
pub default_placement: Option<String>,
}
#[derive(Deserialize, Serialize, Clone, Debug)]
#[serde(rename_all = "camelCase")]
pub struct ComponentRouteStatus {
#[serde(skip_serializing_if = "Option::is_none")]
pub conditions: Option<Vec<k8s_openapi::apimachinery::pkg::apis::meta::v1::Condition>>,
#[serde(skip_serializing_if = "Option::is_none")]
pub consuming_users: Option<Vec<String>>,
#[serde(skip_serializing_if = "Option::is_none")]
pub current_hostnames: Option<Vec<String>>,
pub default_hostname: String,
pub name: String,
pub namespace: String,
pub related_objects: Vec<ObjectReference>,
}
#[derive(Deserialize, Serialize, Clone, Debug)]
#[serde(rename_all = "camelCase")]
pub struct ObjectReference {
pub group: String,
pub name: String,
pub namespace: String,
pub resource: String,
}

View File

@@ -1,3 +1,2 @@
pub mod nmstate;
pub mod route;
pub mod ingresses_config;

View File

@@ -1,4 +1,5 @@
use k8s_openapi::apimachinery::pkg::apis::meta::v1::{ListMeta, ObjectMeta, Time};
use k8s_openapi::apimachinery::pkg::util::intstr::IntOrString;
use k8s_openapi::{NamespaceResourceScope, Resource};
use serde::{Deserialize, Serialize};

View File

@@ -7,14 +7,11 @@ use harmony::{
};
use log::{error, info, log_enabled};
use std::io::Write;
use std::sync::{Mutex, OnceLock};
use std::sync::Mutex;
pub fn init() {
static INITIALIZED: OnceLock<()> = OnceLock::new();
INITIALIZED.get_or_init(|| {
configure_logger();
handle_events();
});
configure_logger();
handle_events();
}
fn configure_logger() {