
# Monitoring and Alerting in Harmony

Harmony provides a unified, type-safe approach to monitoring and alerting across Kubernetes, OpenShift, and bare-metal infrastructure. This guide explains the architecture and how to use it at different levels of abstraction.

## Overview

Harmony's monitoring module supports three distinct use cases:

| Level | Who Uses It | What It Provides |
|---|---|---|
| Cluster | Cluster administrators | Full control over monitoring stack, cluster-wide alerts, external scrape targets |
| Tenant | Platform teams | Namespace-scoped monitoring in multi-tenant environments |
| Application | Application developers | Zero-config monitoring that "just works" |

Each level builds on the same underlying abstractions, ensuring consistency while providing appropriate complexity for each audience.

## Core Concepts

### AlertSender

An `AlertSender` represents the system that evaluates alert rules and sends notifications. Harmony supports multiple monitoring stacks:

| Sender | Description | Use When |
|---|---|---|
| `OpenshiftClusterAlertSender` | OKD/OpenShift built-in monitoring | Running on OKD/OpenShift |
| `KubePrometheus` | kube-prometheus-stack via Helm | Standard Kubernetes, need full stack |
| `Prometheus` | Standalone Prometheus | Custom Prometheus deployment |
| `RedHatClusterObservability` | RHOB operator | Red Hat managed clusters |
| `Grafana` | Grafana-managed alerting | Grafana as primary alerting layer |
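
Senders are lightweight marker types. A minimal sketch of the shape, following the pattern shown in "Adding a New Monitoring Stack" below (this definition is illustrative, not the crate's actual source):

```rust
// Illustrative sketch: a sender marker type carries identity, not behavior.
// Receivers and rules are parameterized over it.
#[derive(Debug, Clone)]
pub struct KubePrometheus;

impl AlertSender for KubePrometheus {
    fn name(&self) -> String {
        "KubePrometheus".to_string()
    }
}
```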

### AlertReceiver

An `AlertReceiver` defines where alerts are sent (Discord, Slack, email, webhook, etc.). Receivers are parameterized by sender type because each monitoring stack has a different configuration format.

```rust
pub trait AlertReceiver<S: AlertSender> {
    fn build(&self) -> Result<ReceiverInstallPlan, InterpretError>;
    fn name(&self) -> String;
}
```

Built-in receivers:

- `DiscordReceiver` - Discord webhooks
- `WebhookReceiver` - Generic HTTP webhooks
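
Because receivers are generic over the sender, one receiver type gains support for a new stack by adding one impl per sender. A sketch of the pattern (bodies elided; these impls are assumptions based on the "Adding a New Monitoring Stack" section below):

```rust
// Same receiver, two stacks: each impl emits configuration in the
// format its target monitoring stack expects.
impl AlertReceiver<OpenshiftClusterAlertSender> for DiscordReceiver {
    fn build(&self) -> Result<ReceiverInstallPlan, InterpretError> {
        todo!("emit OKD Alertmanager configuration")
    }
    fn name(&self) -> String {
        self.name.clone()
    }
}

impl AlertReceiver<KubePrometheus> for DiscordReceiver {
    fn build(&self) -> Result<ReceiverInstallPlan, InterpretError> {
        todo!("emit kube-prometheus-stack configuration")
    }
    fn name(&self) -> String {
        self.name.clone()
    }
}
```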

### AlertRule

An `AlertRule` defines a Prometheus alert expression. Rules are also parameterized by sender to handle different CRD formats.

```rust
pub trait AlertRule<S: AlertSender> {
    fn build_rule(&self) -> Result<serde_json::Value, InterpretError>;
    fn name(&self) -> String;
}
```

### Observability Capability

Topologies implement `Observability<S>` to indicate they support a specific alert sender:

```rust
impl Observability<OpenshiftClusterAlertSender> for K8sAnywhereTopology {
    async fn install_receivers(&self, sender, inventory, receivers) { ... }
    async fn install_rules(&self, sender, inventory, rules) { ... }
    // ...
}
```

This provides compile-time verification: if you try to use `OpenshiftClusterAlertScore` with a topology that doesn't implement `Observability<OpenshiftClusterAlertSender>`, the code won't compile.
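
For instance (a sketch: `BareMetalTopology` is a hypothetical topology, and the error text is approximate):

```rust
// Assuming BareMetalTopology implements Topology but NOT
// Observability<OpenshiftClusterAlertSender>, this fails to type-check:
let score: Box<dyn Score<BareMetalTopology>> =
    Box::new(OpenshiftClusterAlertScore { /* ... */ });
// error[E0277]: the trait bound `BareMetalTopology:
//     Observability<OpenshiftClusterAlertSender>` is not satisfied
```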


## Level 1: Cluster Monitoring

Cluster monitoring is for administrators who need full control over the monitoring infrastructure. This includes:

- Installing/managing the monitoring stack
- Configuring cluster-wide alert receivers
- Defining cluster-level alert rules
- Adding external scrape targets (e.g., bare-metal servers, firewalls)

### Example: OKD Cluster Alerts

```rust
use harmony::{
    modules::monitoring::{
        alert_channel::discord_alert_channel::DiscordReceiver,
        alert_rule::{alerts::k8s::pvc::high_pvc_fill_rate_over_two_days, prometheus_alert_rule::AlertManagerRuleGroup},
        okd::openshift_cluster_alerting_score::OpenshiftClusterAlertScore,
        scrape_target::prometheus_node_exporter::PrometheusNodeExporter,
    },
    topology::{K8sAnywhereTopology, monitoring::{AlertMatcher, AlertRoute, MatchOp}},
};

let severity_matcher = AlertMatcher {
    label: "severity".to_string(),
    operator: MatchOp::Eq,
    value: "critical".to_string(),
};

let rule_group = AlertManagerRuleGroup::new(
    "cluster-rules",
    vec![high_pvc_fill_rate_over_two_days()],
);

let external_exporter = PrometheusNodeExporter {
    job_name: "firewall".to_string(),
    metrics_path: "/metrics".to_string(),
    listen_address: ip!("192.168.1.1"),
    port: 9100,
    ..Default::default()
};

harmony_cli::run(
    Inventory::autoload(),
    K8sAnywhereTopology::from_env(),
    vec![Box::new(OpenshiftClusterAlertScore {
        sender: OpenshiftClusterAlertSender,
        receivers: vec![Box::new(DiscordReceiver {
            name: "critical-alerts".to_string(),
            url: hurl!("https://discord.com/api/webhooks/..."),
            route: AlertRoute {
                matchers: vec![severity_matcher],
                ..AlertRoute::default("critical-alerts".to_string())
            },
        })],
        rules: vec![Box::new(rule_group)],
        scrape_targets: Some(vec![Box::new(external_exporter)]),
    })],
    None,
).await?;
```

### What This Does

  1. Enables cluster monitoring - Activates OKD's built-in Prometheus
  2. Enables user workload monitoring - Allows namespace-scoped rules
  3. Configures Alertmanager - Adds Discord receiver with route matching
  4. Deploys alert rules - Creates AlertingRule CRD with PVC fill rate alert
  5. Adds external scrape target - Configures Prometheus to scrape the firewall

### Compile-Time Safety

The `OpenshiftClusterAlertScore` requires:

```rust
impl<T: Topology + Observability<OpenshiftClusterAlertSender>> Score<T>
    for OpenshiftClusterAlertScore
```

If `K8sAnywhereTopology` didn't implement `Observability<OpenshiftClusterAlertSender>`, this code would fail to compile. You cannot accidentally deploy OKD alerts to a cluster that doesn't support them.


## Level 2: Tenant Monitoring

In multi-tenant clusters, teams are often confined to specific namespaces. Tenant monitoring adapts to this constraint:

- Resources are deployed in the tenant's namespace
- Tenants cannot modify cluster-level monitoring configuration
- The topology determines the namespace context at runtime

### How It Works

The topology's `Observability` implementation handles tenant scoping:

```rust
impl Observability<KubePrometheus> for K8sAnywhereTopology {
    async fn install_rules(&self, sender, inventory, rules) {
        // Topology knows if it's tenant-scoped
        let namespace = self.get_tenant_config().await
            .map(|t| t.name)
            .unwrap_or_else(|| "monitoring".to_string());

        // Rules are installed in the appropriate namespace
        for rule in rules.unwrap_or_default() {
            let score = KubePrometheusRuleScore {
                sender: sender.clone(),
                rule,
                namespace: namespace.clone(), // Tenant namespace
            };
            score.create_interpret().execute(inventory, self).await?;
        }
    }
}
```

### Tenant vs Cluster Resources

| Resource | Cluster-Level | Tenant-Level |
|---|---|---|
| Alertmanager config | Global receivers | Namespaced receivers (where supported) |
| PrometheusRules | Cluster-wide alerts | Namespace alerts only |
| ServiceMonitors | Any namespace | Own namespace only |
| External scrape targets | Can add | Cannot add (cluster config) |

### Runtime Validation

Tenant constraints are validated at runtime via Kubernetes RBAC. If a tenant-scoped deployment attempts cluster-level operations, it fails with a clear permission error from the Kubernetes API.

This cannot be fully compile-time because tenant context is determined by who's running the code and what permissions they have—information only available at runtime.
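
In practice the failure mode is an HTTP 403 from the API server. A minimal sketch of surfacing it, assuming the kube crate is the underlying client (the function name is illustrative):

```rust
// Turn a Kubernetes RBAC denial into a readable, tenant-facing message.
fn explain_rbac_denial(err: &kube::Error) -> Option<String> {
    match err {
        kube::Error::Api(resp) if resp.code == 403 => Some(format!(
            "tenant lacks permission for this cluster-level operation: {}",
            resp.message
        )),
        _ => None,
    }
}
```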


## Level 3: Application Monitoring

Application monitoring provides zero-config, opinionated monitoring for developers. Just add the `Monitoring` feature to your application and it works.

### Example

```rust
use std::sync::Arc;

use harmony::modules::{
    application::{Application, ApplicationFeature},
    monitoring::alert_channel::webhook_receiver::WebhookReceiver,
};

// Define your application (wrapped in Arc so the feature can share it)
let my_app = Arc::new(MyApplication::new());

// Add monitoring as a feature
let monitoring = Monitoring {
    application: my_app.clone(),
    alert_receiver: vec![], // Uses defaults
};

// Install with the application
my_app.add_feature(monitoring);
```

### What Application Monitoring Provides

  1. Automatic ServiceMonitor - Creates a ServiceMonitor for your application's pods
  2. Ntfy Notification Channel - Auto-installs and configures Ntfy for push notifications
  3. Tenant Awareness - Automatically scopes to the correct namespace
  4. Sensible Defaults - Pre-configured alert routes and receivers

### Under the Hood

```rust
impl<T: Topology + Observability<Prometheus> + TenantManager>
    ApplicationFeature<T> for Monitoring
{
    async fn ensure_installed(&self, topology: &T) -> Result<...> {
        // 1. Get tenant namespace (or use app name)
        let namespace = topology.get_tenant_config().await
            .map(|ns| ns.name.clone())
            .unwrap_or_else(|| self.application.name());

        // 2. Create ServiceMonitor for the app
        let app_service_monitor = ServiceMonitor {
            metadata: ObjectMeta {
                name: Some(self.application.name()),
                namespace: Some(namespace.clone()),
                ..Default::default()
            },
            spec: ServiceMonitorSpec::default(),
        };

        // 3. Install Ntfy for notifications
        let ntfy = NtfyScore { namespace, host };
        ntfy.interpret(&Inventory::empty(), topology).await?;

        // 4. Wire up webhook receiver to Ntfy
        let ntfy_receiver = WebhookReceiver { ... };

        // 5. Execute monitoring score
        alerting_score.interpret(&Inventory::empty(), topology).await?;
    }
}
```

## Pre-Built Alert Rules

Harmony provides a library of common alert rules in `modules/monitoring/alert_rule/alerts/`:

### Kubernetes Alerts (`alerts/k8s/`)

```rust
use harmony::modules::monitoring::alert_rule::{
    alerts::k8s::{
        memory_usage::alert_high_memory_usage,
        pod::pod_failed,
        pvc::high_pvc_fill_rate_over_two_days,
    },
    prometheus_alert_rule::AlertManagerRuleGroup,
};

let rules = AlertManagerRuleGroup::new("k8s-rules", vec![
    pod_failed(),
    high_pvc_fill_rate_over_two_days(),
    alert_high_memory_usage(),
]);
```

Available rules:

- `pod_failed()` - Pod in failed state
- `alert_container_restarting()` - Container restart loop
- `alert_pod_not_ready()` - Pod not ready for extended period
- `high_pvc_fill_rate_over_two_days()` - PVC will fill within 2 days
- `alert_high_memory_usage()` - Memory usage above threshold
- `alert_high_cpu_usage()` - CPU usage above threshold

### Infrastructure Alerts (`alerts/infra/`)

```rust
use harmony::modules::monitoring::alert_rule::{
    alerts::infra::opnsense::high_http_error_rate,
    prometheus_alert_rule::AlertManagerRuleGroup,
};

let rules = AlertManagerRuleGroup::new("infra-rules", vec![
    high_http_error_rate(),
]);
```

### Creating Custom Rules

```rust
use harmony::modules::monitoring::alert_rule::prometheus_alert_rule::PrometheusAlertRule;

pub fn my_custom_alert() -> PrometheusAlertRule {
    PrometheusAlertRule::new("MyServiceDown", "up{job=\"my-service\"} == 0")
        .for_duration("5m")
        .label("severity", "critical")
        .annotation("summary", "My service is down")
        .annotation("description", "The my-service job has been down for more than 5 minutes")
}
```
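
Custom rules compose with the pre-built helpers in a rule group, which then plugs into a score's `rules` field exactly as in the cluster example above (`pod_failed` comes from the Kubernetes alerts shown earlier):

```rust
let rules = AlertManagerRuleGroup::new("my-service-rules", vec![
    my_custom_alert(),
    pod_failed(),
]);
```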

## Alert Receivers

### Discord Webhook

```rust
use harmony::modules::monitoring::alert_channel::discord_alert_channel::DiscordReceiver;
use harmony::topology::monitoring::{AlertRoute, AlertMatcher, MatchOp};

let discord = DiscordReceiver {
    name: "ops-alerts".to_string(),
    url: hurl!("https://discord.com/api/webhooks/123456/abcdef"),
    route: AlertRoute {
        receiver: "ops-alerts".to_string(),
        matchers: vec![AlertMatcher {
            label: "severity".to_string(),
            operator: MatchOp::Eq,
            value: "critical".to_string(),
        }],
        group_by: vec!["alertname".to_string()],
        repeat_interval: Some("30m".to_string()),
        continue_matching: false,
        children: vec![],
    },
};
```

### Generic Webhook

```rust
use harmony::modules::monitoring::alert_channel::webhook_receiver::WebhookReceiver;

let webhook = WebhookReceiver {
    name: "custom-webhook".to_string(),
    url: hurl!("https://api.example.com/alerts"),
    route: AlertRoute::default("custom-webhook".to_string()),
};
```

## Adding a New Monitoring Stack

To add support for a new monitoring stack:

1. Create the sender type in `modules/monitoring/my_sender/mod.rs`:

   ```rust
   #[derive(Debug, Clone)]
   pub struct MySender;

   impl AlertSender for MySender {
       fn name(&self) -> String { "MySender".to_string() }
   }
   ```

2. Define CRD types in `modules/monitoring/my_sender/crd/`:

   ```rust
   #[derive(CustomResource, Debug, Serialize, Deserialize, Clone)]
   #[kube(group = "monitoring.example.com", version = "v1", kind = "MyAlertRule")]
   pub struct MyAlertRuleSpec { ... }
   ```

3. Implement `Observability` in `domain/topology/k8s_anywhere/observability/my_sender.rs`:

   ```rust
   impl Observability<MySender> for K8sAnywhereTopology {
       async fn install_receivers(&self, sender, inventory, receivers) { ... }
       async fn install_rules(&self, sender, inventory, rules) { ... }
       // ...
   }
   ```

4. Implement receiver conversions for existing receivers:

   ```rust
   impl AlertReceiver<MySender> for DiscordReceiver {
       fn build(&self) -> Result<ReceiverInstallPlan, InterpretError> {
           // Convert DiscordReceiver to MySender's format
       }
   }
   ```

5. Create score types:

   ```rust
   pub struct MySenderAlertScore {
       pub sender: MySender,
       pub receivers: Vec<Box<dyn AlertReceiver<MySender>>>,
       pub rules: Vec<Box<dyn AlertRule<MySender>>>,
   }
   ```
    
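Once these pieces exist, the new score is used the same way as the built-in ones. A sketch reusing the hypothetical `MySender` types from the steps above:

```rust
harmony_cli::run(
    Inventory::autoload(),
    K8sAnywhereTopology::from_env(),
    vec![Box::new(MySenderAlertScore {
        sender: MySender,
        receivers: vec![Box::new(DiscordReceiver { /* ... */ })],
        rules: vec![],
    })],
    None,
).await?;
```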

## Architecture Principles

### Type Safety Over Flexibility

Each monitoring stack has distinct CRDs and configuration formats. Rather than a unified `MonitoringStack` type that loses stack-specific features, we use generic traits that provide type safety while allowing each stack to express its unique configuration.

### Compile-Time Capability Verification

The `Observability<S>` bound ensures you can't deploy OKD alerts to a `KubePrometheus` cluster. The compiler catches platform mismatches before deployment.

### Explicit Over Implicit

Monitoring stacks are chosen explicitly (`OpenshiftClusterAlertSender` vs `KubePrometheus`). There's no "auto-detection" that could lead to surprising behavior.

### Three Levels, One Foundation

Cluster, tenant, and application monitoring all use the same traits (`AlertSender`, `AlertReceiver`, `AlertRule`). The difference is in how scores are constructed and how topologies interpret them.
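
Concretely, helper code can be written once, generic over the sender. A small sketch against the traits shown earlier (the helper itself is illustrative, not part of the crate):

```rust
// Works for any monitoring stack, because it relies only on the shared traits.
fn receiver_names<S: AlertSender>(receivers: &[Box<dyn AlertReceiver<S>>]) -> Vec<String> {
    receivers.iter().map(|r| r.name()).collect()
}
```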