
# Monitoring and Alerting in Harmony

Harmony provides a unified, type-safe approach to monitoring and alerting across Kubernetes, OpenShift, and bare-metal infrastructure. This guide explains the architecture and how to use it at different levels of abstraction.

## Overview

Harmony's monitoring module supports three distinct use cases:

| Level | Who Uses It | What It Provides |
|---|---|---|
| Cluster | Cluster administrators | Full control over monitoring stack, cluster-wide alerts, external scrape targets |
| Tenant | Platform teams | Namespace-scoped monitoring in multi-tenant environments |
| Application | Application developers | Zero-config monitoring that "just works" |

Each level builds on the same underlying abstractions, ensuring consistency while providing appropriate complexity for each audience.

## Core Concepts

### AlertSender

An `AlertSender` represents the system that evaluates alert rules and sends notifications. Harmony supports multiple monitoring stacks:

| Sender | Description | Use When |
|---|---|---|
| `OpenshiftClusterAlertSender` | OKD/OpenShift built-in monitoring | Running on OKD/OpenShift |
| `KubePrometheus` | kube-prometheus-stack via Helm | Standard Kubernetes, need full stack |
| `Prometheus` | Standalone Prometheus | Custom Prometheus deployment |
| `RedHatClusterObservability` | RHOB operator | Red Hat managed clusters |
| `Grafana` | Grafana-managed alerting | Grafana as primary alerting layer |
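
Senders are lightweight marker types. A minimal sketch of the shape, following the pattern shown in "Adding a New Monitoring Stack" below (this definition is illustrative, not the crate's actual source):

```rust
// Illustrative sketch: a sender marker type carries identity, not behavior.
// Receivers and rules are parameterized over it.
#[derive(Debug, Clone)]
pub struct KubePrometheus;

impl AlertSender for KubePrometheus {
    fn name(&self) -> String {
        "KubePrometheus".to_string()
    }
}
```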

### AlertReceiver

An `AlertReceiver` defines where alerts are sent (Discord, Slack, email, webhook, etc.). Receivers are parameterized by sender type because each monitoring stack has a different configuration format.

```rust
pub trait AlertReceiver<S: AlertSender> {
    fn build(&self) -> Result<ReceiverInstallPlan, InterpretError>;
    fn name(&self) -> String;
}
```

Built-in receivers:

- `DiscordReceiver` - Discord webhooks
- `WebhookReceiver` - Generic HTTP webhooks
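
Because receivers are generic over the sender, one receiver type gains support for a new stack by adding one impl per sender. A sketch of the pattern (bodies elided; these impls are assumptions based on the "Adding a New Monitoring Stack" section below):

```rust
// Same receiver, two stacks: each impl emits configuration in the
// format its target monitoring stack expects.
impl AlertReceiver<OpenshiftClusterAlertSender> for DiscordReceiver {
    fn build(&self) -> Result<ReceiverInstallPlan, InterpretError> {
        todo!("emit OKD Alertmanager configuration")
    }
    fn name(&self) -> String {
        self.name.clone()
    }
}

impl AlertReceiver<KubePrometheus> for DiscordReceiver {
    fn build(&self) -> Result<ReceiverInstallPlan, InterpretError> {
        todo!("emit kube-prometheus-stack configuration")
    }
    fn name(&self) -> String {
        self.name.clone()
    }
}
```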

### AlertRule

An `AlertRule` defines a Prometheus alert expression. Rules are also parameterized by sender to handle different CRD formats.

```rust
pub trait AlertRule<S: AlertSender> {
    fn build_rule(&self) -> Result<serde_json::Value, InterpretError>;
    fn name(&self) -> String;
}
```

### Observability Capability

Topologies implement `Observability<S>` to indicate they support a specific alert sender:

```rust
impl Observability<OpenshiftClusterAlertSender> for K8sAnywhereTopology {
    async fn install_receivers(&self, sender, inventory, receivers) { ... }
    async fn install_rules(&self, sender, inventory, rules) { ... }
    // ...
}
```

This provides compile-time verification: if you try to use `OpenshiftClusterAlertScore` with a topology that doesn't implement `Observability<OpenshiftClusterAlertSender>`, the code won't compile.
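
For instance (a sketch: `BareMetalTopology` is a hypothetical topology, and the error text is approximate):

```rust
// Assuming BareMetalTopology implements Topology but NOT
// Observability<OpenshiftClusterAlertSender>, this fails to type-check:
let score: Box<dyn Score<BareMetalTopology>> =
    Box::new(OpenshiftClusterAlertScore { /* ... */ });
// error[E0277]: the trait bound `BareMetalTopology:
//     Observability<OpenshiftClusterAlertSender>` is not satisfied
```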


## Level 1: Cluster Monitoring

Cluster monitoring is for administrators who need full control over the monitoring infrastructure. This includes:

- Installing/managing the monitoring stack
- Configuring cluster-wide alert receivers
- Defining cluster-level alert rules
- Adding external scrape targets (e.g., bare-metal servers, firewalls)

### Example: OKD Cluster Alerts

```rust
use harmony::{
    modules::monitoring::{
        alert_channel::discord_alert_channel::DiscordReceiver,
        alert_rule::{alerts::k8s::pvc::high_pvc_fill_rate_over_two_days, prometheus_alert_rule::AlertManagerRuleGroup},
        okd::openshift_cluster_alerting_score::OpenshiftClusterAlertScore,
        scrape_target::prometheus_node_exporter::PrometheusNodeExporter,
    },
    topology::{K8sAnywhereTopology, monitoring::{AlertMatcher, AlertRoute, MatchOp}},
};

let severity_matcher = AlertMatcher {
    label: "severity".to_string(),
    operator: MatchOp::Eq,
    value: "critical".to_string(),
};

let rule_group = AlertManagerRuleGroup::new(
    "cluster-rules",
    vec![high_pvc_fill_rate_over_two_days()],
);

let external_exporter = PrometheusNodeExporter {
    job_name: "firewall".to_string(),
    metrics_path: "/metrics".to_string(),
    listen_address: ip!("192.168.1.1"),
    port: 9100,
    ..Default::default()
};

harmony_cli::run(
    Inventory::autoload(),
    K8sAnywhereTopology::from_env(),
    vec![Box::new(OpenshiftClusterAlertScore {
        sender: OpenshiftClusterAlertSender,
        receivers: vec![Box::new(DiscordReceiver {
            name: "critical-alerts".to_string(),
            url: hurl!("https://discord.com/api/webhooks/..."),
            route: AlertRoute {
                matchers: vec![severity_matcher],
                ..AlertRoute::default("critical-alerts".to_string())
            },
        })],
        rules: vec![Box::new(rule_group)],
        scrape_targets: Some(vec![Box::new(external_exporter)]),
    })],
    None,
).await?;
```

### What This Does

  1. Enables cluster monitoring - Activates OKD's built-in Prometheus
  2. Enables user workload monitoring - Allows namespace-scoped rules
  3. Configures Alertmanager - Adds Discord receiver with route matching
  4. Deploys alert rules - Creates AlertingRule CRD with PVC fill rate alert
  5. Adds external scrape target - Configures Prometheus to scrape the firewall

### Compile-Time Safety

The `OpenshiftClusterAlertScore` requires:

```rust
impl<T: Topology + Observability<OpenshiftClusterAlertSender>> Score<T>
    for OpenshiftClusterAlertScore
```

If `K8sAnywhereTopology` didn't implement `Observability<OpenshiftClusterAlertSender>`, this code would fail to compile. You cannot accidentally deploy OKD alerts to a cluster that doesn't support them.


## Level 2: Tenant Monitoring

In multi-tenant clusters, teams are often confined to specific namespaces. Tenant monitoring adapts to this constraint:

- Resources are deployed in the tenant's namespace
- Tenants cannot modify cluster-level monitoring configuration
- The topology determines the namespace context at runtime

### How It Works

The topology's `Observability` implementation handles tenant scoping:

```rust
impl Observability<KubePrometheus> for K8sAnywhereTopology {
    async fn install_rules(&self, sender, inventory, rules) {
        // Topology knows if it's tenant-scoped
        let namespace = self.get_tenant_config().await
            .map(|t| t.name)
            .unwrap_or_else(|| "monitoring".to_string());

        // Rules are installed in the appropriate namespace
        for rule in rules.unwrap_or_default() {
            let score = KubePrometheusRuleScore {
                sender: sender.clone(),
                rule,
                namespace: namespace.clone(), // Tenant namespace
            };
            score.create_interpret().execute(inventory, self).await?;
        }
    }
}
```

### Tenant vs Cluster Resources

| Resource | Cluster-Level | Tenant-Level |
|---|---|---|
| Alertmanager config | Global receivers | Namespaced receivers (where supported) |
| PrometheusRules | Cluster-wide alerts | Namespace alerts only |
| ServiceMonitors | Any namespace | Own namespace only |
| External scrape targets | Can add | Cannot add (cluster config) |

### Runtime Validation

Tenant constraints are validated at runtime via Kubernetes RBAC. If a tenant-scoped deployment attempts cluster-level operations, it fails with a clear permission error from the Kubernetes API.

This cannot be fully compile-time because tenant context is determined by who's running the code and what permissions they have—information only available at runtime.
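
In practice the failure mode is an HTTP 403 from the API server. A minimal sketch of surfacing it, assuming the kube crate is the underlying client (the function name is illustrative):

```rust
// Turn a Kubernetes RBAC denial into a readable, tenant-facing message.
fn explain_rbac_denial(err: &kube::Error) -> Option<String> {
    match err {
        kube::Error::Api(resp) if resp.code == 403 => Some(format!(
            "tenant lacks permission for this cluster-level operation: {}",
            resp.message
        )),
        _ => None,
    }
}
```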


## Level 3: Application Monitoring

Application monitoring provides zero-config, opinionated monitoring for developers. Just add the `Monitoring` feature to your application and it works.

### Example

```rust
use std::sync::Arc;

use harmony::modules::{
    application::{Application, ApplicationFeature},
    monitoring::alert_channel::webhook_receiver::WebhookReceiver,
};

// Define your application (wrapped in Arc so the feature can share it)
let my_app = Arc::new(MyApplication::new());

// Add monitoring as a feature
let monitoring = Monitoring {
    application: my_app.clone(),
    alert_receiver: vec![], // Uses defaults
};

// Install with the application
my_app.add_feature(monitoring);
```

### What Application Monitoring Provides

  1. Automatic ServiceMonitor - Creates a ServiceMonitor for your application's pods
  2. Ntfy Notification Channel - Auto-installs and configures Ntfy for push notifications
  3. Tenant Awareness - Automatically scopes to the correct namespace
  4. Sensible Defaults - Pre-configured alert routes and receivers

### Under the Hood

```rust
impl<T: Topology + Observability<Prometheus> + TenantManager>
    ApplicationFeature<T> for Monitoring
{
    async fn ensure_installed(&self, topology: &T) -> Result<...> {
        // 1. Get tenant namespace (or use app name)
        let namespace = topology.get_tenant_config().await
            .map(|ns| ns.name.clone())
            .unwrap_or_else(|| self.application.name());

        // 2. Create ServiceMonitor for the app
        let app_service_monitor = ServiceMonitor {
            metadata: ObjectMeta {
                name: Some(self.application.name()),
                namespace: Some(namespace.clone()),
                ..Default::default()
            },
            spec: ServiceMonitorSpec::default(),
        };

        // 3. Install Ntfy for notifications
        let ntfy = NtfyScore { namespace, host };
        ntfy.interpret(&Inventory::empty(), topology).await?;

        // 4. Wire up webhook receiver to Ntfy
        let ntfy_receiver = WebhookReceiver { ... };

        // 5. Execute monitoring score
        alerting_score.interpret(&Inventory::empty(), topology).await?;
    }
}
```

## Pre-Built Alert Rules

Harmony provides a library of common alert rules in `modules/monitoring/alert_rule/alerts/`:

### Kubernetes Alerts (`alerts/k8s/`)

```rust
use harmony::modules::monitoring::alert_rule::{
    alerts::k8s::{
        memory_usage::alert_high_memory_usage,
        pod::pod_failed,
        pvc::high_pvc_fill_rate_over_two_days,
    },
    prometheus_alert_rule::AlertManagerRuleGroup,
};

let rules = AlertManagerRuleGroup::new("k8s-rules", vec![
    pod_failed(),
    high_pvc_fill_rate_over_two_days(),
    alert_high_memory_usage(),
]);
```

Available rules:

- `pod_failed()` - Pod in failed state
- `alert_container_restarting()` - Container restart loop
- `alert_pod_not_ready()` - Pod not ready for extended period
- `high_pvc_fill_rate_over_two_days()` - PVC will fill within 2 days
- `alert_high_memory_usage()` - Memory usage above threshold
- `alert_high_cpu_usage()` - CPU usage above threshold

### Infrastructure Alerts (`alerts/infra/`)

```rust
use harmony::modules::monitoring::alert_rule::{
    alerts::infra::opnsense::high_http_error_rate,
    prometheus_alert_rule::AlertManagerRuleGroup,
};

let rules = AlertManagerRuleGroup::new("infra-rules", vec![
    high_http_error_rate(),
]);
```

### Creating Custom Rules

```rust
use harmony::modules::monitoring::alert_rule::prometheus_alert_rule::PrometheusAlertRule;

pub fn my_custom_alert() -> PrometheusAlertRule {
    PrometheusAlertRule::new("MyServiceDown", "up{job=\"my-service\"} == 0")
        .for_duration("5m")
        .label("severity", "critical")
        .annotation("summary", "My service is down")
        .annotation("description", "The my-service job has been down for more than 5 minutes")
}
```
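
Custom rules compose with the pre-built helpers in a rule group, which then plugs into a score's `rules` field exactly as in the cluster example above (`pod_failed` comes from the Kubernetes alerts shown earlier):

```rust
let rules = AlertManagerRuleGroup::new("my-service-rules", vec![
    my_custom_alert(),
    pod_failed(),
]);
```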

## Alert Receivers

### Discord Webhook

```rust
use harmony::modules::monitoring::alert_channel::discord_alert_channel::DiscordReceiver;
use harmony::topology::monitoring::{AlertRoute, AlertMatcher, MatchOp};

let discord = DiscordReceiver {
    name: "ops-alerts".to_string(),
    url: hurl!("https://discord.com/api/webhooks/123456/abcdef"),
    route: AlertRoute {
        receiver: "ops-alerts".to_string(),
        matchers: vec![AlertMatcher {
            label: "severity".to_string(),
            operator: MatchOp::Eq,
            value: "critical".to_string(),
        }],
        group_by: vec!["alertname".to_string()],
        repeat_interval: Some("30m".to_string()),
        continue_matching: false,
        children: vec![],
    },
};
```

### Generic Webhook

```rust
use harmony::modules::monitoring::alert_channel::webhook_receiver::WebhookReceiver;

let webhook = WebhookReceiver {
    name: "custom-webhook".to_string(),
    url: hurl!("https://api.example.com/alerts"),
    route: AlertRoute::default("custom-webhook".to_string()),
};
```

## Adding a New Monitoring Stack

To add support for a new monitoring stack:

1. Create the sender type in `modules/monitoring/my_sender/mod.rs`:

   ```rust
   #[derive(Debug, Clone)]
   pub struct MySender;

   impl AlertSender for MySender {
       fn name(&self) -> String { "MySender".to_string() }
   }
   ```

2. Define CRD types in `modules/monitoring/my_sender/crd/`:

   ```rust
   #[derive(CustomResource, Debug, Serialize, Deserialize, Clone)]
   #[kube(group = "monitoring.example.com", version = "v1", kind = "MyAlertRule")]
   pub struct MyAlertRuleSpec { ... }
   ```

3. Implement `Observability` in `domain/topology/k8s_anywhere/observability/my_sender.rs`:

   ```rust
   impl Observability<MySender> for K8sAnywhereTopology {
       async fn install_receivers(&self, sender, inventory, receivers) { ... }
       async fn install_rules(&self, sender, inventory, rules) { ... }
       // ...
   }
   ```

4. Implement receiver conversions for existing receivers:

   ```rust
   impl AlertReceiver<MySender> for DiscordReceiver {
       fn build(&self) -> Result<ReceiverInstallPlan, InterpretError> {
           // Convert DiscordReceiver to MySender's format
       }
   }
   ```

5. Create score types:

   ```rust
   pub struct MySenderAlertScore {
       pub sender: MySender,
       pub receivers: Vec<Box<dyn AlertReceiver<MySender>>>,
       pub rules: Vec<Box<dyn AlertRule<MySender>>>,
   }
   ```
    
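Once these pieces exist, the new score is used the same way as the built-in ones. A sketch reusing the hypothetical `MySender` types from the steps above:

```rust
harmony_cli::run(
    Inventory::autoload(),
    K8sAnywhereTopology::from_env(),
    vec![Box::new(MySenderAlertScore {
        sender: MySender,
        receivers: vec![Box::new(DiscordReceiver { /* ... */ })],
        rules: vec![],
    })],
    None,
).await?;
```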

## Architecture Principles

### Type Safety Over Flexibility

Each monitoring stack has distinct CRDs and configuration formats. Rather than a unified `MonitoringStack` type that loses stack-specific features, we use generic traits that provide type safety while allowing each stack to express its unique configuration.

### Compile-Time Capability Verification

The `Observability<S>` bound ensures you can't deploy OKD alerts to a `KubePrometheus` cluster. The compiler catches platform mismatches before deployment.

### Explicit Over Implicit

Monitoring stacks are chosen explicitly (`OpenshiftClusterAlertSender` vs `KubePrometheus`). There's no "auto-detection" that could lead to surprising behavior.

### Three Levels, One Foundation

Cluster, tenant, and application monitoring all use the same traits (`AlertSender`, `AlertReceiver`, `AlertRule`). The difference is in how scores are constructed and how topologies interpret them.
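
Concretely, helper code can be written once, generic over the sender. A small sketch against the traits shown earlier (the helper itself is illustrative, not part of the crate):

```rust
// Works for any monitoring stack, because it relies only on the shared traits.
fn receiver_names<S: AlertSender>(receivers: &[Box<dyn AlertReceiver<S>>]) -> Vec<String> {
    receivers.iter().map(|r| r.name()).collect()
}
```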