
Architecture Decision Record: Monitoring and Alerting Architecture

Initial Authors: Willem Rolleman, Jean-Gabriel Carrier

Initial Date: March 9, 2026

Last Updated Date: March 9, 2026

Status

Accepted

Supersedes: ADR-010

Context

Harmony needs a unified approach to monitoring and alerting across different infrastructure targets:

  1. Cluster-level monitoring: Administrators managing entire Kubernetes/OKD clusters need to define cluster-wide alerts, receivers, and scrape targets.

  2. Tenant-level monitoring: Multi-tenant clusters where teams are confined to namespaces need monitoring scoped to their resources.

  3. Application-level monitoring: Developers deploying applications want zero-config monitoring that "just works" for their services.

The monitoring landscape is fragmented:

  • OKD/OpenShift: Built-in Prometheus with AlertmanagerConfig CRDs
  • KubePrometheus: Helm-based stack with PrometheusRule CRDs
  • RHOB (Red Hat Observability): Operator-based with MonitoringStack CRDs
  • Standalone Prometheus: Raw Prometheus deployments

Each system has different CRDs, different installation methods, and different configuration APIs.

Decision

We implement a trait-based architecture with compile-time capability verification that provides:

  1. Type-safe abstractions via parameterized traits: AlertReceiver<S>, AlertRule<S>, ScrapeTarget<S>
  2. Compile-time topology compatibility via the Observability<S> capability bound
  3. Three levels of abstraction: Cluster, Tenant, and Application monitoring
  4. Pre-built alert rules as functions that return typed structs

Core Traits

// domain/topology/monitoring.rs

/// Marker trait for systems that send alerts (Prometheus, etc.)
pub trait AlertSender: Send + Sync + std::fmt::Debug {
    fn name(&self) -> String;
}

/// Defines how a receiver (Discord, Slack, etc.) builds its configuration
/// for a specific sender type
pub trait AlertReceiver<S: AlertSender>: std::fmt::Debug + Send + Sync {
    fn build(&self) -> Result<ReceiverInstallPlan, InterpretError>;
    fn name(&self) -> String;
    fn clone_box(&self) -> Box<dyn AlertReceiver<S>>;
}

/// Defines how an alert rule builds its PrometheusRule configuration
pub trait AlertRule<S: AlertSender>: std::fmt::Debug + Send + Sync {
    fn build_rule(&self) -> Result<serde_json::Value, InterpretError>;
    fn name(&self) -> String;
    fn clone_box(&self) -> Box<dyn AlertRule<S>>;
}

/// Capability that topologies implement to support monitoring
pub trait Observability<S: AlertSender> {
    async fn install_alert_sender(&self, sender: &S, inventory: &Inventory) 
        -> Result<PreparationOutcome, PreparationError>;
    async fn install_receivers(&self, sender: &S, inventory: &Inventory, 
        receivers: Option<Vec<Box<dyn AlertReceiver<S>>>>) -> Result<...>;
    async fn install_rules(&self, sender: &S, inventory: &Inventory,
        rules: Option<Vec<Box<dyn AlertRule<S>>>>) -> Result<...>;
    async fn add_scrape_targets(&self, sender: &S, inventory: &Inventory,
        scrape_targets: Option<Vec<Box<dyn ScrapeTarget<S>>>>) -> Result<...>;
    async fn ensure_monitoring_installed(&self, sender: &S, inventory: &Inventory) 
        -> Result<...>;
}

Alert Sender Types

Each monitoring stack is a distinct AlertSender:

| Sender | Module | Use Case |
| --- | --- | --- |
| OpenshiftClusterAlertSender | monitoring/okd/ | OKD/OpenShift built-in monitoring |
| KubePrometheus | monitoring/kube_prometheus/ | Helm-deployed kube-prometheus-stack |
| Prometheus | monitoring/prometheus/ | Standalone Prometheus via Helm |
| RedHatClusterObservability | monitoring/red_hat_cluster_observability/ | RHOB operator |
| Grafana | monitoring/grafana/ | Grafana-managed alerting |

Three Levels of Monitoring

1. Cluster-Level Monitoring

For cluster administrators. Full control over monitoring infrastructure.

// examples/okd_cluster_alerts/src/main.rs
OpenshiftClusterAlertScore {
    sender: OpenshiftClusterAlertSender,
    receivers: vec![Box::new(DiscordReceiver { ... })],
    rules: vec![Box::new(alert_rules)],
    scrape_targets: Some(vec![Box::new(external_exporters)]),
}

Characteristics:

  • Cluster-scoped CRDs and resources
  • Can add external scrape targets (outside cluster)
  • Manages Alertmanager configuration
  • Requires cluster-admin privileges

2. Tenant-Level Monitoring

For teams confined to namespaces. The topology determines tenant context.

// The topology's Observability impl handles namespace scoping
impl Observability<KubePrometheus> for K8sAnywhereTopology {
    async fn install_rules(&self, sender: &KubePrometheus, ...) {
        // Topology knows if it's tenant-scoped
        let namespace = self.get_tenant_config().await
            .map(|t| t.name)
            .unwrap_or_else(|| "default".to_string());
        // Install rules in tenant namespace
    }
}

Characteristics:

  • Namespace-scoped resources
  • Cannot modify cluster-level monitoring config
  • May have restricted receiver types
  • Runtime validation of permissions (cannot be fully compile-time)
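The namespace-resolution fallback shown in the tenant example above can be sketched as a standalone, synchronous function. The types here are hypothetical simplifications; Harmony's real topology API is async and returns richer types.

```rust
// Hypothetical, simplified stand-ins for the tenant-aware topology.
struct TenantConfig {
    name: String,
}

struct ScopedTopology {
    tenant: Option<TenantConfig>,
}

impl ScopedTopology {
    /// Resolve the namespace rules should be installed in: the tenant's
    /// namespace when confined, otherwise a default.
    fn target_namespace(&self) -> String {
        self.tenant
            .as_ref()
            .map(|t| t.name.clone())
            .unwrap_or_else(|| "default".to_string())
    }
}
```

The actual permission check still happens at the Kubernetes API: if the caller lacks access to the resolved namespace, the install fails with a clear RBAC error.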

3. Application-Level Monitoring

For developers. Zero-config, opinionated monitoring.

// modules/application/features/monitoring.rs
pub struct Monitoring {
    pub application: Arc<dyn Application>,
    pub alert_receiver: Vec<Box<dyn AlertReceiver<Prometheus>>>,
}

impl<T: Topology + Observability<Prometheus> + TenantManager + ...> 
    ApplicationFeature<T> for Monitoring 
{
    async fn ensure_installed(&self, topology: &T) -> Result<...> {
        // Auto-creates ServiceMonitor
        // Auto-installs Ntfy for notifications
        // Handles tenant namespace automatically
        // Wires up sensible defaults
    }
}

Characteristics:

  • Automatic ServiceMonitor creation
  • Opinionated notification channel (Ntfy)
  • Tenant-aware via topology
  • Minimal configuration required

Rationale

Why Generic Traits Instead of Unified Types?

Each monitoring stack (OKD, KubePrometheus, RHOB) has fundamentally different CRDs:

// OKD uses AlertmanagerConfig with different structure
AlertmanagerConfig { spec: { receivers: [...] } }

// RHOB uses secret references for webhook URLs
MonitoringStack { spec: { alertmanagerConfig: { discordConfigs: [{ apiURL: { key: "..." } }] } } }

// KubePrometheus uses Alertmanager CRD with different field names
Alertmanager { spec: { config: { receivers: [...] } } }

A unified type would either:

  1. Be a lowest-common-denominator (loses stack-specific features)
  2. Be a complex union type (hard to use, easy to misconfigure)

Generic traits let each stack express its configuration naturally while providing a consistent interface.
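As a sketch of how one receiver type expresses two stack-specific configurations: the field names and JSON layouts below are illustrative only, not the real DiscordReceiver or CRD types.

```rust
// Illustrative sketch: one receiver, two sender-specific outputs.
trait AlertSender {}

#[derive(Debug)]
struct KubePrometheus;
#[derive(Debug)]
struct RedHatClusterObservability;
impl AlertSender for KubePrometheus {}
impl AlertSender for RedHatClusterObservability {}

trait AlertReceiver<S: AlertSender> {
    fn build(&self) -> String;
}

struct DiscordReceiver {
    webhook_url: String,
}

// kube-prometheus style: webhook URL inlined into the Alertmanager config.
impl AlertReceiver<KubePrometheus> for DiscordReceiver {
    fn build(&self) -> String {
        format!(r#"{{"discord_configs":[{{"webhook_url":"{}"}}]}}"#, self.webhook_url)
    }
}

// RHOB style: webhook URL referenced indirectly through a secret key.
impl AlertReceiver<RedHatClusterObservability> for DiscordReceiver {
    fn build(&self) -> String {
        r#"{"discordConfigs":[{"apiURL":{"key":"webhook-url"}}]}"#.to_string()
    }
}
```

Each `impl` owns its stack's quirks; callers only see the `AlertReceiver<S>` interface.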

Why Compile-Time Capability Bounds?

impl<T: Topology + Observability<OpenshiftClusterAlertSender>> Score<T> 
    for OpenshiftClusterAlertScore { ... }

This fails at compile time if you try to use OpenshiftClusterAlertScore with a topology that doesn't support OKD monitoring. This prevents the "config-is-valid-but-platform-is-wrong" errors that Harmony was designed to eliminate.
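A self-contained sketch of this compile-time gate, with simplified synchronous signatures and a hypothetical BareMetalTopology that lacks the capability:

```rust
// Simplified stand-ins for Harmony's traits, to show the bound in isolation.
trait AlertSender {
    fn name(&self) -> String;
}

#[derive(Debug)]
struct OpenshiftClusterAlertSender;
impl AlertSender for OpenshiftClusterAlertSender {
    fn name(&self) -> String {
        "openshift-cluster".to_string()
    }
}

trait Observability<S: AlertSender> {
    fn ensure_monitoring_installed(&self, sender: &S) -> Result<String, String>;
}

struct OkdTopology;
impl Observability<OpenshiftClusterAlertSender> for OkdTopology {
    fn ensure_monitoring_installed(
        &self,
        sender: &OpenshiftClusterAlertSender,
    ) -> Result<String, String> {
        Ok(format!("installed {}", sender.name()))
    }
}

struct BareMetalTopology; // deliberately has no Observability impl

// The capability bound: only topologies implementing the trait compile here.
fn apply_score<T: Observability<OpenshiftClusterAlertSender>>(t: &T) -> Result<String, String> {
    t.ensure_monitoring_installed(&OpenshiftClusterAlertSender)
}

// apply_score(&BareMetalTopology); // ← rejected at compile time: bound not satisfied
```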

Why Not a MonitoringStack Abstraction (V2 Approach)?

The V2 approach proposed a unified MonitoringStack that hides sender selection:

// V2 approach - rejected
MonitoringStack::new(MonitoringApiVersion::V2CRD)
    .add_alert_channel(discord)

Problems:

  1. Hides which sender you're using, losing compile-time guarantees
  2. "Version selection" actually chooses between fundamentally different systems
  3. Would need to handle all stack-specific features through a generic interface

The current approach is explicit: you choose OpenshiftClusterAlertSender and the compiler verifies compatibility.

Why Runtime Validation for Tenants?

Tenant confinement is determined at runtime by the topology and K8s RBAC. We cannot know at compile time whether a user has cluster-admin or namespace-only access.

Options considered:

  1. Compile-time tenant markers - Would require modeling entire RBAC hierarchy in types. Over-engineering.
  2. Runtime validation - Current approach. Fails with clear K8s permission errors if insufficient access.
  3. No tenant support - Would exclude a major use case.

Runtime validation is the pragmatic choice. The failure mode is clear (K8s API error) and occurs early in execution.

Note: we will eventually have compile-time validation for such things. Rust macros are powerful, and we could discover the actual capabilities we are dealing with, similar to sqlx's approach in its query! macros.

Consequences

Pros

  1. Type Safety: Invalid configurations are caught at compile time
  2. Extensibility: Adding a new monitoring stack requires implementing traits, not modifying core code
  3. Clear Separation: Cluster/Tenant/Application levels have distinct entry points
  4. Reusable Rules: Pre-built alert rules as functions (high_pvc_fill_rate_over_two_days())
  5. CRD Accuracy: Type definitions match actual Kubernetes CRDs exactly

Cons

  1. Implementation Explosion: DiscordReceiver implements AlertReceiver<S> for each sender type (3+ implementations)
  2. Learning Curve: Understanding the trait hierarchy takes time
  3. clone_box Boilerplate: Required for trait object cloning (3 lines per impl)

Mitigations

  • Implementation explosion is contained: each receiver type has O(senders) implementations, but receivers are rare compared to rules
  • Learning curve is documented with examples at each level
  • clone_box boilerplate is minimal and copy-paste
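The clone_box pattern mentioned above, sketched with simplified traits: one blanket impl makes every boxed receiver cloneable, at the cost of three boilerplate lines per receiver impl.

```rust
// Simplified marker types for illustration.
trait AlertSender {}

#[derive(Debug)]
struct Prometheus;
impl AlertSender for Prometheus {}

trait AlertReceiver<S: AlertSender> {
    fn name(&self) -> String;
    fn clone_box(&self) -> Box<dyn AlertReceiver<S>>;
}

// One blanket impl: boxed trait objects become Clone via clone_box.
impl<S: AlertSender> Clone for Box<dyn AlertReceiver<S>> {
    fn clone(&self) -> Self {
        self.clone_box()
    }
}

#[derive(Clone)]
struct WebhookReceiver;

impl AlertReceiver<Prometheus> for WebhookReceiver {
    fn name(&self) -> String {
        "webhook".to_string()
    }
    // The three lines of per-impl boilerplate:
    fn clone_box(&self) -> Box<dyn AlertReceiver<Prometheus>> {
        Box::new(self.clone())
    }
}
```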

Alternatives Considered

Unified MonitoringStack Type

See "Why Not a MonitoringStack Abstraction" above. Rejected for losing compile-time safety.

Helm-Only Approach

Use HelmScore directly for each monitoring deployment. Rejected because:

  • No type safety for alert rules
  • Cannot compose with application features
  • No tenant awareness

Separate Modules Per Use Case

Have cluster_monitoring/, tenant_monitoring/, app_monitoring/ as separate modules. Rejected because:

  • Massive code duplication
  • No shared abstraction for receivers/rules
  • Adding a feature requires three implementations

Implementation Notes

Module Structure

modules/monitoring/
├── mod.rs                     # Public exports
├── alert_channel/             # Receivers (Discord, Webhook)
├── alert_rule/                # Rules and pre-built alerts
│   ├── prometheus_alert_rule.rs
│   └── alerts/                # Library of pre-built rules
│       ├── k8s/               # K8s-specific (pvc, pod, memory)
│       └── infra/             # Infrastructure (opnsense, dell)
├── okd/                       # OpenshiftClusterAlertSender
├── kube_prometheus/           # KubePrometheus
├── prometheus/                # Prometheus
├── red_hat_cluster_observability/  # RHOB
├── grafana/                   # Grafana
├── application_monitoring/    # Application-level scores
└── scrape_target/             # External scrape targets

Adding a New Alert Sender

  1. Create sender type: pub struct MySender; impl AlertSender for MySender { ... }
  2. Implement Observability<MySender> for topologies that support it
  3. Create CRD types in crd/ subdirectory
  4. Implement AlertReceiver<MySender> for existing receivers
  5. Implement AlertRule<MySender> for AlertManagerRuleGroup
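Step 1 might look like the following sketch. The trait signature is simplified, and MySender is just the placeholder name from the list above.

```rust
use std::fmt::Debug;

// Simplified version of the AlertSender marker trait.
pub trait AlertSender: Send + Sync + Debug {
    fn name(&self) -> String;
}

// A new sender is a plain marker type implementing the trait.
#[derive(Debug, Clone)]
pub struct MySender;

impl AlertSender for MySender {
    fn name(&self) -> String {
        "my-sender".to_string()
    }
}
```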

Adding a New Alert Rule

pub fn my_custom_alert() -> PrometheusAlertRule {
    PrometheusAlertRule::new("MyAlert", "up == 0")
        .for_duration("5m")
        .label("severity", "critical")
        .annotation("summary", "Service is down")
}

No trait implementation is needed: AlertManagerRuleGroup already handles the conversion.
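A minimal builder with the call shape used above might look like this sketch. The real PrometheusAlertRule in alert_rule/ is richer; any field or method not shown in the example is an assumption here.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone)]
pub struct PrometheusAlertRule {
    pub alert: String,
    pub expr: String,
    pub for_: Option<String>, // `for` is a keyword, hence the underscore
    pub labels: BTreeMap<String, String>,
    pub annotations: BTreeMap<String, String>,
}

impl PrometheusAlertRule {
    pub fn new(alert: &str, expr: &str) -> Self {
        Self {
            alert: alert.to_string(),
            expr: expr.to_string(),
            for_: None,
            labels: BTreeMap::new(),
            annotations: BTreeMap::new(),
        }
    }

    pub fn for_duration(mut self, d: &str) -> Self {
        self.for_ = Some(d.to_string());
        self
    }

    pub fn label(mut self, k: &str, v: &str) -> Self {
        self.labels.insert(k.to_string(), v.to_string());
        self
    }

    pub fn annotation(mut self, k: &str, v: &str) -> Self {
        self.annotations.insert(k.to_string(), v.to_string());
        self
    }
}
```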

Related Decisions

  • ADR-013: Notification channel selection (ntfy)
  • ADR-011: Multi-tenant cluster architecture