# Architecture Decision Record: Monitoring and Alerting Architecture
Initial Author: Willem Rolleman, Jean-Gabriel Carrier

Initial Date: March 9, 2026

Last Updated Date: March 9, 2026
## Status
Accepted

Supersedes: [ADR-010](010-monitoring-and-alerting.md)
## Context
Harmony needs a unified approach to monitoring and alerting across different infrastructure targets:

1. **Cluster-level monitoring**: Administrators managing entire Kubernetes/OKD clusters need to define cluster-wide alerts, receivers, and scrape targets.
2. **Tenant-level monitoring**: On multi-tenant clusters, teams confined to namespaces need monitoring scoped to their own resources.
3. **Application-level monitoring**: Developers deploying applications want zero-config monitoring that "just works" for their services.

The monitoring landscape is fragmented:

- **OKD/OpenShift**: Built-in Prometheus with AlertmanagerConfig CRDs
- **KubePrometheus**: Helm-based stack with PrometheusRule CRDs
- **RHOB (Red Hat Observability)**: Operator-based with MonitoringStack CRDs
- **Standalone Prometheus**: Raw Prometheus deployments

Each system has different CRDs, different installation methods, and different configuration APIs.
## Decision
We implement a **trait-based architecture with compile-time capability verification** that provides:

1. **Type-safe abstractions** via parameterized traits: `AlertReceiver<S>`, `AlertRule<S>`, `ScrapeTarget<S>`
2. **Compile-time topology compatibility** via the `Observability<S>` capability bound
3. **Three levels of abstraction**: Cluster, Tenant, and Application monitoring
4. **Pre-built alert rules** as functions that return typed structs
### Core Traits
```rust
// domain/topology/monitoring.rs

/// Marker trait for systems that send alerts (Prometheus, etc.)
pub trait AlertSender: Send + Sync + std::fmt::Debug {
    fn name(&self) -> String;
}

/// Defines how a receiver (Discord, Slack, etc.) builds its configuration
/// for a specific sender type
pub trait AlertReceiver<S: AlertSender>: std::fmt::Debug + Send + Sync {
    fn build(&self) -> Result<ReceiverInstallPlan, InterpretError>;
    fn name(&self) -> String;
    fn clone_box(&self) -> Box<dyn AlertReceiver<S>>;
}

/// Defines how an alert rule builds its PrometheusRule configuration
pub trait AlertRule<S: AlertSender>: std::fmt::Debug + Send + Sync {
    fn build_rule(&self) -> Result<serde_json::Value, InterpretError>;
    fn name(&self) -> String;
    fn clone_box(&self) -> Box<dyn AlertRule<S>>;
}

/// Capability that topologies implement to support monitoring
pub trait Observability<S: AlertSender> {
    async fn install_alert_sender(&self, sender: &S, inventory: &Inventory)
        -> Result<PreparationOutcome, PreparationError>;

    async fn install_receivers(&self, sender: &S, inventory: &Inventory,
        receivers: Option<Vec<Box<dyn AlertReceiver<S>>>>) -> Result<...>;

    async fn install_rules(&self, sender: &S, inventory: &Inventory,
        rules: Option<Vec<Box<dyn AlertRule<S>>>>) -> Result<...>;

    async fn add_scrape_targets(&self, sender: &S, inventory: &Inventory,
        scrape_targets: Option<Vec<Box<dyn ScrapeTarget<S>>>>) -> Result<...>;

    async fn ensure_monitoring_installed(&self, sender: &S, inventory: &Inventory)
        -> Result<...>;
}
```
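The `clone_box` methods exist because `Clone` is not object-safe and therefore cannot be a supertrait of these traits. A minimal, simplified sketch of the pattern (non-async, with a hypothetical `DiscordReceiver` standing in for a real receiver) shows why the boilerplate pays off:

```rust
use std::fmt::Debug;

// Simplified stand-ins for the real traits (non-async, no Inventory).
pub trait AlertSender {}

#[derive(Debug)]
pub struct Prometheus;
impl AlertSender for Prometheus {}

pub trait AlertReceiver<S: AlertSender>: Debug {
    fn name(&self) -> String;
    // `Clone` is not object-safe, so each impl clones itself behind the box.
    fn clone_box(&self) -> Box<dyn AlertReceiver<S>>;
}

// The payoff: boxed receivers become `Clone`, so scores holding
// `Vec<Box<dyn AlertReceiver<S>>>` can themselves be cloned.
impl<S: AlertSender + 'static> Clone for Box<dyn AlertReceiver<S>> {
    fn clone(&self) -> Self {
        self.clone_box()
    }
}

// Hypothetical receiver used only for this sketch.
#[derive(Debug, Clone)]
pub struct DiscordReceiver {
    pub webhook_url: String,
}

impl AlertReceiver<Prometheus> for DiscordReceiver {
    fn name(&self) -> String {
        "discord".to_string()
    }
    fn clone_box(&self) -> Box<dyn AlertReceiver<Prometheus>> {
        Box::new(self.clone())
    }
}
```

With the blanket `Clone` impl in place, `.clone()` on a `Box<dyn AlertReceiver<S>>` dispatches to the concrete type's `clone_box`, which is the three lines of boilerplate discussed under Consequences below.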
### Alert Sender Types
Each monitoring stack is a distinct `AlertSender`:

| Sender | Module | Use Case |
|--------|--------|----------|
| `OpenshiftClusterAlertSender` | `monitoring/okd/` | OKD/OpenShift built-in monitoring |
| `KubePrometheus` | `monitoring/kube_prometheus/` | Helm-deployed kube-prometheus-stack |
| `Prometheus` | `monitoring/prometheus/` | Standalone Prometheus via Helm |
| `RedHatClusterObservability` | `monitoring/red_hat_cluster_observability/` | RHOB operator |
| `Grafana` | `monitoring/grafana/` | Grafana-managed alerting |
### Three Levels of Monitoring
#### 1. Cluster-Level Monitoring
For cluster administrators. Full control over monitoring infrastructure.

```rust
// examples/okd_cluster_alerts/src/main.rs
OpenshiftClusterAlertScore {
    sender: OpenshiftClusterAlertSender,
    receivers: vec![Box::new(DiscordReceiver { ... })],
    rules: vec![Box::new(alert_rules)],
    scrape_targets: Some(vec![Box::new(external_exporters)]),
}
```

**Characteristics:**

- Cluster-scoped CRDs and resources
- Can add external scrape targets (outside the cluster)
- Manages the Alertmanager configuration
- Requires cluster-admin privileges
#### 2. Tenant-Level Monitoring
For teams confined to namespaces. The topology determines the tenant context.

```rust
// The topology's Observability impl handles namespace scoping
impl Observability<KubePrometheus> for K8sAnywhereTopology {
    async fn install_rules(&self, sender: &KubePrometheus, ...) {
        // Topology knows if it's tenant-scoped
        let namespace = self.get_tenant_config().await
            .map(|t| t.name)
            .unwrap_or("default");
        // Install rules in the tenant namespace
    }
}
```

**Characteristics:**

- Namespace-scoped resources
- Cannot modify cluster-level monitoring config
- May have restricted receiver types
- Runtime validation of permissions (cannot be fully compile-time)
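The namespace-scoping decision the topology makes can be sketched as a small pure function. This is an illustrative sketch only: the `TenantConfig` shape and the `"default"` fallback are assumptions standing in for the topology's real tenant lookup.

```rust
// Illustrative stand-in for the topology's tenant configuration.
pub struct TenantConfig {
    pub name: String,
}

/// Resolve the namespace rules are installed into: the tenant's own
/// namespace when the topology is tenant-scoped, a fallback otherwise.
pub fn resolve_namespace(tenant: Option<&TenantConfig>) -> String {
    tenant
        .map(|t| t.name.clone())
        .unwrap_or_else(|| "default".to_string())
}
```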
#### 3. Application-Level Monitoring
For developers. Zero-config, opinionated monitoring.

```rust
// modules/application/features/monitoring.rs
pub struct Monitoring {
    pub application: Arc<dyn Application>,
    pub alert_receiver: Vec<Box<dyn AlertReceiver<Prometheus>>>,
}

impl<T: Topology + Observability<Prometheus> + TenantManager + ...>
    ApplicationFeature<T> for Monitoring
{
    async fn ensure_installed(&self, topology: &T) -> Result<...> {
        // Auto-creates ServiceMonitor
        // Auto-installs Ntfy for notifications
        // Handles tenant namespace automatically
        // Wires up sensible defaults
    }
}
```

**Characteristics:**

- Automatic ServiceMonitor creation
- Opinionated notification channel (Ntfy)
- Tenant-aware via topology
- Minimal configuration required
## Rationale
### Why Generic Traits Instead of Unified Types?
Each monitoring stack (OKD, KubePrometheus, RHOB) has fundamentally different CRDs:

```rust
// OKD uses AlertmanagerConfig with its own structure
AlertmanagerConfig { spec: { receivers: [...] } }

// RHOB uses secret references for webhook URLs
MonitoringStack { spec: { alertmanagerConfig: { discordConfigs: [{ apiURL: { key: "..." } }] } } }

// KubePrometheus uses the Alertmanager CRD with different field names
Alertmanager { spec: { config: { receivers: [...] } } }
```

A unified type would either:

1. Be a lowest common denominator (losing stack-specific features)
2. Be a complex union type (hard to use, easy to misconfigure)

Generic traits let each stack express its configuration naturally while providing a consistent interface.
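To make this concrete, here is a simplified sketch of how one logical receiver can target two senders and produce different output per stack. The payload strings are illustrative stand-ins, not the real `ReceiverInstallPlan` or CRD schemas:

```rust
// Illustrative sketch: sender types are zero-sized markers, and the
// trait returns a String instead of the real ReceiverInstallPlan.
pub trait AlertSender {}

pub struct OpenshiftClusterAlertSender;
impl AlertSender for OpenshiftClusterAlertSender {}

pub struct KubePrometheus;
impl AlertSender for KubePrometheus {}

pub trait AlertReceiver<S: AlertSender> {
    fn build(&self) -> String;
}

// One logical receiver...
pub struct DiscordReceiver {
    pub webhook_url: String,
}

// ...rendered as an AlertmanagerConfig-shaped document for OKD...
impl AlertReceiver<OpenshiftClusterAlertSender> for DiscordReceiver {
    fn build(&self) -> String {
        format!("AlertmanagerConfig/receivers: {}", self.webhook_url)
    }
}

// ...and as an Alertmanager-CRD-shaped document for KubePrometheus.
impl AlertReceiver<KubePrometheus> for DiscordReceiver {
    fn build(&self) -> String {
        format!("Alertmanager/spec.config.receivers: {}", self.webhook_url)
    }
}
```

The sender type parameter selects which `build` runs, so each stack's configuration shape stays local to its impl rather than leaking into a shared union type.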
### Why Compile-Time Capability Bounds?
```rust
impl<T: Topology + Observability<OpenshiftClusterAlertSender>> Score<T>
    for OpenshiftClusterAlertScore { ... }
```

This fails at compile time if you try to use `OpenshiftClusterAlertScore` with a topology that doesn't support OKD monitoring. This prevents the "config-is-valid-but-platform-is-wrong" errors that Harmony was designed to eliminate.
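A reduced, non-async sketch of the pattern (names simplified from the real traits, with a hypothetical `PlainTopology` lacking the capability) illustrates the guarantee:

```rust
// Reduced, non-async sketch of the capability-bound pattern.
pub trait AlertSender {}

pub struct OpenshiftClusterAlertSender;
impl AlertSender for OpenshiftClusterAlertSender {}

// Stand-in for the real Observability<S> capability.
pub trait Observability<S: AlertSender> {
    fn ensure_monitoring_installed(&self) -> Result<(), String>;
}

pub struct OkdTopology;
impl Observability<OpenshiftClusterAlertSender> for OkdTopology {
    fn ensure_monitoring_installed(&self) -> Result<(), String> {
        Ok(())
    }
}

// A topology with no Observability impl.
pub struct PlainTopology;

pub struct OpenshiftClusterAlertScore;

impl OpenshiftClusterAlertScore {
    // The bound is the whole point: apply() only exists for topologies
    // implementing the matching capability.
    pub fn apply<T: Observability<OpenshiftClusterAlertSender>>(
        &self,
        topology: &T,
    ) -> Result<(), String> {
        topology.ensure_monitoring_installed()
    }
}

// score.apply(&PlainTopology) would be rejected by the compiler:
// `Observability<OpenshiftClusterAlertSender>` is not implemented
// for `PlainTopology`.
```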
### Why Not a MonitoringStack Abstraction (V2 Approach)?
The V2 approach proposed a unified `MonitoringStack` that hides sender selection:

```rust
// V2 approach - rejected
MonitoringStack::new(MonitoringApiVersion::V2CRD)
    .add_alert_channel(discord)
```

**Problems:**

1. It hides which sender you're using, losing compile-time guarantees
2. "Version selection" actually chooses between fundamentally different systems
3. It would need to expose all stack-specific features through a generic interface

The current approach is explicit: you choose `OpenshiftClusterAlertSender` and the compiler verifies compatibility.
### Why Runtime Validation for Tenants?
Tenant confinement is determined at runtime by the topology and Kubernetes RBAC. We cannot know at compile time whether a user has cluster-admin or namespace-only access.

Options considered:

1. **Compile-time tenant markers** - Would require modeling the entire RBAC hierarchy in types. Over-engineering.
2. **Runtime validation** - The current approach. Fails with clear Kubernetes permission errors if access is insufficient.
3. **No tenant support** - Would exclude a major use case.

Runtime validation is the pragmatic choice. The failure mode is clear (a Kubernetes API error) and occurs early in execution.

> Note: we will eventually have compile-time validation for such things. Rust macros are powerful, and we could discover the actual capabilities we are dealing with, similar to the sqlx approach in its `query!` macros.
## Consequences
### Pros
1. **Type Safety**: Invalid configurations are caught at compile time
2. **Extensibility**: Adding a new monitoring stack requires implementing traits, not modifying core code
3. **Clear Separation**: Cluster/Tenant/Application levels have distinct entry points
4. **Reusable Rules**: Pre-built alert rules as functions (`high_pvc_fill_rate_over_two_days()`)
5. **CRD Accuracy**: Type definitions match the actual Kubernetes CRDs exactly
### Cons
1. **Implementation Explosion**: `DiscordReceiver` implements `AlertReceiver<S>` for each sender type (3+ implementations)
2. **Learning Curve**: Understanding the trait hierarchy takes time
3. **`clone_box` Boilerplate**: Required for trait-object cloning (3 lines per impl)
### Mitigations
- Implementation explosion is contained: each receiver type needs O(senders) implementations, but receiver types are rare compared to rules
- The learning curve is addressed with documented examples at each level
- The `clone_box` boilerplate is minimal and copy-paste
## Alternatives Considered
### Unified MonitoringStack Type
See "Why Not a MonitoringStack Abstraction" above. Rejected for losing compile-time safety.
### Helm-Only Approach
Use `HelmScore` directly for each monitoring deployment. Rejected because:

- No type safety for alert rules
- Cannot compose with application features
- No tenant awareness
### Separate Modules Per Use Case
Have `cluster_monitoring/`, `tenant_monitoring/`, and `app_monitoring/` as separate modules. Rejected because:

- Massive code duplication
- No shared abstraction for receivers and rules
- Adding a feature would require three implementations
## Implementation Notes
### Module Structure
```
modules/monitoring/
├── mod.rs                           # Public exports
├── alert_channel/                   # Receivers (Discord, Webhook)
├── alert_rule/                      # Rules and pre-built alerts
│   ├── prometheus_alert_rule.rs
│   └── alerts/                      # Library of pre-built rules
│       ├── k8s/                     # K8s-specific (pvc, pod, memory)
│       └── infra/                   # Infrastructure (opnsense, dell)
├── okd/                             # OpenshiftClusterAlertSender
├── kube_prometheus/                 # KubePrometheus
├── prometheus/                      # Prometheus
├── red_hat_cluster_observability/   # RHOB
├── grafana/                         # Grafana
├── application_monitoring/          # Application-level scores
└── scrape_target/                   # External scrape targets
```
### Adding a New Alert Sender
1. Create the sender type: `pub struct MySender; impl AlertSender for MySender { ... }`
2. Implement `Observability<MySender>` for topologies that support it
3. Create CRD types in a `crd/` subdirectory
4. Implement `AlertReceiver<MySender>` for existing receivers
5. Implement `AlertRule<MySender>` for `AlertManagerRuleGroup`
### Adding a New Alert Rule
```rust
pub fn my_custom_alert() -> PrometheusAlertRule {
    PrometheusAlertRule::new("MyAlert", "up == 0")
        .for_duration("5m")
        .label("severity", "critical")
        .annotation("summary", "Service is down")
}
```

No trait implementation is needed - `AlertManagerRuleGroup` already handles the conversion.
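For orientation, a hypothetical sketch of how such a consuming builder could be implemented. The field names, `r#for` representation, and vector-backed labels/annotations are assumptions for illustration, not the real `PrometheusAlertRule` definition:

```rust
// Hypothetical sketch of a consuming builder like the one used above.
#[derive(Debug, Clone)]
pub struct PrometheusAlertRule {
    pub alert: String,
    pub expr: String,
    pub r#for: Option<String>,
    pub labels: Vec<(String, String)>,
    pub annotations: Vec<(String, String)>,
}

impl PrometheusAlertRule {
    pub fn new(alert: &str, expr: &str) -> Self {
        Self {
            alert: alert.to_string(),
            expr: expr.to_string(),
            r#for: None,
            labels: Vec::new(),
            annotations: Vec::new(),
        }
    }

    // Each method takes `self` by value and returns it, which is what
    // allows the chained call style in the example above.
    pub fn for_duration(mut self, duration: &str) -> Self {
        self.r#for = Some(duration.to_string());
        self
    }

    pub fn label(mut self, key: &str, value: &str) -> Self {
        self.labels.push((key.to_string(), value.to_string()));
        self
    }

    pub fn annotation(mut self, key: &str, value: &str) -> Self {
        self.annotations.push((key.to_string(), value.to_string()));
        self
    }
}
```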
## Related ADRs
- [ADR-013](013-monitoring-notifications.md): Notification channel selection (ntfy)
- [ADR-011](011-multi-tenant-cluster.md): Multi-tenant cluster architecture