# Monitoring and Alerting in Harmony

Harmony provides a unified, type-safe approach to monitoring and alerting across Kubernetes, OpenShift, and bare-metal infrastructure. This guide explains the architecture and how to use it at different levels of abstraction.

## Overview

Harmony's monitoring module supports three distinct use cases:

| Level | Who Uses It | What It Provides |
|-------|-------------|------------------|
| **Cluster** | Cluster administrators | Full control over monitoring stack, cluster-wide alerts, external scrape targets |
| **Tenant** | Platform teams | Namespace-scoped monitoring in multi-tenant environments |
| **Application** | Application developers | Zero-config monitoring that "just works" |

Each level builds on the same underlying abstractions, ensuring consistency while providing appropriate complexity for each audience.

## Core Concepts

### AlertSender

An `AlertSender` represents the system that evaluates alert rules and sends notifications. Harmony supports multiple monitoring stacks:

| Sender | Description | Use When |
|--------|-------------|----------|
| `OpenshiftClusterAlertSender` | OKD/OpenShift built-in monitoring | Running on OKD/OpenShift |
| `KubePrometheus` | kube-prometheus-stack via Helm | Standard Kubernetes, need full stack |
| `Prometheus` | Standalone Prometheus | Custom Prometheus deployment |
| `RedHatClusterObservability` | RHOB operator | Red Hat managed clusters |
| `Grafana` | Grafana-managed alerting | Grafana as primary alerting layer |

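
The trait itself is small. A minimal sketch, inferred from the `MySender` example in "Adding a New Monitoring Stack" below (the real definition may carry additional methods):

```rust
// Marker trait for a monitoring stack that evaluates rules and sends alerts.
pub trait AlertSender {
    fn name(&self) -> String;
}
```
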
### AlertReceiver

An `AlertReceiver` defines where alerts are sent (Discord, Slack, email, webhook, etc.). Receivers are parameterized by sender type because each monitoring stack has different configuration formats.

```rust
pub trait AlertReceiver<S: AlertSender> {
    fn build(&self) -> Result<ReceiverInstallPlan, InterpretError>;
    fn name(&self) -> String;
}
```

Built-in receivers:

- `DiscordReceiver` - Discord webhooks
- `WebhookReceiver` - Generic HTTP webhooks

### AlertRule

An `AlertRule` defines a Prometheus alert expression. Rules are also parameterized by sender to handle different CRD formats.

```rust
pub trait AlertRule<S: AlertSender> {
    fn build_rule(&self) -> Result<serde_json::Value, InterpretError>;
    fn name(&self) -> String;
}
```

### Observability Capability

Topologies implement `Observability<S>` to indicate they support a specific alert sender:

```rust
impl Observability<OpenshiftClusterAlertSender> for K8sAnywhereTopology {
    async fn install_receivers(&self, sender, inventory, receivers) { ... }
    async fn install_rules(&self, sender, inventory, rules) { ... }
    // ...
}
```

This provides **compile-time verification**: if you try to use `OpenshiftClusterAlertScore` with a topology that doesn't implement `Observability<OpenshiftClusterAlertSender>`, the code won't compile.

---

## Level 1: Cluster Monitoring

Cluster monitoring is for administrators who need full control over the monitoring infrastructure. This includes:

- Installing/managing the monitoring stack
- Configuring cluster-wide alert receivers
- Defining cluster-level alert rules
- Adding external scrape targets (e.g., bare-metal servers, firewalls)

### Example: OKD Cluster Alerts

```rust
use harmony::{
    modules::monitoring::{
        alert_channel::discord_alert_channel::DiscordReceiver,
        alert_rule::{
            alerts::k8s::pvc::high_pvc_fill_rate_over_two_days,
            prometheus_alert_rule::AlertManagerRuleGroup,
        },
        okd::openshift_cluster_alerting_score::OpenshiftClusterAlertScore,
        scrape_target::prometheus_node_exporter::PrometheusNodeExporter,
    },
    topology::{K8sAnywhereTopology, monitoring::{AlertMatcher, AlertRoute, MatchOp}},
};

// Route only alerts labeled severity=critical to this receiver
let severity_matcher = AlertMatcher {
    label: "severity".to_string(),
    operator: MatchOp::Eq,
    value: "critical".to_string(),
};

let rule_group = AlertManagerRuleGroup::new(
    "cluster-rules",
    vec![high_pvc_fill_rate_over_two_days()],
);

// External (non-Kubernetes) scrape target, e.g. a firewall running node_exporter
let external_exporter = PrometheusNodeExporter {
    job_name: "firewall".to_string(),
    metrics_path: "/metrics".to_string(),
    listen_address: ip!("192.168.1.1"),
    port: 9100,
    ..Default::default()
};

harmony_cli::run(
    Inventory::autoload(),
    K8sAnywhereTopology::from_env(),
    vec![Box::new(OpenshiftClusterAlertScore {
        sender: OpenshiftClusterAlertSender,
        receivers: vec![Box::new(DiscordReceiver {
            name: "critical-alerts".to_string(),
            url: hurl!("https://discord.com/api/webhooks/..."),
            route: AlertRoute {
                matchers: vec![severity_matcher],
                ..AlertRoute::default("critical-alerts".to_string())
            },
        })],
        rules: vec![Box::new(rule_group)],
        scrape_targets: Some(vec![Box::new(external_exporter)]),
    })],
    None,
).await?;
```

### What This Does

1. **Enables cluster monitoring** - Activates OKD's built-in Prometheus
2. **Enables user workload monitoring** - Allows namespace-scoped rules
3. **Configures Alertmanager** - Adds Discord receiver with route matching
4. **Deploys alert rules** - Creates `AlertingRule` CRD with PVC fill rate alert
5. **Adds external scrape target** - Configures Prometheus to scrape the firewall

### Compile-Time Safety

The `OpenshiftClusterAlertScore` requires:

```rust
impl<T: Topology + Observability<OpenshiftClusterAlertSender>> Score<T>
    for OpenshiftClusterAlertScore
```

If `K8sAnywhereTopology` didn't implement `Observability<OpenshiftClusterAlertSender>`, this code would fail to compile. You cannot accidentally deploy OKD alerts to a cluster that doesn't support them.

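
For illustration, using the score with a topology that lacks the capability fails with a standard trait-bound error. `UnsupportedTopology` here is a hypothetical type, and the diagnostic is abbreviated:

```rust
// Hypothetical topology with no Observability<OpenshiftClusterAlertSender> impl:
struct UnsupportedTopology;

// Using OpenshiftClusterAlertScore as a Score<UnsupportedTopology> is
// rejected at compile time with something like:
//
//   error[E0277]: the trait bound `UnsupportedTopology:
//   Observability<OpenshiftClusterAlertSender>` is not satisfied
```
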
---
## Level 2: Tenant Monitoring

In multi-tenant clusters, teams are often confined to specific namespaces. Tenant monitoring adapts to this constraint:

- Resources are deployed in the tenant's namespace
- Tenants cannot modify cluster-level monitoring configuration
- The topology determines the namespace context at runtime

### How It Works

The topology's `Observability` implementation handles tenant scoping:

```rust
impl Observability<KubePrometheus> for K8sAnywhereTopology {
    async fn install_rules(&self, sender, inventory, rules) {
        // Topology knows if it's tenant-scoped
        let namespace = self.get_tenant_config().await
            .map(|t| t.name)
            .unwrap_or_else(|| "monitoring".to_string());

        // Rules are installed in the appropriate namespace
        for rule in rules.unwrap_or_default() {
            let score = KubePrometheusRuleScore {
                sender: sender.clone(),
                rule,
                namespace: namespace.clone(), // Tenant namespace
            };
            score.create_interpret().execute(inventory, self).await?;
        }
    }
}
```

### Tenant vs Cluster Resources

| Resource | Cluster-Level | Tenant-Level |
|----------|---------------|--------------|
| Alertmanager config | Global receivers | Namespaced receivers (where supported) |
| PrometheusRules | Cluster-wide alerts | Namespace alerts only |
| ServiceMonitors | Any namespace | Own namespace only |
| External scrape targets | Can add | Cannot add (cluster config) |

### Runtime Validation

Tenant constraints are validated at runtime via Kubernetes RBAC. If a tenant-scoped deployment attempts cluster-level operations, it fails with a clear permission error from the Kubernetes API.

This check cannot be fully performed at compile time because the tenant context depends on who is running the code and what permissions they have; that information is only available at runtime.

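
For example, a tenant-scoped deployment that tries to create a cluster-scoped rule object is rejected by the API server. The service account and namespace below are hypothetical, but the message follows the usual shape of a Kubernetes RBAC diagnostic:

```
Error from server (Forbidden): prometheusrules.monitoring.coreos.com is forbidden:
User "system:serviceaccount:tenant-a:harmony" cannot create resource "prometheusrules"
in API group "monitoring.coreos.com" at the cluster scope
```
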
---
## Level 3: Application Monitoring

Application monitoring provides zero-config, opinionated monitoring for developers. Just add the `Monitoring` feature to your application and it works.

### Example

```rust
use std::sync::Arc;

use harmony::modules::{
    application::{Application, ApplicationFeature},
    monitoring::alert_channel::webhook_receiver::WebhookReceiver,
};

// Define your application
let my_app = Arc::new(MyApplication::new());

// Add monitoring as a feature
let monitoring = Monitoring {
    application: my_app.clone(),
    alert_receiver: vec![], // Uses defaults
};

// Install with the application
my_app.add_feature(monitoring);
```

### What Application Monitoring Provides

1. **Automatic ServiceMonitor** - Creates a ServiceMonitor for your application's pods
2. **Ntfy Notification Channel** - Auto-installs and configures Ntfy for push notifications
3. **Tenant Awareness** - Automatically scopes to the correct namespace
4. **Sensible Defaults** - Pre-configured alert routes and receivers

### Under the Hood

```rust
impl<T: Topology + Observability<Prometheus> + TenantManager>
    ApplicationFeature<T> for Monitoring
{
    async fn ensure_installed(&self, topology: &T) -> Result<...> {
        // 1. Get tenant namespace (or use app name)
        let namespace = topology.get_tenant_config().await
            .map(|ns| ns.name.clone())
            .unwrap_or_else(|| self.application.name());

        // 2. Create ServiceMonitor for the app
        let app_service_monitor = ServiceMonitor {
            metadata: ObjectMeta {
                name: Some(self.application.name()),
                namespace: Some(namespace.clone()),
                ..Default::default()
            },
            spec: ServiceMonitorSpec::default(),
        };

        // 3. Install Ntfy for notifications
        let ntfy = NtfyScore { namespace, host };
        ntfy.interpret(&Inventory::empty(), topology).await?;

        // 4. Wire up webhook receiver to Ntfy
        let ntfy_receiver = WebhookReceiver { ... };

        // 5. Execute monitoring score
        alerting_score.interpret(&Inventory::empty(), topology).await?;
    }
}
```

---

## Pre-Built Alert Rules

Harmony provides a library of common alert rules in `modules/monitoring/alert_rule/alerts/`:

### Kubernetes Alerts (`alerts/k8s/`)

```rust
use harmony::modules::monitoring::alert_rule::alerts::k8s::{
    pod::pod_failed,
    pvc::high_pvc_fill_rate_over_two_days,
    memory_usage::alert_high_memory_usage,
};

let rules = AlertManagerRuleGroup::new("k8s-rules", vec![
    pod_failed(),
    high_pvc_fill_rate_over_two_days(),
    alert_high_memory_usage(),
]);
```

Available rules:

- `pod_failed()` - Pod in failed state
- `alert_container_restarting()` - Container restart loop
- `alert_pod_not_ready()` - Pod not ready for extended period
- `high_pvc_fill_rate_over_two_days()` - PVC will fill within 2 days
- `alert_high_memory_usage()` - Memory usage above threshold
- `alert_high_cpu_usage()` - CPU usage above threshold

### Infrastructure Alerts (`alerts/infra/`)

```rust
use harmony::modules::monitoring::alert_rule::alerts::infra::opnsense::high_http_error_rate;

let rules = AlertManagerRuleGroup::new("infra-rules", vec![
    high_http_error_rate(),
]);
```

### Creating Custom Rules

```rust
use harmony::modules::monitoring::alert_rule::prometheus_alert_rule::PrometheusAlertRule;

pub fn my_custom_alert() -> PrometheusAlertRule {
    PrometheusAlertRule::new("MyServiceDown", "up{job=\"my-service\"} == 0")
        .for_duration("5m")
        .label("severity", "critical")
        .annotation("summary", "My service is down")
        .annotation("description", "The my-service job has been down for more than 5 minutes")
}
```
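
Custom rules drop into the same group mechanism as the pre-built ones, so the two can be mixed freely:

```rust
// Group the custom rule alongside a built-in one
let rules = AlertManagerRuleGroup::new("custom-rules", vec![
    my_custom_alert(),
    pod_failed(),
]);
```
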
---

## Alert Receivers

### Discord Webhook

```rust
use harmony::modules::monitoring::alert_channel::discord_alert_channel::DiscordReceiver;
use harmony::topology::monitoring::{AlertRoute, AlertMatcher, MatchOp};

let discord = DiscordReceiver {
    name: "ops-alerts".to_string(),
    url: hurl!("https://discord.com/api/webhooks/123456/abcdef"),
    route: AlertRoute {
        receiver: "ops-alerts".to_string(),
        matchers: vec![AlertMatcher {
            label: "severity".to_string(),
            operator: MatchOp::Eq,
            value: "critical".to_string(),
        }],
        group_by: vec!["alertname".to_string()],
        repeat_interval: Some("30m".to_string()),
        continue_matching: false,
        children: vec![],
    },
};
```

### Generic Webhook

```rust
use harmony::modules::monitoring::alert_channel::webhook_receiver::WebhookReceiver;

let webhook = WebhookReceiver {
    name: "custom-webhook".to_string(),
    url: hurl!("https://api.example.com/alerts"),
    route: AlertRoute::default("custom-webhook".to_string()),
};
```

---

## Adding a New Monitoring Stack

To add support for a new monitoring stack:

1. **Create the sender type** in `modules/monitoring/my_sender/mod.rs`:

   ```rust
   #[derive(Debug, Clone)]
   pub struct MySender;

   impl AlertSender for MySender {
       fn name(&self) -> String { "MySender".to_string() }
   }
   ```

2. **Define CRD types** in `modules/monitoring/my_sender/crd/`:

   ```rust
   #[derive(CustomResource, Debug, Serialize, Deserialize, Clone)]
   #[kube(group = "monitoring.example.com", version = "v1", kind = "MyAlertRule")]
   pub struct MyAlertRuleSpec { ... }
   ```

3. **Implement Observability** in `domain/topology/k8s_anywhere/observability/my_sender.rs`:

   ```rust
   impl Observability<MySender> for K8sAnywhereTopology {
       async fn install_receivers(&self, sender, inventory, receivers) { ... }
       async fn install_rules(&self, sender, inventory, rules) { ... }
       // ...
   }
   ```

4. **Implement receiver conversions** for existing receivers:

   ```rust
   impl AlertReceiver<MySender> for DiscordReceiver {
       fn build(&self) -> Result<ReceiverInstallPlan, InterpretError> {
           // Convert DiscordReceiver to MySender's format
       }
   }
   ```

5. **Create score types**:

   ```rust
   pub struct MySenderAlertScore {
       pub sender: MySender,
       pub receivers: Vec<Box<dyn AlertReceiver<MySender>>>,
       pub rules: Vec<Box<dyn AlertRule<MySender>>>,
   }
   ```
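
With those pieces in place, the new stack is driven exactly like the OKD example earlier in this guide. A sketch, assuming `MySenderAlertScore` also gets a `Score<T>` impl bounded on `Observability<MySender>` (the receiver values below are hypothetical):

```rust
harmony_cli::run(
    Inventory::autoload(),
    K8sAnywhereTopology::from_env(),
    vec![Box::new(MySenderAlertScore {
        sender: MySender,
        receivers: vec![Box::new(WebhookReceiver {
            name: "my-webhook".to_string(),
            url: hurl!("https://api.example.com/alerts"),
            route: AlertRoute::default("my-webhook".to_string()),
        })],
        rules: vec![],
    })],
    None,
).await?;
```
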
---

## Architecture Principles

### Type Safety Over Flexibility

Each monitoring stack has distinct CRDs and configuration formats. Rather than a unified "MonitoringStack" type that loses stack-specific features, we use generic traits that provide type safety while allowing each stack to express its unique configuration.

### Compile-Time Capability Verification

The `Observability<S>` bound ensures you can't deploy OKD alerts to a KubePrometheus cluster. The compiler catches platform mismatches before deployment.

### Explicit Over Implicit

Monitoring stacks are chosen explicitly (`OpenshiftClusterAlertSender` vs `KubePrometheus`). There's no "auto-detection" that could lead to surprising behavior.

### Three Levels, One Foundation

Cluster, tenant, and application monitoring all use the same traits (`AlertSender`, `AlertReceiver`, `AlertRule`). The difference is in how scores are constructed and how topologies interpret them.

---

## Related Documentation

- [ADR-020: Monitoring and Alerting Architecture](../adr/020-monitoring-alerting-architecture.md)
- [ADR-013: Monitoring Notifications (ntfy)](../adr/013-monitoring-notifications.md)
- [ADR-011: Multi-Tenant Cluster Architecture](../adr/011-multi-tenant-cluster.md)
- [Coding Guide](coding-guide.md)
- [Core Concepts](concepts.md)