Files
harmony/adr/020-monitoring-alerting-architecture.md
Jean-Gabriel Gill-Couture af6145afe3
All checks were successful
Run Check Script / check (pull_request) Successful in 1m23s
doc: monitoring module documentation
2026-03-09 18:33:35 -04:00

319 lines
12 KiB
Markdown

# Architecture Decision Record: Monitoring and Alerting Architecture
Initial Author: Willem Rolleman, Jean-Gabriel Carrier
Initial Date: March 9, 2026
Last Updated Date: March 9, 2026
## Status
Accepted
Supersedes: [ADR-010](010-monitoring-and-alerting.md)
## Context
Harmony needs a unified approach to monitoring and alerting across different infrastructure targets:
1. **Cluster-level monitoring**: Administrators managing entire Kubernetes/OKD clusters need to define cluster-wide alerts, receivers, and scrape targets.
2. **Tenant-level monitoring**: Multi-tenant clusters where teams are confined to namespaces need monitoring scoped to their resources.
3. **Application-level monitoring**: Developers deploying applications want zero-config monitoring that "just works" for their services.
The monitoring landscape is fragmented:
- **OKD/OpenShift**: Built-in Prometheus with AlertmanagerConfig CRDs
- **KubePrometheus**: Helm-based stack with PrometheusRule CRDs
- **RHOB (Red Hat Observability)**: Operator-based with MonitoringStack CRDs
- **Standalone Prometheus**: Raw Prometheus deployments
Each system has different CRDs, different installation methods, and different configuration APIs.
## Decision
We implement a **trait-based architecture with compile-time capability verification** that provides:
1. **Type-safe abstractions** via parameterized traits: `AlertReceiver<S>`, `AlertRule<S>`, `ScrapeTarget<S>`
2. **Compile-time topology compatibility** via the `Observability<S>` capability bound
3. **Three levels of abstraction**: Cluster, Tenant, and Application monitoring
4. **Pre-built alert rules** as functions that return typed structs
### Core Traits
```rust
// domain/topology/monitoring.rs
/// Marker trait for systems that send alerts (Prometheus, etc.)
pub trait AlertSender: Send + Sync + std::fmt::Debug {
fn name(&self) -> String;
}
/// Defines how a receiver (Discord, Slack, etc.) builds its configuration
/// for a specific sender type
pub trait AlertReceiver<S: AlertSender>: std::fmt::Debug + Send + Sync {
fn build(&self) -> Result<ReceiverInstallPlan, InterpretError>;
fn name(&self) -> String;
fn clone_box(&self) -> Box<dyn AlertReceiver<S>>;
}
/// Defines how an alert rule builds its PrometheusRule configuration
pub trait AlertRule<S: AlertSender>: std::fmt::Debug + Send + Sync {
fn build_rule(&self) -> Result<serde_json::Value, InterpretError>;
fn name(&self) -> String;
fn clone_box(&self) -> Box<dyn AlertRule<S>>;
}
/// Capability that topologies implement to support monitoring
pub trait Observability<S: AlertSender> {
async fn install_alert_sender(&self, sender: &S, inventory: &Inventory)
-> Result<PreparationOutcome, PreparationError>;
async fn install_receivers(&self, sender: &S, inventory: &Inventory,
receivers: Option<Vec<Box<dyn AlertReceiver<S>>>>) -> Result<...>;
async fn install_rules(&self, sender: &S, inventory: &Inventory,
rules: Option<Vec<Box<dyn AlertRule<S>>>>) -> Result<...>;
async fn add_scrape_targets(&self, sender: &S, inventory: &Inventory,
scrape_targets: Option<Vec<Box<dyn ScrapeTarget<S>>>>) -> Result<...>;
async fn ensure_monitoring_installed(&self, sender: &S, inventory: &Inventory)
-> Result<...>;
}
```
### Alert Sender Types
Each monitoring stack is a distinct `AlertSender`:
| Sender | Module | Use Case |
|--------|--------|----------|
| `OpenshiftClusterAlertSender` | `monitoring/okd/` | OKD/OpenShift built-in monitoring |
| `KubePrometheus` | `monitoring/kube_prometheus/` | Helm-deployed kube-prometheus-stack |
| `Prometheus` | `monitoring/prometheus/` | Standalone Prometheus via Helm |
| `RedHatClusterObservability` | `monitoring/red_hat_cluster_observability/` | RHOB operator |
| `Grafana` | `monitoring/grafana/` | Grafana-managed alerting |
### Three Levels of Monitoring
#### 1. Cluster-Level Monitoring
For cluster administrators. Full control over monitoring infrastructure.
```rust
// examples/okd_cluster_alerts/src/main.rs
OpenshiftClusterAlertScore {
sender: OpenshiftClusterAlertSender,
receivers: vec![Box::new(DiscordReceiver { ... })],
rules: vec![Box::new(alert_rules)],
scrape_targets: Some(vec![Box::new(external_exporters)]),
}
```
**Characteristics:**
- Cluster-scoped CRDs and resources
- Can add external scrape targets (outside cluster)
- Manages Alertmanager configuration
- Requires cluster-admin privileges
#### 2. Tenant-Level Monitoring
For teams confined to namespaces. The topology determines tenant context.
```rust
// The topology's Observability impl handles namespace scoping
impl Observability<KubePrometheus> for K8sAnywhereTopology {
async fn install_rules(&self, sender: &KubePrometheus, ...) {
// Topology knows if it's tenant-scoped
let namespace = self.get_tenant_config().await
.map(|t| t.name)
.unwrap_or("default");
// Install rules in tenant namespace
}
}
```
**Characteristics:**
- Namespace-scoped resources
- Cannot modify cluster-level monitoring config
- May have restricted receiver types
- Runtime validation of permissions (cannot be fully compile-time)
#### 3. Application-Level Monitoring
For developers. Zero-config, opinionated monitoring.
```rust
// modules/application/features/monitoring.rs
pub struct Monitoring {
pub application: Arc<dyn Application>,
pub alert_receiver: Vec<Box<dyn AlertReceiver<Prometheus>>>,
}
impl<T: Topology + Observability<Prometheus> + TenantManager + ...>
ApplicationFeature<T> for Monitoring
{
async fn ensure_installed(&self, topology: &T) -> Result<...> {
// Auto-creates ServiceMonitor
// Auto-installs Ntfy for notifications
// Handles tenant namespace automatically
// Wires up sensible defaults
}
}
```
**Characteristics:**
- Automatic ServiceMonitor creation
- Opinionated notification channel (Ntfy)
- Tenant-aware via topology
- Minimal configuration required
## Rationale
### Why Generic Traits Instead of Unified Types?
Each monitoring stack (OKD, KubePrometheus, RHOB) has fundamentally different CRDs:
```rust
// OKD uses AlertmanagerConfig with different structure
AlertmanagerConfig { spec: { receivers: [...] } }
// RHOB uses secret references for webhook URLs
MonitoringStack { spec: { alertmanagerConfig: { discordConfigs: [{ apiURL: { key: "..." } }] } } }
// KubePrometheus uses Alertmanager CRD with different field names
Alertmanager { spec: { config: { receivers: [...] } } }
```
A unified type would either:
1. Be a lowest-common-denominator (loses stack-specific features)
2. Be a complex union type (hard to use, easy to misconfigure)
Generic traits let each stack express its configuration naturally while providing a consistent interface.
### Why Compile-Time Capability Bounds?
```rust
impl<T: Topology + Observability<OpenshiftClusterAlertSender>> Score<T>
for OpenshiftClusterAlertScore { ... }
```
This fails at compile time if you try to use `OpenshiftClusterAlertScore` with a topology that doesn't support OKD monitoring. This prevents the "config-is-valid-but-platform-is-wrong" errors that Harmony was designed to eliminate.
### Why Not a MonitoringStack Abstraction (V2 Approach)?
The V2 approach proposed a unified `MonitoringStack` that hides sender selection:
```rust
// V2 approach - rejected
MonitoringStack::new(MonitoringApiVersion::V2CRD)
.add_alert_channel(discord)
```
**Problems:**
1. Hides which sender you're using, losing compile-time guarantees
2. "Version selection" actually chooses between fundamentally different systems
3. Would need to handle all stack-specific features through a generic interface
The current approach is explicit: you choose `OpenshiftClusterAlertSender` and the compiler verifies compatibility.
### Why Runtime Validation for Tenants?
Tenant confinement is determined at runtime by the topology and K8s RBAC. We cannot know at compile time whether a user has cluster-admin or namespace-only access.
Options considered:
1. **Compile-time tenant markers** - Would require modeling entire RBAC hierarchy in types. Over-engineering.
2. **Runtime validation** - Current approach. Fails with clear K8s permission errors if insufficient access.
3. **No tenant support** - Would exclude a major use case.
Runtime validation is the pragmatic choice. The failure mode is clear (K8s API error) and occurs early in execution.
> Note : we will eventually have compile time validation for such things. Rust macros are powerful and we could discover the actual capabilities we're dealing with, similar to sqlx approach in query! macros.
## Consequences
### Pros
1. **Type Safety**: Invalid configurations are caught at compile time
2. **Extensibility**: Adding a new monitoring stack requires implementing traits, not modifying core code
3. **Clear Separation**: Cluster/Tenant/Application levels have distinct entry points
4. **Reusable Rules**: Pre-built alert rules as functions (`high_pvc_fill_rate_over_two_days()`)
5. **CRD Accuracy**: Type definitions match actual Kubernetes CRDs exactly
### Cons
1. **Implementation Explosion**: `DiscordReceiver` implements `AlertReceiver<S>` for each sender type (3+ implementations)
2. **Learning Curve**: Understanding the trait hierarchy takes time
3. **clone_box Boilerplate**: Required for trait object cloning (3 lines per impl)
### Mitigations
- Implementation explosion is contained: each receiver type has O(senders) implementations, but receivers are rare compared to rules
- Learning curve is documented with examples at each level
- clone_box boilerplate is minimal and copy-paste
## Alternatives Considered
### Unified MonitoringStack Type
See "Why Not a MonitoringStack Abstraction" above. Rejected for losing compile-time safety.
### Helm-Only Approach
Use `HelmScore` directly for each monitoring deployment. Rejected because:
- No type safety for alert rules
- Cannot compose with application features
- No tenant awareness
### Separate Modules Per Use Case
Have `cluster_monitoring/`, `tenant_monitoring/`, `app_monitoring/` as separate modules. Rejected because:
- Massive code duplication
- No shared abstraction for receivers/rules
- Adding a feature requires three implementations
## Implementation Notes
### Module Structure
```
modules/monitoring/
├── mod.rs # Public exports
├── alert_channel/ # Receivers (Discord, Webhook)
├── alert_rule/ # Rules and pre-built alerts
│ ├── prometheus_alert_rule.rs
│ └── alerts/ # Library of pre-built rules
│ ├── k8s/ # K8s-specific (pvc, pod, memory)
│ └── infra/ # Infrastructure (opnsense, dell)
├── okd/ # OpenshiftClusterAlertSender
├── kube_prometheus/ # KubePrometheus
├── prometheus/ # Prometheus
├── red_hat_cluster_observability/ # RHOB
├── grafana/ # Grafana
├── application_monitoring/ # Application-level scores
└── scrape_target/ # External scrape targets
```
### Adding a New Alert Sender
1. Create sender type: `pub struct MySender; impl AlertSender for MySender { ... }`
2. Implement `Observability<MySender>` for topologies that support it
3. Create CRD types in `crd/` subdirectory
4. Implement `AlertReceiver<MySender>` for existing receivers
5. Implement `AlertRule<MySender>` for `AlertManagerRuleGroup`
### Adding a New Alert Rule
```rust
pub fn my_custom_alert() -> PrometheusAlertRule {
PrometheusAlertRule::new("MyAlert", "up == 0")
.for_duration("5m")
.label("severity", "critical")
.annotation("summary", "Service is down")
}
```
No trait implementation needed - `AlertManagerRuleGroup` already handles conversion.
## Related ADRs
- [ADR-013](013-monitoring-notifications.md): Notification channel selection (ntfy)
- [ADR-011](011-multi-tenant-cluster.md): Multi-tenant cluster architecture