# Architecture Decision Record: Monitoring and Alerting Architecture
Initial Author: Willem Rolleman, Jean-Gabriel Carrier

Initial Date: March 9, 2026

Last Updated Date: March 9, 2026
## Status
Accepted

Supersedes: [ADR-010](010-monitoring-and-alerting.md)
## Context
Harmony needs a unified approach to monitoring and alerting across different infrastructure targets:

1. **Cluster-level monitoring**: Administrators managing entire Kubernetes/OKD clusters need to define cluster-wide alerts, receivers, and scrape targets.
2. **Tenant-level monitoring**: On multi-tenant clusters, teams confined to namespaces need monitoring scoped to their own resources.
3. **Application-level monitoring**: Developers deploying applications want zero-config monitoring that "just works" for their services.

The monitoring landscape is fragmented:

- **OKD/OpenShift**: Built-in Prometheus with AlertmanagerConfig CRDs
- **KubePrometheus**: Helm-based stack with PrometheusRule CRDs
- **RHOB (Red Hat Observability)**: Operator-based with MonitoringStack CRDs
- **Standalone Prometheus**: Raw Prometheus deployments

Each system has different CRDs, different installation methods, and different configuration APIs.
## Decision
We implement a **trait-based architecture with compile-time capability verification** that provides:

1. **Type-safe abstractions** via parameterized traits: `AlertReceiver<S>`, `AlertRule<S>`, `ScrapeTarget<S>`
2. **Compile-time topology compatibility** via the `Observability<S>` capability bound
3. **Three levels of abstraction**: Cluster, Tenant, and Application monitoring
4. **Pre-built alert rules** as functions that return typed structs
### Core Traits
```rust
// domain/topology/monitoring.rs

/// Marker trait for systems that send alerts (Prometheus, etc.)
pub trait AlertSender: Send + Sync + std::fmt::Debug {
    fn name(&self) -> String;
}

/// Defines how a receiver (Discord, Slack, etc.) builds its configuration
/// for a specific sender type
pub trait AlertReceiver<S: AlertSender>: std::fmt::Debug + Send + Sync {
    fn build(&self) -> Result<ReceiverInstallPlan, InterpretError>;
    fn name(&self) -> String;
    fn clone_box(&self) -> Box<dyn AlertReceiver<S>>;
}

/// Defines how an alert rule builds its PrometheusRule configuration
pub trait AlertRule<S: AlertSender>: std::fmt::Debug + Send + Sync {
    fn build_rule(&self) -> Result<serde_json::Value, InterpretError>;
    fn name(&self) -> String;
    fn clone_box(&self) -> Box<dyn AlertRule<S>>;
}

/// Capability that topologies implement to support monitoring
pub trait Observability<S: AlertSender> {
    async fn install_alert_sender(&self, sender: &S, inventory: &Inventory)
        -> Result<PreparationOutcome, PreparationError>;

    async fn install_receivers(&self, sender: &S, inventory: &Inventory,
        receivers: Option<Vec<Box<dyn AlertReceiver<S>>>>) -> Result<...>;

    async fn install_rules(&self, sender: &S, inventory: &Inventory,
        rules: Option<Vec<Box<dyn AlertRule<S>>>>) -> Result<...>;

    async fn add_scrape_targets(&self, sender: &S, inventory: &Inventory,
        scrape_targets: Option<Vec<Box<dyn ScrapeTarget<S>>>>) -> Result<...>;

    async fn ensure_monitoring_installed(&self, sender: &S, inventory: &Inventory)
        -> Result<...>;
}
```
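The `clone_box` methods exist because `Clone` is not object-safe and therefore cannot be a supertrait of these traits. A minimal, simplified sketch of the pattern (non-async, with a hypothetical `DiscordReceiver` standing in for a real receiver) shows why the boilerplate pays off:

```rust
use std::fmt::Debug;

// Simplified stand-ins for the real traits (non-async, no Inventory).
pub trait AlertSender {}

#[derive(Debug)]
pub struct Prometheus;
impl AlertSender for Prometheus {}

pub trait AlertReceiver<S: AlertSender>: Debug {
    fn name(&self) -> String;
    // `Clone` is not object-safe, so each impl clones itself behind the box.
    fn clone_box(&self) -> Box<dyn AlertReceiver<S>>;
}

// The payoff: boxed receivers become `Clone`, so scores holding
// `Vec<Box<dyn AlertReceiver<S>>>` can themselves be cloned.
impl<S: AlertSender + 'static> Clone for Box<dyn AlertReceiver<S>> {
    fn clone(&self) -> Self {
        self.clone_box()
    }
}

// Hypothetical receiver used only for this sketch.
#[derive(Debug, Clone)]
pub struct DiscordReceiver {
    pub webhook_url: String,
}

impl AlertReceiver<Prometheus> for DiscordReceiver {
    fn name(&self) -> String {
        "discord".to_string()
    }
    fn clone_box(&self) -> Box<dyn AlertReceiver<Prometheus>> {
        Box::new(self.clone())
    }
}
```

With the blanket `Clone` impl in place, `.clone()` on a `Box<dyn AlertReceiver<S>>` dispatches to the concrete type's `clone_box`, which is the three lines of boilerplate discussed under Consequences below.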
### Alert Sender Types
Each monitoring stack is a distinct `AlertSender`:

| Sender | Module | Use Case |
|--------|--------|----------|
| `OpenshiftClusterAlertSender` | `monitoring/okd/` | OKD/OpenShift built-in monitoring |
| `KubePrometheus` | `monitoring/kube_prometheus/` | Helm-deployed kube-prometheus-stack |
| `Prometheus` | `monitoring/prometheus/` | Standalone Prometheus via Helm |
| `RedHatClusterObservability` | `monitoring/red_hat_cluster_observability/` | RHOB operator |
| `Grafana` | `monitoring/grafana/` | Grafana-managed alerting |
### Three Levels of Monitoring
#### 1. Cluster-Level Monitoring
For cluster administrators. Full control over monitoring infrastructure.

```rust
// examples/okd_cluster_alerts/src/main.rs
OpenshiftClusterAlertScore {
    sender: OpenshiftClusterAlertSender,
    receivers: vec![Box::new(DiscordReceiver { ... })],
    rules: vec![Box::new(alert_rules)],
    scrape_targets: Some(vec![Box::new(external_exporters)]),
}
```

**Characteristics:**

- Cluster-scoped CRDs and resources
- Can add external scrape targets (outside the cluster)
- Manages the Alertmanager configuration
- Requires cluster-admin privileges
#### 2. Tenant-Level Monitoring
For teams confined to namespaces. The topology determines the tenant context.

```rust
// The topology's Observability impl handles namespace scoping
impl Observability<KubePrometheus> for K8sAnywhereTopology {
    async fn install_rules(&self, sender: &KubePrometheus, ...) {
        // Topology knows if it's tenant-scoped
        let namespace = self.get_tenant_config().await
            .map(|t| t.name)
            .unwrap_or("default");
        // Install rules in the tenant namespace
    }
}
```

**Characteristics:**

- Namespace-scoped resources
- Cannot modify cluster-level monitoring config
- May have restricted receiver types
- Runtime validation of permissions (cannot be fully compile-time)
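The namespace-scoping decision the topology makes can be sketched as a small pure function. This is an illustrative sketch only: the `TenantConfig` shape and the `"default"` fallback are assumptions standing in for the topology's real tenant lookup.

```rust
// Illustrative stand-in for the topology's tenant configuration.
pub struct TenantConfig {
    pub name: String,
}

/// Resolve the namespace rules are installed into: the tenant's own
/// namespace when the topology is tenant-scoped, a fallback otherwise.
pub fn resolve_namespace(tenant: Option<&TenantConfig>) -> String {
    tenant
        .map(|t| t.name.clone())
        .unwrap_or_else(|| "default".to_string())
}
```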
#### 3. Application-Level Monitoring
For developers. Zero-config, opinionated monitoring.

```rust
// modules/application/features/monitoring.rs
pub struct Monitoring {
    pub application: Arc<dyn Application>,
    pub alert_receiver: Vec<Box<dyn AlertReceiver<Prometheus>>>,
}

impl<T: Topology + Observability<Prometheus> + TenantManager + ...>
    ApplicationFeature<T> for Monitoring
{
    async fn ensure_installed(&self, topology: &T) -> Result<...> {
        // Auto-creates ServiceMonitor
        // Auto-installs Ntfy for notifications
        // Handles tenant namespace automatically
        // Wires up sensible defaults
    }
}
```

**Characteristics:**

- Automatic ServiceMonitor creation
- Opinionated notification channel (Ntfy)
- Tenant-aware via topology
- Minimal configuration required
## Rationale
### Why Generic Traits Instead of Unified Types?
Each monitoring stack (OKD, KubePrometheus, RHOB) has fundamentally different CRDs:

```rust
// OKD uses AlertmanagerConfig with its own structure
AlertmanagerConfig { spec: { receivers: [...] } }

// RHOB uses secret references for webhook URLs
MonitoringStack { spec: { alertmanagerConfig: { discordConfigs: [{ apiURL: { key: "..." } }] } } }

// KubePrometheus uses the Alertmanager CRD with different field names
Alertmanager { spec: { config: { receivers: [...] } } }
```

A unified type would either:

1. Be a lowest common denominator (losing stack-specific features)
2. Be a complex union type (hard to use, easy to misconfigure)

Generic traits let each stack express its configuration naturally while providing a consistent interface.
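To make this concrete, here is a simplified sketch of how one logical receiver can target two senders and produce different output per stack. The payload strings are illustrative stand-ins, not the real `ReceiverInstallPlan` or CRD schemas:

```rust
// Illustrative sketch: sender types are zero-sized markers, and the
// trait returns a String instead of the real ReceiverInstallPlan.
pub trait AlertSender {}

pub struct OpenshiftClusterAlertSender;
impl AlertSender for OpenshiftClusterAlertSender {}

pub struct KubePrometheus;
impl AlertSender for KubePrometheus {}

pub trait AlertReceiver<S: AlertSender> {
    fn build(&self) -> String;
}

// One logical receiver...
pub struct DiscordReceiver {
    pub webhook_url: String,
}

// ...rendered as an AlertmanagerConfig-shaped document for OKD...
impl AlertReceiver<OpenshiftClusterAlertSender> for DiscordReceiver {
    fn build(&self) -> String {
        format!("AlertmanagerConfig/receivers: {}", self.webhook_url)
    }
}

// ...and as an Alertmanager-CRD-shaped document for KubePrometheus.
impl AlertReceiver<KubePrometheus> for DiscordReceiver {
    fn build(&self) -> String {
        format!("Alertmanager/spec.config.receivers: {}", self.webhook_url)
    }
}
```

The sender type parameter selects which `build` runs, so each stack's configuration shape stays local to its impl rather than leaking into a shared union type.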
### Why Compile-Time Capability Bounds?
```rust
impl<T: Topology + Observability<OpenshiftClusterAlertSender>> Score<T>
    for OpenshiftClusterAlertScore { ... }
```

This fails at compile time if you try to use `OpenshiftClusterAlertScore` with a topology that doesn't support OKD monitoring. This prevents the "config-is-valid-but-platform-is-wrong" errors that Harmony was designed to eliminate.
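A reduced, non-async sketch of the pattern (names simplified from the real traits, with a hypothetical `PlainTopology` lacking the capability) illustrates the guarantee:

```rust
// Reduced, non-async sketch of the capability-bound pattern.
pub trait AlertSender {}

pub struct OpenshiftClusterAlertSender;
impl AlertSender for OpenshiftClusterAlertSender {}

// Stand-in for the real Observability<S> capability.
pub trait Observability<S: AlertSender> {
    fn ensure_monitoring_installed(&self) -> Result<(), String>;
}

pub struct OkdTopology;
impl Observability<OpenshiftClusterAlertSender> for OkdTopology {
    fn ensure_monitoring_installed(&self) -> Result<(), String> {
        Ok(())
    }
}

// A topology with no Observability impl.
pub struct PlainTopology;

pub struct OpenshiftClusterAlertScore;

impl OpenshiftClusterAlertScore {
    // The bound is the whole point: apply() only exists for topologies
    // implementing the matching capability.
    pub fn apply<T: Observability<OpenshiftClusterAlertSender>>(
        &self,
        topology: &T,
    ) -> Result<(), String> {
        topology.ensure_monitoring_installed()
    }
}

// score.apply(&PlainTopology) would be rejected by the compiler:
// `Observability<OpenshiftClusterAlertSender>` is not implemented
// for `PlainTopology`.
```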
### Why Not a MonitoringStack Abstraction (V2 Approach)?
The V2 approach proposed a unified `MonitoringStack` that hides sender selection:

```rust
// V2 approach - rejected
MonitoringStack::new(MonitoringApiVersion::V2CRD)
    .add_alert_channel(discord)
```

**Problems:**

1. It hides which sender you're using, losing compile-time guarantees
2. "Version selection" actually chooses between fundamentally different systems
3. It would need to expose all stack-specific features through a generic interface

The current approach is explicit: you choose `OpenshiftClusterAlertSender` and the compiler verifies compatibility.
### Why Runtime Validation for Tenants?
Tenant confinement is determined at runtime by the topology and Kubernetes RBAC. We cannot know at compile time whether a user has cluster-admin or namespace-only access.

Options considered:

1. **Compile-time tenant markers** - Would require modeling the entire RBAC hierarchy in types. Over-engineering.
2. **Runtime validation** - The current approach. Fails with clear Kubernetes permission errors if access is insufficient.
3. **No tenant support** - Would exclude a major use case.

Runtime validation is the pragmatic choice. The failure mode is clear (a Kubernetes API error) and occurs early in execution.

> Note: we will eventually have compile-time validation for such things. Rust macros are powerful, and we could discover the actual capabilities we are dealing with, similar to the sqlx approach in its `query!` macros.
## Consequences
### Pros
1. **Type Safety**: Invalid configurations are caught at compile time
2. **Extensibility**: Adding a new monitoring stack requires implementing traits, not modifying core code
3. **Clear Separation**: Cluster/Tenant/Application levels have distinct entry points
4. **Reusable Rules**: Pre-built alert rules as functions (`high_pvc_fill_rate_over_two_days()`)
5. **CRD Accuracy**: Type definitions match the actual Kubernetes CRDs exactly
### Cons
1. **Implementation Explosion**: `DiscordReceiver` implements `AlertReceiver<S>` for each sender type (3+ implementations)
2. **Learning Curve**: Understanding the trait hierarchy takes time
3. **`clone_box` Boilerplate**: Required for trait-object cloning (3 lines per impl)
### Mitigations
- Implementation explosion is contained: each receiver type needs O(senders) implementations, but receiver types are rare compared to rules
- The learning curve is addressed with documented examples at each level
- The `clone_box` boilerplate is minimal and copy-paste
## Alternatives Considered
### Unified MonitoringStack Type
See "Why Not a MonitoringStack Abstraction" above. Rejected for losing compile-time safety.
### Helm-Only Approach
Use `HelmScore` directly for each monitoring deployment. Rejected because:

- No type safety for alert rules
- Cannot compose with application features
- No tenant awareness
### Separate Modules Per Use Case
Have `cluster_monitoring/`, `tenant_monitoring/`, and `app_monitoring/` as separate modules. Rejected because:

- Massive code duplication
- No shared abstraction for receivers and rules
- Adding a feature would require three implementations
## Implementation Notes
### Module Structure
```
modules/monitoring/
├── mod.rs                           # Public exports
├── alert_channel/                   # Receivers (Discord, Webhook)
├── alert_rule/                      # Rules and pre-built alerts
│   ├── prometheus_alert_rule.rs
│   └── alerts/                      # Library of pre-built rules
│       ├── k8s/                     # K8s-specific (pvc, pod, memory)
│       └── infra/                   # Infrastructure (opnsense, dell)
├── okd/                             # OpenshiftClusterAlertSender
├── kube_prometheus/                 # KubePrometheus
├── prometheus/                      # Prometheus
├── red_hat_cluster_observability/   # RHOB
├── grafana/                         # Grafana
├── application_monitoring/          # Application-level scores
└── scrape_target/                   # External scrape targets
```
### Adding a New Alert Sender
1. Create the sender type: `pub struct MySender; impl AlertSender for MySender { ... }`
2. Implement `Observability<MySender>` for topologies that support it
3. Create CRD types in a `crd/` subdirectory
4. Implement `AlertReceiver<MySender>` for existing receivers
5. Implement `AlertRule<MySender>` for `AlertManagerRuleGroup`
### Adding a New Alert Rule
```rust
pub fn my_custom_alert() -> PrometheusAlertRule {
    PrometheusAlertRule::new("MyAlert", "up == 0")
        .for_duration("5m")
        .label("severity", "critical")
        .annotation("summary", "Service is down")
}
```

No trait implementation is needed - `AlertManagerRuleGroup` already handles the conversion.
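For orientation, a hypothetical sketch of how such a consuming builder could be implemented. The field names, `r#for` representation, and vector-backed labels/annotations are assumptions for illustration, not the real `PrometheusAlertRule` definition:

```rust
// Hypothetical sketch of a consuming builder like the one used above.
#[derive(Debug, Clone)]
pub struct PrometheusAlertRule {
    pub alert: String,
    pub expr: String,
    pub r#for: Option<String>,
    pub labels: Vec<(String, String)>,
    pub annotations: Vec<(String, String)>,
}

impl PrometheusAlertRule {
    pub fn new(alert: &str, expr: &str) -> Self {
        Self {
            alert: alert.to_string(),
            expr: expr.to_string(),
            r#for: None,
            labels: Vec::new(),
            annotations: Vec::new(),
        }
    }

    // Each method takes `self` by value and returns it, which is what
    // allows the chained call style in the example above.
    pub fn for_duration(mut self, duration: &str) -> Self {
        self.r#for = Some(duration.to_string());
        self
    }

    pub fn label(mut self, key: &str, value: &str) -> Self {
        self.labels.push((key.to_string(), value.to_string()));
        self
    }

    pub fn annotation(mut self, key: &str, value: &str) -> Self {
        self.annotations.push((key.to_string(), value.to_string()));
        self
    }
}
```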
## Related ADRs
- [ADR-013](013-monitoring-notifications.md): Notification channel selection (ntfy)
- [ADR-011](011-multi-tenant-cluster.md): Multi-tenant cluster architecture