Architecture Decision Record: Monitoring and Alerting Architecture
Initial Author: Willem Rolleman, Jean-Gabriel Carrier
Initial Date: March 9, 2026
Last Updated Date: March 9, 2026
Status
Accepted
Supersedes: ADR-010
Context
Harmony needs a unified approach to monitoring and alerting across different infrastructure targets:
- Cluster-level monitoring: Administrators managing entire Kubernetes/OKD clusters need to define cluster-wide alerts, receivers, and scrape targets.
- Tenant-level monitoring: Multi-tenant clusters where teams are confined to namespaces need monitoring scoped to their resources.
- Application-level monitoring: Developers deploying applications want zero-config monitoring that "just works" for their services.
The monitoring landscape is fragmented:
- OKD/OpenShift: Built-in Prometheus with AlertmanagerConfig CRDs
- KubePrometheus: Helm-based stack with PrometheusRule CRDs
- RHOB (Red Hat Observability): Operator-based with MonitoringStack CRDs
- Standalone Prometheus: Raw Prometheus deployments
Each system has different CRDs, different installation methods, and different configuration APIs.
Decision
We implement a trait-based architecture with compile-time capability verification that provides:
- Type-safe abstractions via parameterized traits: AlertReceiver<S>, AlertRule<S>, ScrapeTarget<S>
- Compile-time topology compatibility via the Observability<S> capability bound
- Three levels of abstraction: Cluster, Tenant, and Application monitoring
- Pre-built alert rules as functions that return typed structs
Core Traits
```rust
// domain/topology/monitoring.rs

/// Marker trait for systems that send alerts (Prometheus, etc.)
pub trait AlertSender: Send + Sync + std::fmt::Debug {
    fn name(&self) -> String;
}

/// Defines how a receiver (Discord, Slack, etc.) builds its configuration
/// for a specific sender type
pub trait AlertReceiver<S: AlertSender>: std::fmt::Debug + Send + Sync {
    fn build(&self) -> Result<ReceiverInstallPlan, InterpretError>;
    fn name(&self) -> String;
    fn clone_box(&self) -> Box<dyn AlertReceiver<S>>;
}

/// Defines how an alert rule builds its PrometheusRule configuration
pub trait AlertRule<S: AlertSender>: std::fmt::Debug + Send + Sync {
    fn build_rule(&self) -> Result<serde_json::Value, InterpretError>;
    fn name(&self) -> String;
    fn clone_box(&self) -> Box<dyn AlertRule<S>>;
}

/// Capability that topologies implement to support monitoring
pub trait Observability<S: AlertSender> {
    async fn install_alert_sender(&self, sender: &S, inventory: &Inventory)
        -> Result<PreparationOutcome, PreparationError>;
    async fn install_receivers(&self, sender: &S, inventory: &Inventory,
        receivers: Option<Vec<Box<dyn AlertReceiver<S>>>>) -> Result<...>;
    async fn install_rules(&self, sender: &S, inventory: &Inventory,
        rules: Option<Vec<Box<dyn AlertRule<S>>>>) -> Result<...>;
    async fn add_scrape_targets(&self, sender: &S, inventory: &Inventory,
        scrape_targets: Option<Vec<Box<dyn ScrapeTarget<S>>>>) -> Result<...>;
    async fn ensure_monitoring_installed(&self, sender: &S, inventory: &Inventory)
        -> Result<...>;
}
```
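To make the pattern concrete, here is a minimal, self-contained sketch of one receiver implementing AlertReceiver<S> for one sender. WebhookReceiver and the stub bodies of ReceiverInstallPlan and InterpretError are illustrative stand-ins, not Harmony's actual definitions; the blanket Clone impl shows the standard way clone_box makes boxed trait objects cloneable.

```rust
use std::fmt::Debug;

// Stub stand-ins for Harmony's real types (assumed shapes, for illustration only).
#[derive(Debug)]
pub struct ReceiverInstallPlan {
    pub resource_yaml: String,
}

#[derive(Debug)]
pub struct InterpretError(pub String);

pub trait AlertSender: Send + Sync + Debug {
    fn name(&self) -> String;
}

pub trait AlertReceiver<S: AlertSender>: Debug + Send + Sync {
    fn build(&self) -> Result<ReceiverInstallPlan, InterpretError>;
    fn name(&self) -> String;
    fn clone_box(&self) -> Box<dyn AlertReceiver<S>>;
}

// clone_box lets boxed receivers be cloned even though trait objects
// cannot implement Clone directly.
impl<S: AlertSender> Clone for Box<dyn AlertReceiver<S>> {
    fn clone(&self) -> Self {
        self.clone_box()
    }
}

#[derive(Debug)]
pub struct KubePrometheus;

impl AlertSender for KubePrometheus {
    fn name(&self) -> String {
        "kube-prometheus".to_string()
    }
}

#[derive(Debug, Clone)]
pub struct WebhookReceiver {
    pub url: String,
}

// One impl per sender type: this is exactly where stack-specific
// configuration lives.
impl AlertReceiver<KubePrometheus> for WebhookReceiver {
    fn build(&self) -> Result<ReceiverInstallPlan, InterpretError> {
        Ok(ReceiverInstallPlan {
            resource_yaml: format!("webhook_configs:\n- url: {}", self.url),
        })
    }

    fn name(&self) -> String {
        "webhook".to_string()
    }

    fn clone_box(&self) -> Box<dyn AlertReceiver<KubePrometheus>> {
        Box::new(self.clone())
    }
}
```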
Alert Sender Types
Each monitoring stack is a distinct AlertSender:
| Sender | Module | Use Case |
|---|---|---|
| OpenshiftClusterAlertSender | monitoring/okd/ | OKD/OpenShift built-in monitoring |
| KubePrometheus | monitoring/kube_prometheus/ | Helm-deployed kube-prometheus-stack |
| Prometheus | monitoring/prometheus/ | Standalone Prometheus via Helm |
| RedHatClusterObservability | monitoring/red_hat_cluster_observability/ | RHOB operator |
| Grafana | monitoring/grafana/ | Grafana-managed alerting |
Three Levels of Monitoring
1. Cluster-Level Monitoring
For cluster administrators. Full control over monitoring infrastructure.
```rust
// examples/okd_cluster_alerts/src/main.rs
OpenshiftClusterAlertScore {
    sender: OpenshiftClusterAlertSender,
    receivers: vec![Box::new(DiscordReceiver { ... })],
    rules: vec![Box::new(alert_rules)],
    scrape_targets: Some(vec![Box::new(external_exporters)]),
}
```
Characteristics:
- Cluster-scoped CRDs and resources
- Can add external scrape targets (outside cluster)
- Manages Alertmanager configuration
- Requires cluster-admin privileges
2. Tenant-Level Monitoring
For teams confined to namespaces. The topology determines tenant context.
```rust
// The topology's Observability impl handles namespace scoping
impl Observability<KubePrometheus> for K8sAnywhereTopology {
    async fn install_rules(&self, sender: &KubePrometheus, ...) {
        // The topology knows whether it is tenant-scoped
        let namespace = self.get_tenant_config().await
            .map(|t| t.name)
            .unwrap_or_else(|| "default".to_string());
        // Install rules in the tenant namespace
    }
}
```
Characteristics:
- Namespace-scoped resources
- Cannot modify cluster-level monitoring config
- May have restricted receiver types
- Runtime validation of permissions (cannot be fully compile-time)
3. Application-Level Monitoring
For developers. Zero-config, opinionated monitoring.
```rust
// modules/application/features/monitoring.rs
pub struct Monitoring {
    pub application: Arc<dyn Application>,
    pub alert_receiver: Vec<Box<dyn AlertReceiver<Prometheus>>>,
}

impl<T: Topology + Observability<Prometheus> + TenantManager + ...>
    ApplicationFeature<T> for Monitoring
{
    async fn ensure_installed(&self, topology: &T) -> Result<...> {
        // Auto-creates ServiceMonitor
        // Auto-installs Ntfy for notifications
        // Handles tenant namespace automatically
        // Wires up sensible defaults
    }
}
```
Characteristics:
- Automatic ServiceMonitor creation
- Opinionated notification channel (Ntfy)
- Tenant-aware via topology
- Minimal configuration required
Rationale
Why Generic Traits Instead of Unified Types?
Each monitoring stack (OKD, KubePrometheus, RHOB) has fundamentally different CRDs:
```rust
// OKD uses AlertmanagerConfig with a different structure
AlertmanagerConfig { spec: { receivers: [...] } }

// RHOB uses secret references for webhook URLs
MonitoringStack { spec: { alertmanagerConfig: { discordConfigs: [{ apiURL: { key: "..." } }] } } }

// KubePrometheus uses the Alertmanager CRD with different field names
Alertmanager { spec: { config: { receivers: [...] } } }
```
A unified type would either:
- Be a lowest-common-denominator (loses stack-specific features)
- Be a complex union type (hard to use, easy to misconfigure)
Generic traits let each stack express its configuration naturally while providing a consistent interface.
Why Compile-Time Capability Bounds?
```rust
impl<T: Topology + Observability<OpenshiftClusterAlertSender>> Score<T>
    for OpenshiftClusterAlertScore { ... }
```
This fails at compile time if you try to use OpenshiftClusterAlertScore with a topology that doesn't support OKD monitoring. This prevents the "config-is-valid-but-platform-is-wrong" errors that Harmony was designed to eliminate.
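The mechanism can be demonstrated in miniature. In this self-contained sketch, OkdTopology, BareMetalTopology, OkdObservability, and apply_cluster_alerts are all hypothetical stand-ins (the ADR's real traits carry async methods and more structure), but the compile-time rejection works the same way:

```rust
pub trait Topology {
    fn name(&self) -> String;
}

// Marker capability: only topologies that can host OKD monitoring implement it.
pub trait OkdObservability {}

pub struct OkdTopology;
impl Topology for OkdTopology {
    fn name(&self) -> String {
        "okd".to_string()
    }
}
impl OkdObservability for OkdTopology {}

pub struct BareMetalTopology;
impl Topology for BareMetalTopology {
    fn name(&self) -> String {
        "bare-metal".to_string()
    }
}
// Deliberately no OkdObservability impl for BareMetalTopology.

// The capability bound rejects incompatible topologies at compile time.
pub fn apply_cluster_alerts<T: Topology + OkdObservability>(topology: &T) -> String {
    format!("installing OKD cluster alerts on {}", topology.name())
}
```

Calling apply_cluster_alerts(&BareMetalTopology) does not produce a runtime error; it simply does not compile, because the OkdObservability bound is not satisfied.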
Why Not a MonitoringStack Abstraction (V2 Approach)?
The V2 approach proposed a unified MonitoringStack that hides sender selection:
```rust
// V2 approach - rejected
MonitoringStack::new(MonitoringApiVersion::V2CRD)
    .add_alert_channel(discord)
```
Problems:
- Hides which sender you're using, losing compile-time guarantees
- "Version selection" actually chooses between fundamentally different systems
- Would need to handle all stack-specific features through a generic interface
The current approach is explicit: you choose OpenshiftClusterAlertSender and the compiler verifies compatibility.
Why Runtime Validation for Tenants?
Tenant confinement is determined at runtime by the topology and K8s RBAC. We cannot know at compile time whether a user has cluster-admin or namespace-only access.
Options considered:
- Compile-time tenant markers - Would require modeling entire RBAC hierarchy in types. Over-engineering.
- Runtime validation - Current approach. Fails with clear K8s permission errors if insufficient access.
- No tenant support - Would exclude a major use case.
Runtime validation is the pragmatic choice. The failure mode is clear (K8s API error) and occurs early in execution.
Note: we intend to eventually add compile-time validation for such things. Rust macros are powerful enough to discover the actual capabilities we are dealing with, similar to the sqlx approach in its query! macros.
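The early-failure behavior described above can be sketched as a simple scope check. TenantAccess and validate_scope are hypothetical illustrations, not Harmony's API; in practice the equivalent error surfaces as a K8s RBAC rejection:

```rust
// Hypothetical model of what the runtime knows about the caller's access.
#[derive(Debug, PartialEq)]
pub enum TenantAccess {
    ClusterAdmin,
    NamespaceOnly(String),
}

// Fails early, before any resources are created, mirroring the
// "clear K8s permission error" failure mode described above.
pub fn validate_scope(access: &TenantAccess, needs_cluster_scope: bool) -> Result<(), String> {
    match (access, needs_cluster_scope) {
        (TenantAccess::ClusterAdmin, _) => Ok(()),
        (TenantAccess::NamespaceOnly(_), false) => Ok(()),
        (TenantAccess::NamespaceOnly(ns), true) => Err(format!(
            "namespace-scoped access ({ns}) cannot modify cluster-level monitoring"
        )),
    }
}
```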
Consequences
Pros
- Type Safety: Invalid configurations are caught at compile time
- Extensibility: Adding a new monitoring stack requires implementing traits, not modifying core code
- Clear Separation: Cluster/Tenant/Application levels have distinct entry points
- Reusable Rules: Pre-built alert rules as functions (high_pvc_fill_rate_over_two_days())
- CRD Accuracy: Type definitions match actual Kubernetes CRDs exactly
Cons
- Implementation Explosion: DiscordReceiver implements AlertReceiver<S> for each sender type (3+ implementations)
- Learning Curve: Understanding the trait hierarchy takes time
- clone_box Boilerplate: Required for trait object cloning (3 lines per impl)
Mitigations
- Implementation explosion is contained: each receiver type has O(senders) implementations, but receivers are rare compared to rules
- Learning curve is documented with examples at each level
- clone_box boilerplate is minimal and copy-paste
Alternatives Considered
Unified MonitoringStack Type
See "Why Not a MonitoringStack Abstraction" above. Rejected for losing compile-time safety.
Helm-Only Approach
Use HelmScore directly for each monitoring deployment. Rejected because:
- No type safety for alert rules
- Cannot compose with application features
- No tenant awareness
Separate Modules Per Use Case
Have cluster_monitoring/, tenant_monitoring/, app_monitoring/ as separate modules. Rejected because:
- Massive code duplication
- No shared abstraction for receivers/rules
- Adding a feature requires three implementations
Implementation Notes
Module Structure
```
modules/monitoring/
├── mod.rs                           # Public exports
├── alert_channel/                   # Receivers (Discord, Webhook)
├── alert_rule/                      # Rules and pre-built alerts
│   ├── prometheus_alert_rule.rs
│   └── alerts/                      # Library of pre-built rules
│       ├── k8s/                     # K8s-specific (pvc, pod, memory)
│       └── infra/                   # Infrastructure (opnsense, dell)
├── okd/                             # OpenshiftClusterAlertSender
├── kube_prometheus/                 # KubePrometheus
├── prometheus/                      # Prometheus
├── red_hat_cluster_observability/   # RHOB
├── grafana/                         # Grafana
├── application_monitoring/          # Application-level scores
└── scrape_target/                   # External scrape targets
```
Adding a New Alert Sender
1. Create the sender type: pub struct MySender; impl AlertSender for MySender { ... }
2. Implement Observability<MySender> for topologies that support it
3. Create CRD types in a crd/ subdirectory
4. Implement AlertReceiver<MySender> for existing receivers
5. Implement AlertRule<MySender> for AlertManagerRuleGroup
Adding a New Alert Rule
```rust
pub fn my_custom_alert() -> PrometheusAlertRule {
    PrometheusAlertRule::new("MyAlert", "up == 0")
        .for_duration("5m")
        .label("severity", "critical")
        .annotation("summary", "Service is down")
}
```
No trait implementation needed - AlertManagerRuleGroup already handles conversion.
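For orientation, a builder like the one used above could look roughly as follows. This is a hedged sketch: the field names, Vec-based label storage, and method bodies are assumptions for illustration, not Harmony's actual PrometheusAlertRule:

```rust
#[derive(Debug, Clone)]
pub struct PrometheusAlertRule {
    pub alert: String,
    pub expr: String,
    // Would serialize as `for` in the PrometheusRule CRD.
    pub for_: Option<String>,
    pub labels: Vec<(String, String)>,
    pub annotations: Vec<(String, String)>,
}

impl PrometheusAlertRule {
    pub fn new(alert: &str, expr: &str) -> Self {
        Self {
            alert: alert.to_string(),
            expr: expr.to_string(),
            for_: None,
            labels: Vec::new(),
            annotations: Vec::new(),
        }
    }

    // Consuming builder methods allow the fluent chaining shown above.
    pub fn for_duration(mut self, d: &str) -> Self {
        self.for_ = Some(d.to_string());
        self
    }

    pub fn label(mut self, key: &str, value: &str) -> Self {
        self.labels.push((key.to_string(), value.to_string()));
        self
    }

    pub fn annotation(mut self, key: &str, value: &str) -> Self {
        self.annotations.push((key.to_string(), value.to_string()));
        self
    }
}
```

Because rules are plain values built this way, a library function like high_pvc_fill_rate_over_two_days() is just a function returning one of these structs.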