Compare commits


40 Commits

Author SHA1 Message Date
ccc26e07eb feat: harmony_asset crate to manage assets, local, s3, http urls, etc
Some checks failed
Run Check Script / check (pull_request) Failing after 17s
2026-03-21 11:10:51 -04:00
9a67bcc96f Merge pull request 'fix/cnpgInstallation' (#251) from fix/cnpgInstallation into master
Some checks failed
Run Check Script / check (push) Successful in 1m45s
Compile and package harmony_composer / package_harmony_composer (push) Failing after 2m15s
Reviewed-on: #251
2026-03-20 21:02:53 +00:00
a377fc1404 Merge branch 'master' into fix/cnpgInstallation
All checks were successful
Run Check Script / check (pull_request) Successful in 1m44s
2026-03-20 20:56:30 +00:00
c9977fee12 fix: CI file moved
All checks were successful
Run Check Script / check (pull_request) Successful in 2m5s
2026-03-20 16:48:38 -04:00
64bf585e07 fix: remove check.sh with broken path handling and fix formatting
Some checks failed
Run Check Script / check (pull_request) Failing after 12s
2026-03-20 16:41:30 -04:00
44e2c45435 fix: flaky tests due to bad environment variables handling in harmony_config crate 2026-03-20 16:40:08 -04:00
cdccbc8939 fix: formatting and minor stuff 2026-03-20 16:34:48 -04:00
9830971d05 feat: Creat namespace in k8s client and wait for namespace ready utility functions 2026-03-20 16:15:51 -04:00
e1183ef6de feat: K8s postgresql score now ensures cnpg is installed 2026-03-20 07:02:26 -04:00
444fea81b8 docs: Fix examples cli in docs
Some checks failed
Run Check Script / check (pull_request) Failing after 12s
2026-03-19 22:52:05 -04:00
907ae04195 chore: Add book.sh script and ci.sh, moved check.sh to build/ folder
Some checks failed
Run Check Script / check (pull_request) Failing after 9s
2026-03-19 22:43:32 -04:00
64582caa64 docs: Major rehaul of documentation
Some checks failed
Run Check Script / check (pull_request) Failing after 10s
2026-03-19 22:38:55 -04:00
f5736fcc37 wip: Config and secret management merging planification and high level documentation
Some checks failed
Run Check Script / check (pull_request) Failing after 43s
2026-03-19 17:02:17 -04:00
7a1e84fb68 doc: Adr 020 on interactive harmony configuration for great UX 2026-03-18 10:40:19 -04:00
8499f4d1b7 Merge pull request 'fix: small details were preventing to re-save frontends,backends and healthchecks in opnsense UI' (#248) from fix/load-balancer-xml into master
Some checks failed
Run Check Script / check (push) Has been cancelled
Compile and package harmony_composer / package_harmony_composer (push) Has been cancelled
Reviewed-on: #248
2026-03-17 14:38:35 +00:00
231d9b878e debt: Ignore interactive tests with inquire prompts
All checks were successful
Run Check Script / check (pull_request) Successful in 1m21s
2026-03-15 11:37:31 -04:00
ee2dade0be Merge remote-tracking branch 'origin/master' into feat/brocade_assisted_setup
Some checks failed
Run Check Script / check (pull_request) Failing after 1m28s
2026-03-15 10:12:22 -04:00
aa07f4c8ad Merge pull request 'fix/dynamically_get_public_domain' (#234) from fix/dynamically_get_public_domain into master
Some checks failed
Compile and package harmony_composer / package_harmony_composer (push) Failing after 1m48s
Run Check Script / check (push) Failing after 11m1s
Reviewed-on: #234
Reviewed-by: johnride <jg@nationtech.io>
2026-03-15 14:07:25 +00:00
77bb138497 Merge remote-tracking branch 'origin/master' into fix/dynamically_get_public_domain
All checks were successful
Run Check Script / check (pull_request) Successful in 1m20s
2026-03-15 09:54:36 -04:00
a16879b1b6 Merge pull request 'fix: readded tokio retry to get ca cert for a nats cluster which was accidentally removed during a refactor' (#229) from fix/nats-ca-cert-retry into master
Some checks failed
Compile and package harmony_composer / package_harmony_composer (push) Failing after 2m1s
Run Check Script / check (push) Failing after 12m23s
Reviewed-on: #229
2026-03-15 12:36:05 +00:00
f57e6f5957 Merge pull request 'feat: add priorityClass to node_health daemonset' (#249) from feat/health_endpoint_priority_class into master
Some checks failed
Run Check Script / check (push) Successful in 1m28s
Compile and package harmony_composer / package_harmony_composer (push) Failing after 1m59s
Reviewed-on: #249
2026-03-14 18:53:30 +00:00
7605d05de3 fix: opnsense fixes for st-mcd (cb1)
All checks were successful
Run Check Script / check (pull_request) Successful in 1m29s
2026-03-13 13:13:37 -04:00
b244127843 feat: add priorityClass to node_health daemonset
All checks were successful
Run Check Script / check (pull_request) Successful in 1m27s
2026-03-13 11:18:18 -04:00
67c3265286 fix: small details were preventing to re-save frontends,backends and healthchecks in opnsense UI
All checks were successful
Run Check Script / check (pull_request) Successful in 2m12s
2026-03-13 10:31:17 -04:00
d10598d01e Merge pull request 'okdload balancer using 1936 port http healthcheck' (#240) from feat/okd_loadbalancer_betterhealthcheck into master
Some checks failed
Run Check Script / check (push) Successful in 1m26s
Compile and package harmony_composer / package_harmony_composer (push) Failing after 1m56s
Reviewed-on: #240
2026-03-10 17:45:51 +00:00
61ba7257d0 fix: remove broken test
All checks were successful
Run Check Script / check (pull_request) Successful in 1m22s
2026-03-10 13:40:24 -04:00
b0e9594d92 Merge branch 'master' into feat/okd_loadbalancer_betterhealthcheck
Some checks failed
Run Check Script / check (pull_request) Failing after 47s
2026-03-07 23:06:50 +00:00
bfb86f63ce fix: xml field for vlan
All checks were successful
Run Check Script / check (pull_request) Successful in 1m31s
2026-03-07 11:29:44 -05:00
d920de34cf fix: configure health_check: None for public_services
All checks were successful
Run Check Script / check (pull_request) Successful in 1m35s
2026-03-05 14:55:00 -05:00
4276b9137b fix: put the hc on private_services, not public_services
All checks were successful
Run Check Script / check (pull_request) Successful in 1m32s
2026-03-05 14:35:33 -05:00
6ab88ab8d9 Merge branch 'master' into feat/okd_loadbalancer_betterhealthcheck 2026-03-04 10:46:57 -05:00
53d0704a35 wip: okdload balancer using 1936 port http healthcheck
Some checks failed
Run Check Script / check (pull_request) Failing after 31s
2026-03-02 20:47:41 -05:00
de49e9ebcc feat: Brocade switch setup now asks questions for missing links instead of failing
Some checks failed
Run Check Script / check (pull_request) Failing after 53s
2026-02-19 10:31:47 -05:00
d8ab9d52a4 fix:broken test
All checks were successful
Run Check Script / check (pull_request) Successful in 1m0s
2026-02-17 15:34:42 -05:00
2cb7aeefc0 fix: deploys replicated postgresql with site 2 as standby
Some checks failed
Run Check Script / check (pull_request) Failing after 1m8s
2026-02-17 15:02:00 -05:00
16016febcf wip: adding impl details for deploying connected replica cluster 2026-02-16 16:22:30 -05:00
e709de531d fix: added route building to failover topology 2026-02-13 16:08:05 -05:00
6ab0f3a6ab wip 2026-02-13 15:48:24 -05:00
724ab0b888 wip: removed hardcoding and added fn to trait tlsrouter 2026-02-13 15:18:23 -05:00
8b6ce8d069 fix: readded tokio retry to get ca cert for a nats cluster which was accidentally removed during a refactor
All checks were successful
Run Check Script / check (pull_request) Successful in 1m9s
2026-02-06 09:09:01 -05:00
256 changed files with 8382 additions and 7972 deletions

@@ -15,4 +15,4 @@ jobs:
uses: actions/checkout@v4
- name: Run check script
run: bash check.sh
run: bash build/check.sh

3
.gitignore vendored

@@ -29,3 +29,6 @@ Cargo.lock
# Useful to create ignore folders for temp files and notes
ignore
# Generated book
book

@@ -1,548 +0,0 @@
# CI and Testing Strategy for Harmony
## Executive Summary
Harmony aims to become a CNCF project, requiring a robust CI pipeline that demonstrates real-world reliability. The goal is to run **all examples** in CI, from simple k3d deployments to full HA OKD clusters on bare metal. This document provides context for designing and implementing this testing infrastructure.
---
## Project Context
### What is Harmony?
Harmony is an infrastructure automation framework that is **code-first and code-only**. Operators write Rust programs to declare and drive infrastructure, rather than YAML files or DSL configs. Key differentiators:
1. **Compile-time safety**: The type system prevents "config-is-valid-but-platform-is-wrong" errors
2. **Topology abstraction**: Write once, deploy to any environment (local k3d, OKD, bare metal, cloud)
3. **Capability-based design**: Scores declare what they need; topologies provide what they have
### Core Abstractions
| Concept | Description |
|---------|-------------|
| **Score** | Declarative description of desired state (the "what") |
| **Topology** | Logical representation of infrastructure (the "where") |
| **Capability** | A feature a topology offers (the "how") |
| **Interpret** | Execution logic connecting Score to Topology |
### Compile-Time Verification
```rust
// This compiles only if K8sAnywhereTopology provides K8sclient + HelmCommand
impl<T: Topology + K8sclient + HelmCommand> Score<T> for MyScore { ... }
// This FAILS to compile - LinuxHostTopology doesn't provide K8sclient
// (intentionally broken example for testing)
impl<T: Topology + K8sclient> Score<T> for K8sResourceScore { ... }
// error: LinuxHostTopology does not implement K8sclient
```
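The capability bound above can be reproduced in a few lines of plain Rust. The following is a minimal sketch using hypothetical, simplified stand-ins for the Harmony traits (the real trait names and signatures in the `harmony` crate may differ; `apply`, `interpret` and `deploy` are illustrative only). It shows that a score implemented for `T: Topology + K8sClient` is usable with a Kubernetes-capable topology, while a call such as `deploy(&score, &LinuxHostTopology)` would be rejected by the compiler because `Score<LinuxHostTopology>` is not implemented for the score.

```rust
// Hypothetical, simplified stand-ins for Harmony's traits (assumptions for
// illustration; the real definitions may differ).
trait Topology {
    fn name(&self) -> &'static str;
}

/// Capability: the topology can talk to a Kubernetes API.
trait K8sClient {
    fn apply(&self, manifest: &str) -> String;
}

/// A score describes desired state for every topology `T` it is implemented for.
trait Score<T: Topology> {
    fn interpret(&self, topology: &T) -> String;
}

struct K8sAnywhereTopology;
struct LinuxHostTopology;

impl Topology for K8sAnywhereTopology {
    fn name(&self) -> &'static str {
        "k8s-anywhere"
    }
}

impl K8sClient for K8sAnywhereTopology {
    fn apply(&self, manifest: &str) -> String {
        format!("applied {manifest} on {}", self.name())
    }
}

// A topology without the Kubernetes capability.
impl Topology for LinuxHostTopology {
    fn name(&self) -> &'static str {
        "linux-host"
    }
}

struct K8sResourceScore {
    manifest: &'static str,
}

// Implemented only for topologies that also provide `K8sClient`.
impl<T: Topology + K8sClient> Score<T> for K8sResourceScore {
    fn interpret(&self, topology: &T) -> String {
        topology.apply(self.manifest)
    }
}

/// Accepts only (score, topology) pairs for which the score is implemented.
/// `deploy(&score, &LinuxHostTopology)` does not type-check for `K8sResourceScore`.
fn deploy<T: Topology, S: Score<T>>(score: &S, topology: &T) -> String {
    score.interpret(topology)
}
```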
---
## Current Examples Inventory
### Summary Statistics
| Category | Count | CI Complexity |
|----------|-------|---------------|
| k3d-compatible | 22 | Low - single k3d cluster |
| OKD-specific | 4 | Medium - requires OKD cluster |
| Bare metal | 5 | High - requires physical infra or nested virtualization |
| Multi-cluster | 3 | High - requires multiple K8s clusters |
| No infra needed | 4 | Trivial - local only |
### Detailed Example Classification
#### Tier 1: k3d-Compatible (22 examples)
Can run on a local k3d cluster with minimal setup:
| Example | Topology | Capabilities | Special Notes |
|---------|----------|--------------|---------------|
| zitadel | K8sAnywhereTopology | K8sClient, HelmCommand | SSO/Identity |
| node_health | K8sAnywhereTopology | K8sClient | Health checks |
| public_postgres | K8sAnywhereTopology | K8sClient, HelmCommand, TlsRouter | Needs ingress |
| openbao | K8sAnywhereTopology | K8sClient, HelmCommand | Vault alternative |
| rust | K8sAnywhereTopology | K8sClient, HelmCommand, TlsRouter | Webapp deployment |
| cert_manager | K8sAnywhereTopology | K8sClient, CertificateManagement | TLS certificates |
| try_rust_webapp | K8sAnywhereTopology | K8sClient, HelmCommand, TlsRouter | Full webapp |
| monitoring | K8sAnywhereTopology | K8sClient, HelmCommand, Observability | Prometheus |
| application_monitoring_with_tenant | K8sAnywhereTopology | K8sClient, HelmCommand, TenantManager, Observability | Multi-tenant |
| monitoring_with_tenant | K8sAnywhereTopology | K8sClient, HelmCommand, TenantManager, Observability | Multi-tenant |
| postgresql | K8sAnywhereTopology | K8sClient, HelmCommand | CloudNativePG |
| ntfy | K8sAnywhereTopology | K8sClient, HelmCommand | Notifications |
| tenant | K8sAnywhereTopology | K8sClient, TenantManager | Namespace isolation |
| lamp | K8sAnywhereTopology | K8sClient, HelmCommand, TlsRouter | LAMP stack |
| k8s_drain_node | K8sAnywhereTopology | K8sClient | Node operations |
| k8s_write_file_on_node | K8sAnywhereTopology | K8sClient | Node operations |
| remove_rook_osd | K8sAnywhereTopology | K8sClient | Ceph operations |
| validate_ceph_cluster_health | K8sAnywhereTopology | K8sClient | Ceph health |
| kube-rs | Direct kube | K8sClient | Raw kube-rs demo |
| brocade_snmp_server | K8sAnywhereTopology | K8sClient | SNMP collector |
| harmony_inventory_builder | LocalhostTopology | None | Network scanning |
| cli | LocalhostTopology | None | CLI demo |
#### Tier 2: OKD/OpenShift-Specific (4 examples)
Require OKD/OpenShift features not available in vanilla K8s:
| Example | Topology | OKD-Specific Feature |
|---------|----------|---------------------|
| okd_cluster_alerts | K8sAnywhereTopology | OpenShift Monitoring CRDs |
| operatorhub_catalog | K8sAnywhereTopology | OpenShift OperatorHub |
| rhob_application_monitoring | K8sAnywhereTopology | RHOB (Red Hat Observability) |
| nats-supercluster | K8sAnywhereTopology | OKD Routes (OpenShift Ingress) |
#### Tier 3: Bare Metal Infrastructure (5 examples)
Require physical hardware or full virtualization:
| Example | Topology | Physical Requirements |
|---------|----------|----------------------|
| okd_installation | HAClusterTopology | OPNSense, Brocade switch, PXE boot, 3+ nodes |
| okd_pxe | HAClusterTopology | OPNSense, Brocade switch, PXE infrastructure |
| sttest | HAClusterTopology | Full HA cluster with all network services |
| opnsense | OPNSenseFirewall | OPNSense firewall access |
| opnsense_node_exporter | Custom | OPNSense firewall |
#### Tier 4: Multi-Cluster (3 examples)
Require multiple K8s clusters:
| Example | Topology | Clusters Required |
|---------|----------|-------------------|
| nats | K8sAnywhereTopology × 2 | 2 clusters with NATS gateways |
| nats-module | DecentralizedTopology | 3 clusters for supercluster |
| multisite_postgres | FailoverTopology | 2 clusters for replication |
---
## Testing Categories
### 1. Compile-Time Tests
These tests verify that the type system correctly rejects invalid configurations:
```rust
// Should NOT compile - K8sResourceScore on LinuxHostTopology.
// NOTE: `#[compile_fail]` is not a built-in attribute for test functions
// (rustc only understands `compile_fail` as a doctest annotation). It is
// pseudocode here; a real check would live in a `trybuild` test case.
#[test]
#[compile_fail]
fn test_k8s_score_on_linux_host() {
    let score = K8sResourceScore::new();
    let topology = LinuxHostTopology::new();
    // This line should fail to compile
    harmony_cli::run(Inventory::empty(), topology, vec![Box::new(score)], None);
}

// Should compile - K8sResourceScore on K8sAnywhereTopology
#[test]
fn test_k8s_score_on_k8s_topology() {
    let score = K8sResourceScore::new();
    let topology = K8sAnywhereTopology::from_env();
    // This should compile
    harmony_cli::run(Inventory::empty(), topology, vec![Box::new(score)], None);
}
```
**Implementation Options:**
- `trybuild` crate for compile-time failure tests
- Separate `tests/compile_fail/` directory with expected error messages
### 2. Unit Tests
Pure Rust logic without external dependencies:
- Score serialization/deserialization
- Inventory parsing
- Type conversions
- CRD generation
**Requirements:**
- No external services
- Sub-second execution
- Run on every PR
### 3. Integration Tests (k3d)
Deploy to a local k3d cluster:
**Setup:**
```bash
# Install k3d
curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
# Create cluster
k3d cluster create harmony-test \
--agents 3 \
--k3s-arg "--disable=traefik@server:0"
# Wait for ready
kubectl wait --for=condition=Ready nodes --all --timeout=120s
```
**Test Matrix:**
| Example | k3d | Test Type |
|---------|-----|-----------|
| zitadel | ✅ | Deploy + health check |
| cert_manager | ✅ | Deploy + certificate issuance |
| monitoring | ✅ | Deploy + metric collection |
| postgresql | ✅ | Deploy + database connectivity |
| tenant | ✅ | Namespace creation + isolation |
### 4. Integration Tests (OKD)
Deploy to OKD/OpenShift cluster:
**Options:**
1. **Nested virtualization**: Run OKD in VMs (slow, expensive)
2. **CRC (CodeReady Containers)**: Single-node OKD (resource intensive)
3. **Managed OpenShift**: AWS/Azure/GCP (costly)
4. **Existing cluster**: Connect to pre-provisioned cluster (fastest)
**Test Matrix:**
| Example | OKD Required | Test Type |
|---------|--------------|-----------|
| okd_cluster_alerts | ✅ | Alert rule deployment |
| rhob_application_monitoring | ✅ | RHOB stack deployment |
| operatorhub_catalog | ✅ | Operator installation |
### 5. End-to-End Tests (Full Infrastructure)
Complete infrastructure deployment including bare metal:
**Options:**
1. **Libvirt + KVM**: Virtual machines on CI runner
2. **Nested KVM**: KVM inside KVM (for cloud CI)
3. **Dedicated hardware**: Physical test lab
4. **Mock/Hybrid**: Mock physical components, real K8s
---
## CI Environment Options
### Option A: GitHub Actions (Current Standard)
**Pros:**
- Native GitHub integration
- Large runner ecosystem
- Free for open source
**Cons:**
- Limited nested virtualization support
- 6-hour job timeout
- Resource constraints on free runners
**Matrix:**
```yaml
strategy:
  matrix:
    os: [ubuntu-latest]
    rust: [stable, beta]
    k8s: [k3d, kind]
    tier: [unit, k3d-integration]
```
### Option B: Self-Hosted Runners
**Pros:**
- Full control over environment
- Can run nested virtualization
- No time limits
- Persistent state between runs
**Cons:**
- Maintenance overhead
- Cost of infrastructure
- Security considerations
**Setup:**
- Bare metal servers with KVM support
- Pre-installed k3d, kind, CRC
- OPNSense VM for network tests
### Option C: Hybrid (GitHub + Self-Hosted)
**Pros:**
- Fast unit tests on GitHub runners
- Heavy tests on self-hosted infrastructure
- Cost-effective
**Cons:**
- Two CI systems to maintain
- Complexity in test distribution
### Option D: Cloud CI (CircleCI, GitLab CI, etc.)
**Pros:**
- Often better resource options
- Docker-in-Docker support
- Better nested virtualization
**Cons:**
- Cost
- Less GitHub-native
---
## Performance Requirements
### Target Execution Times
| Test Category | Target Time | Current (est.) |
|---------------|-------------|----------------|
| Compile-time tests | < 30s | Unknown |
| Unit tests | < 60s | Unknown |
| k3d integration (per example) | < 120s | 60-300s |
| Full k3d matrix | < 15 min | 30-60 min |
| OKD integration | < 30 min | 1-2 hours |
| Full E2E | < 2 hours | 4-8 hours |
### Sub-Second Performance Strategies
1. **Parallel execution**: Run independent tests concurrently
2. **Incremental testing**: Only run affected tests on changes
3. **Cached clusters**: Pre-warm k3d clusters
4. **Layered testing**: Fail fast on cheaper tests
5. **Mock external services**: Fake Discord webhooks, etc.
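
The "layered testing" strategy can be sketched as a runner that executes tiers from cheapest to most expensive and stops at the first failure. This is only an illustration under assumptions: the `Tier` enum and `run_layered` function are hypothetical names, and the closure stands in for whatever actually executes a tier.

```rust
/// Hypothetical test tiers, ordered from cheapest to most expensive.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tier {
    CompileFail,
    Unit,
    K3d,
    Okd,
    Kvm,
}

/// Runs the tiers in the given order and stops at the first failure.
/// Returns the tiers that actually ran and the failing tier, if any.
fn run_layered(
    tiers: &[Tier],
    mut run: impl FnMut(Tier) -> bool,
) -> (Vec<Tier>, Option<Tier>) {
    let mut executed = Vec::new();
    for &tier in tiers {
        executed.push(tier);
        if !run(tier) {
            // Fail fast: more expensive tiers are never started.
            return (executed, Some(tier));
        }
    }
    (executed, None)
}
```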
---
## Test Data and Secrets Management
### Secrets Required
| Secret | Use | Storage |
|--------|-----|---------|
| Discord webhook URL | Alert receiver tests | GitHub Secrets |
| OPNSense credentials | Network tests | Self-hosted only |
| Cloud provider creds | Multi-cloud tests | Vault / GitHub Secrets |
| TLS certificates | Ingress tests | Generated on-the-fly |
### Test Data
| Data | Source | Strategy |
|------|--------|----------|
| Container images | Public registries | Cache locally |
| Helm charts | Public repos | Vendor in repo |
| K8s manifests | Generated | Dynamic |
---
## Proposed Test Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│ harmony_e2e_tests Package │
│ (cargo run -p harmony_e2e_tests) │
├─────────────────────────────────────────────────────────────────┤
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
│ │ Compile │ │ Unit │ │ Compile-Fail Tests │ │
│ │ Tests │ │ Tests │ │ (trybuild) │ │
│ │ < 30s │ │ < 60s │ │ < 30s │ │
│ └─────────────┘ └─────────────┘ └─────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ k3d Integration Tests │ │
│ │ Self-provisions k3d cluster, runs 22 examples │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ zitadel │ │ cert-mgr│ │ monitor │ │ postgres│ ... │ │
│ │ │ 60s │ │ 90s │ │ 120s │ │ 90s │ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │
│ │ Parallel Execution │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ OKD Integration Tests │ │
│ │ Connects to existing OKD cluster or provisions via KVM │ │
│ │ ┌─────────────────┐ ┌─────────────────────────────┐ │ │
│ │ │ okd_cluster_ │ │ rhob_application_ │ │ │
│ │ │ alerts (5 min) │ │ monitoring (10 min) │ │ │
│ │ └─────────────────┘ └─────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ KVM-based E2E Tests │ │
│ │ Uses Harmony's KVM module to provision test VMs │ │
│ │ ┌─────────────────┐ ┌─────────────────────────────┐ │ │
│ │ │ okd_installation│ │ Full HA cluster deployment │ │ │
│ │ │ (30-60 min) │ │ (60-120 min) │ │ │
│ │ └─────────────────┘ └─────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Any CI system (GitHub Actions, GitLab CI, Jenkins, cron) just runs:
cargo run -p harmony_e2e_tests
```
```
┌─────────────────────────────────────────────────────────────────┐
│                         GitHub Actions                          │
├─────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │   Compile   │  │    Unit     │  │   Compile-Fail Tests    │  │
│  │    Tests    │  │    Tests    │  │       (trybuild)        │  │
│  │    < 30s    │  │    < 60s    │  │          < 30s          │  │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                   k3d Integration Tests                   │  │
│  │  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐          │  │
│  │  │ zitadel │ │ cert-mgr│ │ monitor │ │ postgres│ ...      │  │
│  │  │   60s   │ │   90s   │ │  120s   │ │   90s   │          │  │
│  │  └─────────┘ └─────────┘ └─────────┘ └─────────┘          │  │
│  │                    Parallel Execution                     │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                       Self-Hosted Runners                       │
├─────────────────────────────────────────────────────────────────┤
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                   OKD Integration Tests                   │  │
│  │  ┌─────────────────┐  ┌─────────────────────────────┐     │  │
│  │  │ okd_cluster_    │  │ rhob_application_           │     │  │
│  │  │ alerts (5 min)  │  │ monitoring (10 min)         │     │  │
│  │  └─────────────────┘  └─────────────────────────────┘     │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │         KVM-based E2E Tests (Harmony provisions)          │  │
│  │  ┌─────────────────────────────────────────────────────┐  │  │
│  │  │ Harmony KVM Module provisions test VMs              │  │  │
│  │  │ - OKD HA Cluster (3 control plane, 2 workers)       │  │  │
│  │  │ - OPNSense VM (router/firewall)                     │  │  │
│  │  │ - Brocade simulator VM                              │  │  │
│  │  └─────────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
```
---
## Questions for Researchers
### Critical Questions
1. **Self-contained test runner**: How to design `harmony_e2e_tests` package that runs all tests with a single `cargo run` command?
2. **Nested Virtualization**: What are the prerequisites for running KVM inside a test environment?
3. **Cost Optimization**: How to minimize cloud costs while running comprehensive E2E tests?
4. **Test Isolation**: How to ensure test isolation when running parallel k3d tests?
5. **State Management**: Should we persist k3d clusters between test runs, or create fresh each time?
6. **Mocking Strategy**: Which external services (Discord, OPNSense, etc.) should be mocked vs. real?
7. **Compile-Fail Tests**: Best practices for testing Rust compile-time errors?
8. **Multi-Cluster Tests**: How to efficiently provision and connect multiple K8s clusters in tests?
9. **Secrets Management**: How to handle secrets for test environments without external CI dependencies?
10. **Test Flakiness**: Strategies for reducing flakiness in infrastructure tests?
11. **Reporting**: How to present test results for complex multi-environment test matrices?
12. **Prerequisite Detection**: How to detect and validate prerequisites (Docker, k3d, KVM) before running tests?
### Research Areas
1. **CI/CD Tools**: Evaluate GitHub Actions, GitLab CI, CircleCI, Tekton, Prow for Harmony's needs
2. **K8s Test Tools**: Evaluate kind, k3d, minikube, microk8s for local testing
3. **Mock Frameworks**: Evaluate mock-server, wiremock, hoverfly for external service mocking
4. **Test Frameworks**: Evaluate built-in Rust test, nextest, cargo-tarpaulin for performance
---
## Success Criteria
### Week 1 (Agentic Velocity)
- [ ] Compile-time verification tests working
- [ ] Unit tests for monitoring module
- [ ] First 5 k3d examples running in CI
- [ ] Mock framework for Discord webhooks
### Week 2
- [ ] All 22 k3d-compatible examples in CI
- [ ] OKD self-hosted runner operational
- [ ] KVM module reviewed and ready for CI
### Week 3-4
- [ ] Full E2E tests with KVM infrastructure
- [ ] Multi-cluster tests automated
- [ ] All examples tested in CI
### Month 2
- [ ] Sub-15-minute total CI time
- [ ] Weekly E2E tests on bare metal
- [ ] Documentation complete
- [ ] Ready for CNCF submission
---
## Prerequisites
### Hardware Requirements
| Component | Minimum | Recommended |
|-----------|---------|------------|
| CPU | 4 cores | 8+ cores (for parallel tests) |
| RAM | 8 GB | 32 GB (for KVM E2E) |
| Disk | 50 GB SSD | 500 GB NVMe |
| Docker | Required | Latest |
| k3d | Required | v5.6.0 |
| Kubectl | Required | v1.28.0 |
| libvirt | Required | 9.0.0 (for KVM tests) |
### Software Requirements
| Tool | Version |
|------|---------|
| Rust | 1.75+ |
| Docker | 24.0+ |
| k3d | v5.6.0+ |
| kubectl | v1.28+ |
| libvirt | 9.0.0 |
### Installation (One-time)
```bash
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Install Docker
curl -fsSL https://get.docker.com | sh
# Install k3d
curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
# Install kubectl
curl -LO "https://dl.k8s.io/release/v1.28.0/bin/linux/amd64/kubectl"
sudo install -m 0755 kubectl /usr/local/bin/kubectl
```
---
## Reference Materials
### Existing Code
- Examples: `examples/*/src/main.rs`
- Topologies: `harmony/src/domain/topology/`
- Capabilities: `harmony/src/domain/topology/` (trait definitions)
- Scores: `harmony/src/modules/*/`
### Documentation
- [Coding Guide](docs/coding-guide.md)
- [Core Concepts](docs/concepts.md)
- [Monitoring Architecture](docs/monitoring.md)
- [ADR-020: Monitoring](adr/020-monitoring-alerting-architecture.md)
### Related Projects
- Crossplane (similar abstraction model)
- Pulumi (infrastructure as code)
- Terraform (state management patterns)
- Flux/ArgoCD (GitOps testing patterns)

@@ -1,201 +0,0 @@
# Pragmatic CI and Testing Roadmap for Harmony
**Status**: Active implementation (March 2026)
**Core Principle**: Self-contained test runner — no dependency on centralized CI servers
All tests are executable via one command:
```bash
cargo run -p harmony_e2e_tests
```
The `harmony_e2e_tests` package:
- Provisions its own infrastructure when needed (k3d, KVM VMs)
- Runs all test tiers in sequence or selectively
- Reports results in text, JSON or JUnit XML
- Works identically on developer laptops, any Linux server, GitHub Actions, GitLab CI, Jenkins, cron jobs, etc.
- Is the single source of truth for what "passing CI" means
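
As an illustration of the reporting bullet, the sketch below renders a list of results as a minimal JUnit-style XML document using only the standard library. The `TestResult` struct and `to_junit_xml` function are hypothetical names, and the output covers only a small subset of what JUnit consumers accept (one `testsuite` with `testcase` and `failure` elements); the actual package may structure its reports differently.

```rust
/// Hypothetical result record for one executed example or test.
struct TestResult {
    name: String,
    seconds: f64,
    /// `None` means the test passed; `Some(msg)` carries the failure reason.
    failure: Option<String>,
}

/// Escapes the characters that are not allowed inside XML attribute values.
fn escape_xml(s: &str) -> String {
    s.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
        .replace('"', "&quot;")
}

/// Renders one `<testsuite>` element with a `<testcase>` per result.
fn to_junit_xml(suite: &str, results: &[TestResult]) -> String {
    let failures = results.iter().filter(|r| r.failure.is_some()).count();
    let mut xml = format!(
        "<testsuite name=\"{}\" tests=\"{}\" failures=\"{}\">\n",
        escape_xml(suite),
        results.len(),
        failures
    );
    for r in results {
        let name = escape_xml(&r.name);
        match &r.failure {
            None => xml.push_str(&format!(
                "  <testcase name=\"{name}\" time=\"{:.3}\"/>\n",
                r.seconds
            )),
            Some(msg) => xml.push_str(&format!(
                "  <testcase name=\"{name}\" time=\"{:.3}\">\n    <failure message=\"{}\"/>\n  </testcase>\n",
                r.seconds,
                escape_xml(msg)
            )),
        }
    }
    xml.push_str("</testsuite>\n");
    xml
}
```

Writing the returned string to the path given by `--report` would be enough for a CI system to pick it up as a test artifact.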
## Why This Approach
1. **Portability** — same command & behavior everywhere
2. **Harmony tests Harmony** — the framework validates itself
3. **No vendor lock-in** — GitHub Actions / GitLab CI are just triggers
4. **Perfect reproducibility** — developers reproduce any CI failure locally in seconds
5. **Offline capable** — after initial setup, most tiers run without internet
## Architecture: `harmony_e2e_tests` Package
```
harmony_e2e_tests/
├── Cargo.toml
├── src/
│ ├── main.rs # CLI entry point
│ ├── lib.rs # Test runner core logic
│ ├── tiers/
│ │ ├── mod.rs
│ │ ├── compile_fail.rs # trybuild-based compile-time checks
│ │ ├── unit.rs # cargo test --lib --workspace
│ │ ├── k3d.rs # k3d cluster + parallel example runs
│ │ ├── okd.rs # connect to existing OKD cluster
│ │ └── kvm.rs # full E2E via Harmony's own KVM module
│ ├── mocks/
│ │ ├── mod.rs
│ │ ├── discord.rs # mock Discord webhook receiver
│ │ └── opnsense.rs # mock OPNSense firewall API
│ └── infrastructure/
│ ├── mod.rs
│ ├── k3d.rs # k3d cluster lifecycle
│ └── kvm.rs # helper wrappers around KVM score
└── tests/
├── ui/ # trybuild compile-fail cases (*.rs + *.stderr)
└── fixtures/ # static test data / golden files
```
## CLI Interface (clap-based)
```bash
# Run everything (default)
cargo run -p harmony_e2e_tests
# Specific tier
cargo run -p harmony_e2e_tests -- --tier k3d
cargo run -p harmony_e2e_tests -- --tier compile
# Filter to one example
cargo run -p harmony_e2e_tests -- --tier k3d --example monitoring
# Parallelism control (k3d tier)
cargo run -p harmony_e2e_tests -- --parallel 8
# Reporting
cargo run -p harmony_e2e_tests -- --report junit.xml
cargo run -p harmony_e2e_tests -- --format json
# Debug helpers
cargo run -p harmony_e2e_tests -- --verbose --dry-run
```
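
The comma-separated `--tier` value shown in the CI examples (for instance `compile,unit,k3d`) has to be turned into a list of tiers at some point. The sketch below does that with the standard library only; it is an assumption about how parsing could work, not the package's actual implementation, which is described as clap-based. `Tier` and `parse_tiers` are hypothetical names.

```rust
/// Hypothetical tier identifiers matching the values used on the command line.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tier {
    Compile,
    Unit,
    K3d,
    Okd,
    Kvm,
}

/// Parses a value such as "compile,unit,k3d" into a list of tiers.
/// Fails on the first unknown tier name.
fn parse_tiers(value: &str) -> Result<Vec<Tier>, String> {
    value
        .split(',')
        .map(|raw| match raw.trim() {
            "compile" => Ok(Tier::Compile),
            "unit" => Ok(Tier::Unit),
            "k3d" => Ok(Tier::K3d),
            "okd" => Ok(Tier::Okd),
            "kvm" => Ok(Tier::Kvm),
            other => Err(format!("unknown tier: {other:?}")),
        })
        .collect()
}
```

With clap, a function of this shape could plausibly be plugged in as a custom value parser, but that wiring is left out here.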
## Test Tiers Ordered by Speed & Cost
| Tier | Duration target | Runner type | What it tests | Isolation strategy |
|------------------|------------------|----------------------|----------------------------------------------------|-----------------------------|
| Compile-fail | < 20 s | Any (GitHub free) | Invalid configs don't compile | Per-file trybuild |
| Unit | < 60 s | Any | Pure Rust logic | cargo test |
| k3d | 8-15 min | GitHub / self-hosted | 22+ k3d-compatible examples | Fresh k3d cluster + ns-per-example |
| OKD | 10-30 min | Self-hosted / CRC | OKD-specific features (Routes, Monitoring CRDs…) | Existing cluster via KUBECONFIG |
| KVM Full E2E | 60-180 min | Self-hosted bare-metal | Full HA OKD install + bare-metal scenarios | Harmony KVM score provisions VMs |
### Tier Details & Implementation Notes
1. **Compile-fail**
Uses **`trybuild`** crate (standard in Rust ecosystem).
Place intentional compile errors in `tests/ui/*.rs` with matching `*.stderr` expectation files.
One test function replaces the old custom loop:
```rust
#[test]
fn ui() {
    let t = trybuild::TestCases::new();
    t.compile_fail("tests/ui/*.rs");
}
```
2. **Unit**
Simple wrapper: `cargo test --lib --workspace -- --nocapture`
Consider `cargo-nextest` later for 2-3× speedup if test count grows.
3. **k3d**
- Provisions isolated cluster once at start (`k3d cluster create --agents 3 --no-lb --k3s-arg "--disable=traefik@server:0"`)
- Discovers examples via a `test-tier = "k3d"` key under `[package.metadata.harmony]` in each example's `Cargo.toml`
- Runs in parallel with tokio semaphore (default 5-8 slots)
- Each example gets its own namespace
- Uses `defer` / `scopeguard` for guaranteed cleanup
- Mocks Discord webhook and OPNSense API
4. **OKD**
Connects to pre-provisioned cluster via `KUBECONFIG`.
Validates it is actually OpenShift/OKD before proceeding.
5. **KVM**
Uses **Harmony's own KVM module** to provision test VMs (control-plane + workers + OPNSense).
→ True “dogfooding”: if the E2E fails, the KVM score itself is likely broken.
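
The k3d tier's combination of bounded parallelism and one namespace per example can be illustrated without any external crates. The sketch below swaps the tokio semaphore mentioned above for a fixed pool of standard-library threads pulling from a shared queue, which bounds concurrency in the same way. `namespace_for` (including its `e2e-` prefix) and `run_bounded` are hypothetical names, and the `job` closure stands in for deploying and checking one example.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

/// Derives a per-example namespace name (hypothetical naming scheme).
fn namespace_for(example: &str) -> String {
    format!("e2e-{}", example.replace('_', "-"))
}

/// Runs `job(example, namespace)` for every example, with at most
/// `max_parallel` jobs running at the same time. Results are sorted by name.
fn run_bounded<F>(examples: Vec<String>, max_parallel: usize, job: F) -> Vec<(String, bool)>
where
    F: Fn(&str, &str) -> bool + Send + Sync + 'static,
{
    let queue = Arc::new(Mutex::new(VecDeque::from(examples)));
    let results = Arc::new(Mutex::new(Vec::new()));
    let job = Arc::new(job);

    // One worker thread per slot; each worker drains the shared queue.
    let workers: Vec<_> = (0..max_parallel.max(1))
        .map(|_| {
            let queue = Arc::clone(&queue);
            let results = Arc::clone(&results);
            let job = Arc::clone(&job);
            thread::spawn(move || loop {
                // The queue lock is released before the job runs.
                let next = queue.lock().unwrap().pop_front();
                let Some(example) = next else { break };
                let passed = (*job)(&example, &namespace_for(&example));
                results.lock().unwrap().push((example, passed));
            })
        })
        .collect();

    for worker in workers {
        worker.join().expect("worker thread panicked");
    }

    let mut out = results.lock().unwrap().clone();
    out.sort();
    out
}
```

A real runner would additionally delete each namespace when its job finishes, for example from a guard's `Drop` implementation, which is the role `scopeguard` plays in the list above.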
## CI Integration Patterns
### Fast PR validation (GitHub Actions)
```yaml
name: Fast Tests
on: [push, pull_request]
jobs:
  fast:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      - name: Install Docker & k3d
        uses: nolar/setup-k3d-k3s@v1
      - run: cargo run -p harmony_e2e_tests -- --tier compile,unit,k3d --report junit.xml
      - uses: actions/upload-artifact@v4
        with: { name: test-results, path: junit.xml }
```
### Nightly / Merge heavy tests (self-hosted runner)
```yaml
name: Full E2E
on:
  schedule: [{ cron: "0 3 * * *" }]
  push: { branches: [main] }
jobs:
  full:
    runs-on: [self-hosted, linux, x64, kvm-capable]
    steps:
      - uses: actions/checkout@v4
      - run: cargo run -p harmony_e2e_tests -- --tier okd,kvm --verbose --report junit.xml
```
## Prerequisites Auto-Check & Install
```rust
// in harmony_e2e_tests/src/infrastructure/prerequisites.rs
async fn ensure_k3d() -> Result<()> { … } // curl | bash if missing
async fn ensure_docker() -> Result<()> { … }
fn check_kvm_support() -> Result<()> { … } // /dev/kvm + libvirt
```
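
The detection half of these checks can be sketched with the standard library alone. The example below only reports what is missing and does not install anything; `find_in_path`, `missing_tools` and `kvm_device_present` are hypothetical helper names, and checking for `/dev/kvm` is a Linux-specific assumption that does not verify libvirt.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Searches the directories listed in `path_var` for a file named `binary`.
fn find_in_path(binary: &str, path_var: &str) -> Option<PathBuf> {
    env::split_paths(path_var)
        .map(|dir| dir.join(binary))
        .find(|candidate| candidate.is_file())
}

/// Returns the names of required tools that could not be found on `PATH`.
fn missing_tools(required: &[&str]) -> Vec<String> {
    let path_var = env::var("PATH").unwrap_or_default();
    required
        .iter()
        .filter(|tool| find_in_path(tool, &path_var).is_none())
        .map(|tool| tool.to_string())
        .collect()
}

/// KVM can only be used if the device node exists (Linux-specific check).
fn kvm_device_present() -> bool {
    Path::new("/dev/kvm").exists()
}
```

A runner could call `missing_tools(&["docker", "k3d", "kubectl"])` before starting a tier and print the missing names instead of failing halfway through a deployment.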
## Success Criteria
### Step 1
- [ ] `harmony_e2e_tests` package created & basic CLI working
- [ ] trybuild compile-fail suite passing
- [ ] First 8-10 k3d examples running reliably in CI
- [ ] Mock server for Discord webhook completed
### Step 2
- [ ] All 22 k3d-compatible examples green
- [ ] OKD tier running on dedicated self-hosted runner
- [ ] JUnit reporting + GitHub check integration
- [ ] Namespace isolation + automatic retry on transient k8s errors
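
For the "automatic retry on transient k8s errors" item, one possible shape is a small helper that retries only errors classified as transient, with exponential backoff. This is a sketch under assumptions: `retry_transient` is a hypothetical name, and the `is_transient` predicate stands in for whatever classification of Kubernetes API errors the project settles on.

```rust
use std::thread;
use std::time::Duration;

/// Calls `op` until it succeeds, the error is not transient, or
/// `max_attempts` is reached. `op` receives the 1-based attempt number.
fn retry_transient<T, E>(
    max_attempts: u32,
    base_delay: Duration,
    is_transient: impl Fn(&E) -> bool,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            Err(err) if attempt < max_attempts && is_transient(&err) => {
                // Exponential backoff: base, 2x base, 4x base, ...
                thread::sleep(base_delay * 2u32.pow(attempt - 1));
                attempt += 1;
            }
            // Non-transient error, or attempts exhausted: give up.
            Err(err) => return Err(err),
        }
    }
}
```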
### Step 3
- [ ] KVM full E2E green on bare-metal runner (nightly)
- [ ] Multi-cluster examples (nats, multisite-postgres) automated
- [ ] Total fast CI time < 12 minutes on GitHub runners
- [ ] Documentation: “How to add a new tested example”
## Quick Start for New Contributors
```bash
# One-time setup
rustup update stable
cargo install cargo-nextest # optional but recommended; trybuild is a library crate, so it is added as a dev-dependency rather than installed
# Run locally (most common)
cargo run -p harmony_e2e_tests -- --tier k3d --verbose
# Just compile checks + unit
cargo test -p harmony_e2e_tests
```

1069
Cargo.lock generated

File diff suppressed because it is too large


@@ -1,8 +1,8 @@
[workspace]
resolver = "2"
members = [
"private_repos/*",
"examples/*",
"private_repos/*",
"harmony",
"harmony_types",
"harmony_macros",
@@ -16,13 +16,14 @@ members = [
"harmony_inventory_agent",
"harmony_secret_derive",
"harmony_secret",
"adr/agent_discovery/mdns",
"harmony_config_derive",
"harmony_config",
"brocade",
"harmony_agent",
"harmony_agent/deploy",
"harmony_node_readiness",
"harmony-k8s",
"harmony_e2e_tests",
"harmony_assets",
]
[workspace.package]
@@ -37,6 +38,7 @@ derive-new = "0.7"
async-trait = "0.1"
tokio = { version = "1.40", features = [
"io-std",
"io-util",
"fs",
"macros",
"rt-multi-thread",
@@ -73,6 +75,7 @@ base64 = "0.22.1"
tar = "0.4.44"
lazy_static = "1.5.0"
directories = "6.0.0"
futures-util = "0.3"
thiserror = "2.0.14"
serde = { version = "1.0.209", features = ["derive", "rc"] }
serde_json = "1.0.127"
@@ -86,3 +89,4 @@ reqwest = { version = "0.12", features = [
"json",
], default-features = false }
assertor = "0.0.4"
tokio-test = "0.4"

272
README.md

@@ -1,101 +1,121 @@
# Harmony
Open-source infrastructure orchestration that treats your platform like first-class code.
**Infrastructure orchestration that treats your platform like first-class code.**
In other words, Harmony is a **next-generation platform engineering framework**.
Harmony is an open-source framework that brings the rigor of software engineering to infrastructure management. Write Rust code to define what you want, and Harmony handles the rest — from local development to production clusters.
_By [NationTech](https://nationtech.io)_
[![Build](https://git.nationtech.io/NationTech/harmony/actions/workflows/check.yml/badge.svg)](https://git.nationtech.io/nationtech/harmony)
[![Build](https://git.nationtech.io/NationTech/harmony/actions/workflows/check.yml/badge.svg)](https://git.nationtech.io/NationTech/harmony)
[![License](https://img.shields.io/badge/license-AGPLv3-blue?style=flat-square)](LICENSE)
### Unify
---
- **Project Scaffolding**
- **Infrastructure Provisioning**
- **Application Deployment**
- **Day-2 operations**
## The Problem Harmony Solves
All in **one strongly-typed Rust codebase**.
Modern infrastructure is messy. Your Kubernetes cluster needs monitoring. Your bare-metal servers need provisioning. Your applications need deployments. Each comes with its own tooling, its own configuration format, and its own failure modes.
### Deploy anywhere
**What if you could describe your entire platform in one consistent language?**
From a **developer laptop** to a **global production cluster**, a single **source of truth** drives the **full software lifecycle.**
That's Harmony. It unifies project scaffolding, infrastructure provisioning, application deployment, and day-2 operations into a single strongly-typed Rust codebase.
## The Harmony Philosophy
---
Infrastructure is essential, but it shouldn't be your core business. Harmony is built on three guiding principles that make modern platforms reliable, repeatable, and easy to reason about.
## Three Principles That Make the Difference
| Principle | What it means for you |
| -------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **Infrastructure as Resilient Code** | Replace sprawling YAML and bash scripts with type-safe Rust. Test, refactor, and version your platform just like application code. |
| **Prove It Works Before You Deploy** | Harmony uses the compiler to verify that your application's needs match the target environment's capabilities at **compile-time**, eliminating an entire class of runtime outages. |
| **One Unified Model** | Software and infrastructure are a single system. Harmony models them together, enabling deep automation—from bare-metal servers to Kubernetes workloads—with zero context switching. |
| Principle | What It Means |
|-----------|---------------|
| **Infrastructure as Resilient Code** | Stop fighting with YAML and bash. Write type-safe Rust that you can test, version, and refactor like any other code. |
| **Prove It Works Before You Deploy** | Harmony verifies at _compile time_ that your application can actually run on your target infrastructure. No more "the config looks right but it doesn't work" surprises. |
| **One Unified Model** | Software and infrastructure are one system. Deploy from laptop to production cluster without switching contexts or tools. |
These principles surface as simple, ergonomic Rust APIs that let teams focus on their product while trusting the platform underneath.
---
## Where to Start
## How It Works: The Core Concepts
We have a comprehensive set of documentation right here in the repository.
Harmony is built around three concepts that work together:
| I want to... | Start Here |
| ----------------- | ------------------------------------------------------------------ |
| Get Started | [Getting Started Guide](./docs/guides/getting-started.md) |
| See an Example | [Use Case: Deploy a Rust Web App](./docs/use-cases/rust-webapp.md) |
| Explore | [Documentation Hub](./docs/README.md) |
| See Core Concepts | [Core Concepts Explained](./docs/concepts.md) |
### Score — "What You Want"
## Quick Look: Deploy a Rust Webapp
A `Score` is a declarative description of desired state. Think of it as a "recipe" that says _what_ you want without specifying _how_ to get there.
The snippet below spins up a complete **production-grade Rust + Leptos Webapp** with monitoring. Swap it for your own scores to deploy anything from microservices to machine-learning pipelines.
```rust
// "I want a PostgreSQL cluster running with default settings"
let postgres = PostgreSQLScore {
config: PostgreSQLConfig {
cluster_name: "harmony-postgres-example".to_string(),
namespace: "harmony-postgres-example".to_string(),
..Default::default()
},
};
```
### Topology — "Where It Goes"
A `Topology` represents your infrastructure environment and its capabilities. It answers the question: "What can this environment actually do?"
```rust
// Deploy to a local K3D cluster, or any Kubernetes cluster via environment variables
K8sAnywhereTopology::from_env()
```
### Interpret — "How It Happens"
An `Interpret` is the execution logic that connects your `Score` to your `Topology`. It translates "what you want" into "what the infrastructure does."
**The Compile-Time Check:** Before your code ever runs, Harmony verifies that your `Score` is compatible with your `Topology`. If your application needs a feature your infrastructure doesn't provide, you get a compile error — not a runtime failure.
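To make the compile-time check concrete, here is a deliberately simplified sketch of the mechanism: a Score is only implemented for topologies that provide the capability it needs, so pairing it with any other topology is rejected by the compiler. The traits and types below (`PostgresCapable`, `LocalK8s`, `SwitchOnly`, `PostgresScore`, `run`) are hypothetical stand-ins, not Harmony's actual API.

```rust
/// Every environment is a topology.
trait Topology {
    fn name(&self) -> &str;
}

/// Hypothetical capability: this environment can host a PostgreSQL cluster.
trait PostgresCapable {
    fn deploy_postgres(&self, cluster_name: &str) -> String;
}

/// A Kubernetes-like environment that offers the capability.
struct LocalK8s;
/// An environment that does not (for example, a bare network switch).
struct SwitchOnly;

impl Topology for LocalK8s {
    fn name(&self) -> &str { "local-k8s" }
}
impl Topology for SwitchOnly {
    fn name(&self) -> &str { "switch-only" }
}
impl PostgresCapable for LocalK8s {
    fn deploy_postgres(&self, cluster_name: &str) -> String {
        format!("{}: postgres cluster {cluster_name} deployed", self.name())
    }
}

/// "What you want", parameterized by where it can run.
trait Score<T: Topology> {
    fn interpret(&self, topology: &T) -> String;
}

struct PostgresScore {
    cluster_name: String,
}

// The bound is the whole trick: this impl exists only for capable topologies.
impl<T: Topology + PostgresCapable> Score<T> for PostgresScore {
    fn interpret(&self, topology: &T) -> String {
        topology.deploy_postgres(&self.cluster_name)
    }
}

/// Runs a score against a topology; the pairing is checked when this call is compiled.
fn run<T: Topology, S: Score<T>>(score: &S, topology: &T) -> String {
    score.interpret(topology)
}
```

With these definitions, `run(&score, &LocalK8s)` compiles, while `run(&score, &SwitchOnly)` is rejected with an unsatisfied trait bound (`SwitchOnly: PostgresCapable`) before anything is deployed.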
---
## What You Can Deploy
Harmony ships with ready-made Scores for:
**Data Services**
- PostgreSQL clusters (via CloudNativePG operator)
- Multi-site PostgreSQL with failover
**Kubernetes**
- Namespaces, Deployments, Ingress
- Helm charts
- cert-manager for TLS
- Monitoring (Prometheus, alerting, ntfy)
**Bare Metal / Infrastructure**
- OKD clusters from scratch
- OPNsense firewalls
- Network services (DNS, DHCP, TFTP)
- Brocade switch configuration
**And more:** application deployment, tenant management, and load balancing.
---
## Quick Start: Deploy a PostgreSQL Cluster
This example provisions a local Kubernetes cluster (K3D) and deploys a PostgreSQL cluster on it — no external infrastructure required.
```rust
use harmony::{
inventory::Inventory,
modules::{
application::{
ApplicationScore, RustWebFramework, RustWebapp,
features::{PackagingDeployment, rhob_monitoring::Monitoring},
},
monitoring::alert_channel::discord_alert_channel::DiscordWebhook,
},
modules::postgresql::{PostgreSQLScore, capability::PostgreSQLConfig},
topology::K8sAnywhereTopology,
};
use harmony_macros::hurl;
use std::{path::PathBuf, sync::Arc};
#[tokio::main]
async fn main() {
let application = Arc::new(RustWebapp {
name: "harmony-example-leptos".to_string(),
project_root: PathBuf::from(".."), // <== Your project root, usually .. if you use the standard `/harmony` folder
framework: Some(RustWebFramework::Leptos),
service_port: 8080,
});
// Define your Application deployment and the features you want
let app = ApplicationScore {
features: vec![
Box::new(PackagingDeployment {
application: application.clone(),
}),
Box::new(Monitoring {
application: application.clone(),
alert_receiver: vec![
Box::new(DiscordWebhook {
name: "test-discord".to_string(),
url: hurl!("https://discord.doesnt.exist.com"), // <== Get your discord webhook url
}),
],
}),
],
application,
let postgres = PostgreSQLScore {
config: PostgreSQLConfig {
cluster_name: "harmony-postgres-example".to_string(),
namespace: "harmony-postgres-example".to_string(),
..Default::default()
},
};
harmony_cli::run(
Inventory::autoload(),
K8sAnywhereTopology::from_env(), // <== Deploy to local automatically provisioned local k3d by default or connect to any kubernetes cluster
vec![Box::new(app)],
K8sAnywhereTopology::from_env(),
vec![Box::new(postgres)],
None,
)
.await
@@ -103,40 +123,128 @@ async fn main() {
}
```
To run this:
### What this actually does
- Clone the repository: `git clone https://git.nationtech.io/nationtech/harmony`
- Install dependencies: `cargo build --release`
- Run the example: `cargo run --example try_rust_webapp`
When you compile and run this program:
1. **Compiles** the Harmony Score into an executable
2. **Connects** to `K8sAnywhereTopology` — which auto-provisions a local K3D cluster if none exists
3. **Installs** the CloudNativePG operator into the cluster (one-time setup)
4. **Creates** a PostgreSQL cluster with 1 instance and 1 GiB of storage
5. **Exposes** the PostgreSQL instance as a Kubernetes Service
### Prerequisites
- [Rust](https://rust-lang.org/tools/install) (edition 2024)
- [Docker](https://docs.docker.com/get-docker/) (for the local K3D cluster)
- [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/) (optional, for inspecting the cluster)
### Run it
```bash
# Clone the repository
git clone https://git.nationtech.io/nationtech/harmony
cd harmony
# Build the project
cargo build --release
# Run the example
cargo run -p example-postgresql
```
Harmony will print its progress as it sets up the cluster and deploys PostgreSQL. When complete, you can inspect the deployment:
```bash
kubectl get pods -n harmony-postgres-example
kubectl get secret -n harmony-postgres-example harmony-postgres-example-db-user -o jsonpath='{.data.password}' | base64 -d
```
To connect to the database, forward the port:
```bash
kubectl port-forward -n harmony-postgres-example svc/harmony-postgres-example-rw 5432:5432
psql -h localhost -p 5432 -U postgres
```
To clean up, delete the K3D cluster:
```bash
k3d cluster delete harmony-postgres-example
```
---
## Environment Variables
`K8sAnywhereTopology::from_env()` reads the following environment variables to determine where and how to connect:
| Variable | Default | Description |
|----------|---------|-------------|
| `KUBECONFIG` | `~/.kube/config` | Path to your kubeconfig file |
| `HARMONY_AUTOINSTALL` | `true` | Auto-provision a local K3D cluster if none found |
| `HARMONY_USE_LOCAL_K3D` | `true` | Always prefer local K3D over remote clusters |
| `HARMONY_PROFILE` | `dev` | Deployment profile: `dev`, `staging`, or `prod` |
| `HARMONY_K8S_CONTEXT` | _none_ | Use a specific kubeconfig context |
| `HARMONY_PUBLIC_DOMAIN` | _none_ | Public domain for ingress endpoints |
To connect to an existing Kubernetes cluster instead of provisioning K3D:
```bash
# Point to your kubeconfig
export KUBECONFIG=/path/to/your/kubeconfig
export HARMONY_USE_LOCAL_K3D=false
export HARMONY_AUTOINSTALL=false
# Then run
cargo run -p example-postgresql
```
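How `from_env()` parses these variables internally is not shown here. As a minimal sketch of reading them with the defaults from the table, assuming a hypothetical `K8sEnvSettings` struct and assuming that `1`/`0` are accepted alongside `true`/`false`:

```rust
/// Hypothetical settings struct mirroring part of the table above.
#[derive(Debug, PartialEq)]
struct K8sEnvSettings {
    kubeconfig: Option<String>,
    autoinstall: bool,
    use_local_k3d: bool,
    profile: String,
}

/// Parses a boolean flag, falling back to `default` when unset or unrecognized.
fn parse_flag(raw: Option<&str>, default: bool) -> bool {
    let normalized = raw.map(|v| v.trim().to_ascii_lowercase());
    match normalized.as_deref() {
        Some("true") | Some("1") => true,
        Some("false") | Some("0") => false,
        _ => default,
    }
}

/// Builds the settings from any variable lookup, so the logic can be tested
/// without mutating the process environment.
fn settings_from(lookup: impl Fn(&str) -> Option<String>) -> K8sEnvSettings {
    K8sEnvSettings {
        kubeconfig: lookup("KUBECONFIG"),
        autoinstall: parse_flag(lookup("HARMONY_AUTOINSTALL").as_deref(), true),
        use_local_k3d: parse_flag(lookup("HARMONY_USE_LOCAL_K3D").as_deref(), true),
        profile: lookup("HARMONY_PROFILE").unwrap_or_else(|| "dev".to_string()),
    }
}
```

In a real program the lookup would be `settings_from(|key| std::env::var(key).ok())`.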
---
## Documentation
All documentation is in the `/docs` directory.
| I want to... | Start here |
|--------------|------------|
| Understand the core concepts | [Core Concepts](./docs/concepts.md) |
| Deploy my first application | [Getting Started Guide](./docs/guides/getting-started.md) |
| Explore available components | [Scores Catalog](./docs/catalogs/scores.md) · [Topologies Catalog](./docs/catalogs/topologies.md) |
| See a complete bare-metal deployment | [OKD on Bare Metal](./docs/use-cases/okd-on-bare-metal.md) |
| Build my own Score or Topology | [Developer Guide](./docs/guides/developer-guide.md) |
- [Documentation Hub](./docs/README.md): The main entry point for all documentation.
- [Core Concepts](./docs/concepts.md): A detailed look at Score, Topology, Capability, Inventory, and Interpret.
- [Component Catalogs](./docs/catalogs/README.md): Discover all available Scores, Topologies, and Capabilities.
- [Developer Guide](./docs/guides/developer-guide.md): Learn how to write your own Scores and Topologies.
---
## Architectural Decision Records
## Why Rust?
- [ADR-001 · Why Rust](adr/001-rust.md)
- [ADR-003 · Infrastructure Abstractions](adr/003-infrastructure-abstractions.md)
- [ADR-006 · Secret Management](adr/006-secret-management.md)
- [ADR-011 · Multi-Tenant Cluster](adr/011-multi-tenant-cluster.md)
We chose Rust for the same reason you might: **reliability through type safety**.
## Contribute
Infrastructure code runs in production. It needs to be correct. Rust's ownership model and type system let us build a framework where:
Discussions and roadmap live in [Issues](https://git.nationtech.io/nationtech/harmony/-/issues). PRs, ideas, and feedback are welcome!
- Invalid configurations fail at compile time, not at 3 AM
- Refactoring infrastructure is as safe as refactoring application code
- The compiler verifies that your platform can actually fulfill your requirements
See [ADR-001 · Why Rust](./adr/001-rust.md) for our full rationale.
---
## Architecture Decisions
Harmony's design is documented through Architecture Decision Records (ADRs):
- [ADR-001 · Why Rust](./adr/001-rust.md)
- [ADR-003 · Infrastructure Abstractions](./adr/003-infrastructure-abstractions.md)
- [ADR-006 · Secret Management](./adr/006-secret-management.md)
- [ADR-011 · Multi-Tenant Cluster](./adr/011-multi-tenant-cluster.md)
---
## License
Harmony is released under the **GNU AGPL v3**.
> We choose a strong copyleft license to ensure the project—and every improvement to it—remains open and benefits the entire community. Fork it, enhance it, even out-innovate us; just keep it open.
> We choose a strong copyleft license to ensure the project—and every improvement to it—remains open and benefits the entire community.
See [LICENSE](LICENSE) for the full text.
---
_Made with ❤️ & 🦀 by the NationTech and the Harmony community_
_Made with ❤️ & 🦀 by NationTech and the Harmony community_


@@ -1,318 +0,0 @@
# Architecture Decision Record: Monitoring and Alerting Architecture
Initial Author: Willem Rolleman, Jean-Gabriel Carrier
Initial Date: March 9, 2026
Last Updated Date: March 9, 2026
## Status
Accepted
Supersedes: [ADR-010](010-monitoring-and-alerting.md)
## Context
Harmony needs a unified approach to monitoring and alerting across different infrastructure targets:
1. **Cluster-level monitoring**: Administrators managing entire Kubernetes/OKD clusters need to define cluster-wide alerts, receivers, and scrape targets.
2. **Tenant-level monitoring**: Multi-tenant clusters where teams are confined to namespaces need monitoring scoped to their resources.
3. **Application-level monitoring**: Developers deploying applications want zero-config monitoring that "just works" for their services.
The monitoring landscape is fragmented:
- **OKD/OpenShift**: Built-in Prometheus with AlertmanagerConfig CRDs
- **KubePrometheus**: Helm-based stack with PrometheusRule CRDs
- **RHOB (Red Hat Observability)**: Operator-based with MonitoringStack CRDs
- **Standalone Prometheus**: Raw Prometheus deployments
Each system has different CRDs, different installation methods, and different configuration APIs.
## Decision
We implement a **trait-based architecture with compile-time capability verification** that provides:
1. **Type-safe abstractions** via parameterized traits: `AlertReceiver<S>`, `AlertRule<S>`, `ScrapeTarget<S>`
2. **Compile-time topology compatibility** via the `Observability<S>` capability bound
3. **Three levels of abstraction**: Cluster, Tenant, and Application monitoring
4. **Pre-built alert rules** as functions that return typed structs
### Core Traits
```rust
// domain/topology/monitoring.rs
/// Marker trait for systems that send alerts (Prometheus, etc.)
pub trait AlertSender: Send + Sync + std::fmt::Debug {
fn name(&self) -> String;
}
/// Defines how a receiver (Discord, Slack, etc.) builds its configuration
/// for a specific sender type
pub trait AlertReceiver<S: AlertSender>: std::fmt::Debug + Send + Sync {
fn build(&self) -> Result<ReceiverInstallPlan, InterpretError>;
fn name(&self) -> String;
fn clone_box(&self) -> Box<dyn AlertReceiver<S>>;
}
/// Defines how an alert rule builds its PrometheusRule configuration
pub trait AlertRule<S: AlertSender>: std::fmt::Debug + Send + Sync {
fn build_rule(&self) -> Result<serde_json::Value, InterpretError>;
fn name(&self) -> String;
fn clone_box(&self) -> Box<dyn AlertRule<S>>;
}
/// Capability that topologies implement to support monitoring
pub trait Observability<S: AlertSender> {
async fn install_alert_sender(&self, sender: &S, inventory: &Inventory)
-> Result<PreparationOutcome, PreparationError>;
async fn install_receivers(&self, sender: &S, inventory: &Inventory,
receivers: Option<Vec<Box<dyn AlertReceiver<S>>>>) -> Result<...>;
async fn install_rules(&self, sender: &S, inventory: &Inventory,
rules: Option<Vec<Box<dyn AlertRule<S>>>>) -> Result<...>;
async fn add_scrape_targets(&self, sender: &S, inventory: &Inventory,
scrape_targets: Option<Vec<Box<dyn ScrapeTarget<S>>>>) -> Result<...>;
async fn ensure_monitoring_installed(&self, sender: &S, inventory: &Inventory)
-> Result<...>;
}
```
### Alert Sender Types
Each monitoring stack is a distinct `AlertSender`:
| Sender | Module | Use Case |
|--------|--------|----------|
| `OpenshiftClusterAlertSender` | `monitoring/okd/` | OKD/OpenShift built-in monitoring |
| `KubePrometheus` | `monitoring/kube_prometheus/` | Helm-deployed kube-prometheus-stack |
| `Prometheus` | `monitoring/prometheus/` | Standalone Prometheus via Helm |
| `RedHatClusterObservability` | `monitoring/red_hat_cluster_observability/` | RHOB operator |
| `Grafana` | `monitoring/grafana/` | Grafana-managed alerting |
### Three Levels of Monitoring
#### 1. Cluster-Level Monitoring
For cluster administrators. Full control over monitoring infrastructure.
```rust
// examples/okd_cluster_alerts/src/main.rs
OpenshiftClusterAlertScore {
sender: OpenshiftClusterAlertSender,
receivers: vec![Box::new(DiscordReceiver { ... })],
rules: vec![Box::new(alert_rules)],
scrape_targets: Some(vec![Box::new(external_exporters)]),
}
```
**Characteristics:**
- Cluster-scoped CRDs and resources
- Can add external scrape targets (outside cluster)
- Manages Alertmanager configuration
- Requires cluster-admin privileges
#### 2. Tenant-Level Monitoring
For teams confined to namespaces. The topology determines tenant context.
```rust
// The topology's Observability impl handles namespace scoping
impl Observability<KubePrometheus> for K8sAnywhereTopology {
async fn install_rules(&self, sender: &KubePrometheus, ...) {
// Topology knows if it's tenant-scoped
let namespace = self.get_tenant_config().await
.map(|t| t.name)
.unwrap_or_else(|| "default".to_string());
// Install rules in tenant namespace
}
}
```
**Characteristics:**
- Namespace-scoped resources
- Cannot modify cluster-level monitoring config
- May have restricted receiver types
- Runtime validation of permissions (cannot be fully compile-time)
#### 3. Application-Level Monitoring
For developers. Zero-config, opinionated monitoring.
```rust
// modules/application/features/monitoring.rs
pub struct Monitoring {
pub application: Arc<dyn Application>,
pub alert_receiver: Vec<Box<dyn AlertReceiver<Prometheus>>>,
}
impl<T: Topology + Observability<Prometheus> + TenantManager + ...>
ApplicationFeature<T> for Monitoring
{
async fn ensure_installed(&self, topology: &T) -> Result<...> {
// Auto-creates ServiceMonitor
// Auto-installs Ntfy for notifications
// Handles tenant namespace automatically
// Wires up sensible defaults
}
}
```
**Characteristics:**
- Automatic ServiceMonitor creation
- Opinionated notification channel (Ntfy)
- Tenant-aware via topology
- Minimal configuration required
## Rationale
### Why Generic Traits Instead of Unified Types?
Each monitoring stack (OKD, KubePrometheus, RHOB) has fundamentally different CRDs:
```rust
// OKD uses AlertmanagerConfig with different structure
AlertmanagerConfig { spec: { receivers: [...] } }
// RHOB uses secret references for webhook URLs
MonitoringStack { spec: { alertmanagerConfig: { discordConfigs: [{ apiURL: { key: "..." } }] } } }
// KubePrometheus uses Alertmanager CRD with different field names
Alertmanager { spec: { config: { receivers: [...] } } }
```
A unified type would either:
1. Be a lowest-common-denominator (loses stack-specific features)
2. Be a complex union type (hard to use, easy to misconfigure)
Generic traits let each stack express its configuration naturally while providing a consistent interface.
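A stripped-down sketch of that idea: the same receiver implements the trait once per sender and produces a different configuration for each. The traits are reduced to one method, and the rendered strings only gesture at the shapes described above (an inline URL versus a secret reference); they are not real CRD output. The `install` helper and the secret name are hypothetical.

```rust
/// Reduced version of the sender marker trait.
trait AlertSender {
    fn name(&self) -> String;
}

struct KubePrometheus;
struct RedHatClusterObservability;

impl AlertSender for KubePrometheus {
    fn name(&self) -> String { "KubePrometheus".to_string() }
}
impl AlertSender for RedHatClusterObservability {
    fn name(&self) -> String { "RedHatClusterObservability".to_string() }
}

/// Reduced version of the receiver trait: one configuration per sender type.
trait AlertReceiver<S: AlertSender> {
    fn build(&self) -> String;
}

struct DiscordReceiver {
    name: String,
    url: String,
}

// Stack-specific shape 1: the webhook URL is embedded in the receiver config.
impl AlertReceiver<KubePrometheus> for DiscordReceiver {
    fn build(&self) -> String {
        format!("receivers: [{{ name: {}, webhook_url: {} }}]", self.name, self.url)
    }
}

// Stack-specific shape 2: the webhook URL is referenced through a secret key.
impl AlertReceiver<RedHatClusterObservability> for DiscordReceiver {
    fn build(&self) -> String {
        format!("discordConfigs: [{{ apiURL: {{ key: {}-webhook }} }}]", self.name)
    }
}

/// The sender type selects which implementation is used, at compile time.
fn install<S: AlertSender, R: AlertReceiver<S>>(sender: &S, receiver: &R) -> String {
    format!("{} <- {}", sender.name(), receiver.build())
}
```

Calling `install(&KubePrometheus, &discord)` and `install(&RedHatClusterObservability, &discord)` with the same `discord` value yields two different configurations. A receiver with no impl for a given sender fails to compile.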
### Why Compile-Time Capability Bounds?
```rust
impl<T: Topology + Observability<OpenshiftClusterAlertSender>> Score<T>
for OpenshiftClusterAlertScore { ... }
```
This fails at compile time if you try to use `OpenshiftClusterAlertScore` with a topology that doesn't support OKD monitoring. This prevents the "config-is-valid-but-platform-is-wrong" errors that Harmony was designed to eliminate.
### Why Not a MonitoringStack Abstraction (V2 Approach)?
The V2 approach proposed a unified `MonitoringStack` that hides sender selection:
```rust
// V2 approach - rejected
MonitoringStack::new(MonitoringApiVersion::V2CRD)
.add_alert_channel(discord)
```
**Problems:**
1. Hides which sender you're using, losing compile-time guarantees
2. "Version selection" actually chooses between fundamentally different systems
3. Would need to handle all stack-specific features through a generic interface
The current approach is explicit: you choose `OpenshiftClusterAlertSender` and the compiler verifies compatibility.
### Why Runtime Validation for Tenants?
Tenant confinement is determined at runtime by the topology and K8s RBAC. We cannot know at compile time whether a user has cluster-admin or namespace-only access.
Options considered:
1. **Compile-time tenant markers** - Would require modeling entire RBAC hierarchy in types. Over-engineering.
2. **Runtime validation** - Current approach. Fails with clear K8s permission errors if insufficient access.
3. **No tenant support** - Would exclude a major use case.
Runtime validation is the pragmatic choice. The failure mode is clear (K8s API error) and occurs early in execution.
> Note: we will eventually have compile-time validation for such cases. Rust macros are powerful, and we could discover the actual capabilities we are dealing with at build time, similar to sqlx's approach in its `query!` macros.
## Consequences
### Pros
1. **Type Safety**: Invalid configurations are caught at compile time
2. **Extensibility**: Adding a new monitoring stack requires implementing traits, not modifying core code
3. **Clear Separation**: Cluster/Tenant/Application levels have distinct entry points
4. **Reusable Rules**: Pre-built alert rules as functions (`high_pvc_fill_rate_over_two_days()`)
5. **CRD Accuracy**: Type definitions match actual Kubernetes CRDs exactly
### Cons
1. **Implementation Explosion**: `DiscordReceiver` implements `AlertReceiver<S>` for each sender type (3+ implementations)
2. **Learning Curve**: Understanding the trait hierarchy takes time
3. **clone_box Boilerplate**: Required for trait object cloning (3 lines per impl)
### Mitigations
- Implementation explosion is contained: each receiver type has O(senders) implementations, but receivers are rare compared to rules
- Learning curve is documented with examples at each level
- clone_box boilerplate is minimal and copy-paste
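For readers unfamiliar with the `clone_box` pattern mentioned above, here is a minimal sketch of why it exists: `Clone` cannot be a supertrait of an object-safe trait, so each implementor provides a method that clones itself into a new box. The trait is simplified (no sender parameter) and `PvcFillRate` is a hypothetical rule type.

```rust
/// Simplified rule trait; the real one is parameterized by sender type.
trait AlertRule: std::fmt::Debug {
    fn name(&self) -> String;
    /// Clones the concrete value behind the trait object.
    fn clone_box(&self) -> Box<dyn AlertRule>;
}

/// Lets `Box<dyn AlertRule>` (and therefore `Vec<Box<dyn AlertRule>>`) be cloned.
impl Clone for Box<dyn AlertRule> {
    fn clone(&self) -> Self {
        self.clone_box()
    }
}

/// Hypothetical rule type for illustration.
#[derive(Debug, Clone)]
struct PvcFillRate {
    threshold_percent: u8,
}

impl AlertRule for PvcFillRate {
    fn name(&self) -> String {
        format!("PvcFillRateOver{}", self.threshold_percent)
    }

    // The boilerplate in question: the same three lines for every implementor.
    fn clone_box(&self) -> Box<dyn AlertRule> {
        Box::new(self.clone())
    }
}

/// Duplicates a rule set, for example to install it for a second tenant.
fn duplicate_rules(rules: &[Box<dyn AlertRule>]) -> Vec<Box<dyn AlertRule>> {
    rules.to_vec()
}
```

The cost is one small method per implementation, in exchange for being able to clone scores that hold boxed rules or receivers.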
## Alternatives Considered
### Unified MonitoringStack Type
See "Why Not a MonitoringStack Abstraction" above. Rejected for losing compile-time safety.
### Helm-Only Approach
Use `HelmScore` directly for each monitoring deployment. Rejected because:
- No type safety for alert rules
- Cannot compose with application features
- No tenant awareness
### Separate Modules Per Use Case
Have `cluster_monitoring/`, `tenant_monitoring/`, `app_monitoring/` as separate modules. Rejected because:
- Massive code duplication
- No shared abstraction for receivers/rules
- Adding a feature requires three implementations
## Implementation Notes
### Module Structure
```
modules/monitoring/
├── mod.rs # Public exports
├── alert_channel/ # Receivers (Discord, Webhook)
├── alert_rule/ # Rules and pre-built alerts
│ ├── prometheus_alert_rule.rs
│ └── alerts/ # Library of pre-built rules
│ ├── k8s/ # K8s-specific (pvc, pod, memory)
│ └── infra/ # Infrastructure (opnsense, dell)
├── okd/ # OpenshiftClusterAlertSender
├── kube_prometheus/ # KubePrometheus
├── prometheus/ # Prometheus
├── red_hat_cluster_observability/ # RHOB
├── grafana/ # Grafana
├── application_monitoring/ # Application-level scores
└── scrape_target/ # External scrape targets
```
### Adding a New Alert Sender
1. Create sender type: `pub struct MySender; impl AlertSender for MySender { ... }`
2. Implement `Observability<MySender>` for topologies that support it
3. Create CRD types in `crd/` subdirectory
4. Implement `AlertReceiver<MySender>` for existing receivers
5. Implement `AlertRule<MySender>` for `AlertManagerRuleGroup`
### Adding a New Alert Rule
```rust
pub fn my_custom_alert() -> PrometheusAlertRule {
PrometheusAlertRule::new("MyAlert", "up == 0")
.for_duration("5m")
.label("severity", "critical")
.annotation("summary", "Service is down")
}
```
No trait implementation needed - `AlertManagerRuleGroup` already handles conversion.
## Related ADRs
- [ADR-013](013-monitoring-notifications.md): Notification channel selection (ntfy)
- [ADR-011](011-multi-tenant-cluster.md): Multi-tenant cluster architecture


@@ -1,21 +0,0 @@
[package]
name = "example-monitoring-v2"
edition = "2024"
version.workspace = true
readme.workspace = true
license.workspace = true
[dependencies]
harmony = { path = "../../harmony" }
harmony_cli = { path = "../../harmony_cli" }
harmony-k8s = { path = "../../harmony-k8s" }
harmony_types = { path = "../../harmony_types" }
kube = { workspace = true }
schemars = "0.8"
serde = { workspace = true, features = ["derive"] }
serde_json = { workspace = true }
serde_yaml = { workspace = true }
url = { workspace = true }
log = { workspace = true }
async-trait = { workspace = true }
k8s-openapi = { workspace = true }


@@ -1,91 +0,0 @@
# Monitoring v2 - Improved Architecture
This example demonstrates the improved monitoring architecture that addresses the "WTF/minute" issues in the original design.
## Key Improvements
### 1. **Single AlertChannel Trait with Generic Sender**
The original design required 9-12 implementations for each alert channel (Discord, Webhook, etc.) - one for each sender type. The new design uses a single trait with generic sender parameterization:
pub trait AlertChannel<Sender: AlertSender> {
async fn install_config(&self, sender: &Sender) -> Result<Outcome, InterpretError>;
fn name(&self) -> String;
fn as_any(&self) -> &dyn std::any::Any;
}
**Benefits:**
- One Discord implementation works with all sender types
- Type safety at compile time
- No runtime dispatch overhead
### 2. **MonitoringStack Abstraction**
Instead of manually selecting CRDPrometheus vs KubePrometheus vs RHOBObservability, you now have a unified MonitoringStack that handles versioning:
let monitoring_stack = MonitoringStack::new(MonitoringApiVersion::V2CRD)
.set_namespace("monitoring")
.add_alert_channel(discord_receiver)
.set_scrape_targets(vec![...]);
**Benefits:**
- Single source of truth for monitoring configuration
- Easy to switch between monitoring versions
- Automatic version-specific configuration
### 3. **TenantMonitoringScore - True Composition**
The original monitoring_with_tenant example just put tenant and monitoring as separate items in a vec. The new design truly composes them:
let tenant_score = TenantMonitoringScore::new("test-tenant", monitoring_stack);
This creates a single score that:
- Has tenant context
- Has monitoring configuration
- Automatically installs monitoring scoped to tenant namespace
**Benefits:**
- No more "two separate things" confusion
- Automatic tenant namespace scoping
- Clear ownership: tenant owns its monitoring
### 4. **Versioned Monitoring APIs**
Clear versioning makes it obvious which monitoring stack you're using:
pub enum MonitoringApiVersion {
V1Helm, // Old Helm charts
V2CRD, // Current CRDs
V3RHOB, // RHOB (future)
}
**Benefits:**
- No guessing which API version you're using
- Easy to migrate between versions
- Backward compatibility path
## Comparison
### Original Design (monitoring_with_tenant)
- Manual selection of each component
- Manual installation of both components
- Need to remember to pass both to harmony_cli::run
- Monitoring not scoped to tenant automatically
### New Design (monitoring_v2)
- Single composed score
- One score does it all
## Usage
cd examples/monitoring_v2
cargo run
## Migration Path
To migrate from the old design to the new:
1. Replace individual alert channel implementations with AlertChannel<Sender>
2. Use MonitoringStack instead of manual *Prometheus selection
3. Use TenantMonitoringScore instead of separate TenantScore + monitoring scores
4. Select monitoring version via MonitoringApiVersion


@@ -1,343 +0,0 @@
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use log::debug;
use serde::{Deserialize, Serialize};
use serde_yaml::{Mapping, Value};
use harmony::data::Version;
use harmony::interpret::{Interpret, InterpretError, InterpretName, InterpretStatus, Outcome};
use harmony::inventory::Inventory;
use harmony::score::Score;
use harmony::topology::{Topology, tenant::TenantManager};
use harmony_k8s::K8sClient;
use harmony_types::k8s_name::K8sName;
use harmony_types::net::Url;
pub trait AlertSender: Send + Sync + std::fmt::Debug {
fn name(&self) -> String;
fn namespace(&self) -> String;
}
#[derive(Debug)]
pub struct CRDPrometheus {
pub namespace: String,
pub client: Arc<K8sClient>,
}
impl AlertSender for CRDPrometheus {
fn name(&self) -> String {
"CRDPrometheus".to_string()
}
fn namespace(&self) -> String {
self.namespace.clone()
}
}
#[derive(Debug)]
pub struct RHOBObservability {
pub namespace: String,
pub client: Arc<K8sClient>,
}
impl AlertSender for RHOBObservability {
fn name(&self) -> String {
"RHOBObservability".to_string()
}
fn namespace(&self) -> String {
self.namespace.clone()
}
}
#[derive(Debug)]
pub struct KubePrometheus {
pub config: Arc<Mutex<KubePrometheusConfig>>,
}
impl Default for KubePrometheus {
fn default() -> Self {
Self::new()
}
}
impl KubePrometheus {
pub fn new() -> Self {
Self {
config: Arc::new(Mutex::new(KubePrometheusConfig::new())),
}
}
}
impl AlertSender for KubePrometheus {
fn name(&self) -> String {
"KubePrometheus".to_string()
}
fn namespace(&self) -> String {
self.config.lock().unwrap().namespace.clone().unwrap_or_else(|| "monitoring".to_string())
}
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct KubePrometheusConfig {
pub namespace: Option<String>,
#[serde(skip)]
pub alert_receiver_configs: Vec<AlertManagerChannelConfig>,
}
impl KubePrometheusConfig {
pub fn new() -> Self {
Self {
namespace: None,
alert_receiver_configs: Vec::new(),
}
}
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AlertManagerChannelConfig {
pub channel_receiver: serde_yaml::Value,
pub channel_route: serde_yaml::Value,
}
impl Default for AlertManagerChannelConfig {
fn default() -> Self {
Self {
channel_receiver: serde_yaml::Value::Mapping(Default::default()),
channel_route: serde_yaml::Value::Mapping(Default::default()),
}
}
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ScrapeTargetConfig {
pub service_name: String,
pub port: String,
pub path: String,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum MonitoringApiVersion {
V1Helm,
V2CRD,
V3RHOB,
}
#[derive(Debug, Clone)]
pub struct MonitoringStack {
pub version: MonitoringApiVersion,
pub namespace: String,
pub alert_channels: Vec<Arc<dyn AlertSender>>,
pub scrape_targets: Vec<ScrapeTargetConfig>,
}
impl MonitoringStack {
pub fn new(version: MonitoringApiVersion) -> Self {
Self {
version,
namespace: "monitoring".to_string(),
alert_channels: Vec::new(),
scrape_targets: Vec::new(),
}
}
pub fn set_namespace(mut self, namespace: &str) -> Self {
self.namespace = namespace.to_string();
self
}
pub fn add_alert_channel(mut self, channel: impl AlertSender + 'static) -> Self {
self.alert_channels.push(Arc::new(channel));
self
}
pub fn set_scrape_targets(mut self, targets: Vec<(&str, &str, String)>) -> Self {
self.scrape_targets = targets
.into_iter()
.map(|(name, port, path)| ScrapeTargetConfig {
service_name: name.to_string(),
port: port.to_string(),
path,
})
.collect();
self
}
}
pub trait AlertChannel<Sender: AlertSender> {
fn install_config(&self, sender: &Sender);
fn name(&self) -> String;
}
#[derive(Debug, Clone)]
pub struct DiscordWebhook {
pub name: K8sName,
pub url: Url,
pub selectors: Vec<HashMap<String, String>>,
}
impl DiscordWebhook {
fn get_config(&self) -> AlertManagerChannelConfig {
let mut route = Mapping::new();
route.insert(
Value::String("receiver".to_string()),
Value::String(self.name.to_string()),
);
route.insert(
Value::String("matchers".to_string()),
Value::Sequence(vec![Value::String("alertname!=Watchdog".to_string())]),
);
let mut receiver = Mapping::new();
receiver.insert(
Value::String("name".to_string()),
Value::String(self.name.to_string()),
);
let mut discord_config = Mapping::new();
discord_config.insert(
Value::String("webhook_url".to_string()),
Value::String(self.url.to_string()),
);
receiver.insert(
Value::String("discord_configs".to_string()),
Value::Sequence(vec![Value::Mapping(discord_config)]),
);
AlertManagerChannelConfig {
channel_receiver: Value::Mapping(receiver),
channel_route: Value::Mapping(route),
}
}
}
impl AlertChannel<CRDPrometheus> for DiscordWebhook {
fn install_config(&self, sender: &CRDPrometheus) {
debug!("Installing Discord webhook for CRDPrometheus in namespace: {}", sender.namespace());
debug!("Config: {:?}", self.get_config());
debug!("Installed!");
}
fn name(&self) -> String {
"discord-webhook".to_string()
}
}
impl AlertChannel<RHOBObservability> for DiscordWebhook {
fn install_config(&self, sender: &RHOBObservability) {
debug!("Installing Discord webhook for RHOBObservability in namespace: {}", sender.namespace());
debug!("Config: {:?}", self.get_config());
debug!("Installed!");
}
fn name(&self) -> String {
"webhook-receiver".to_string()
}
}
impl AlertChannel<KubePrometheus> for DiscordWebhook {
fn install_config(&self, sender: &KubePrometheus) {
debug!("Installing Discord webhook for KubePrometheus in namespace: {}", sender.namespace());
// Lock once: std::sync::Mutex is not reentrant, so taking a second guard
// while the first is still alive would deadlock.
let mut config = sender.config.lock().unwrap();
let ns = config.namespace.clone().unwrap_or_else(|| "monitoring".to_string());
debug!("Namespace: {}", ns);
config.alert_receiver_configs.push(self.get_config());
debug!("Installed!");
}
fn name(&self) -> String {
"discord-webhook".to_string()
}
}
fn default_monitoring_stack() -> MonitoringStack {
MonitoringStack::new(MonitoringApiVersion::V2CRD)
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct TenantMonitoringScore {
pub tenant_id: harmony_types::id::Id,
pub tenant_name: String,
#[serde(skip)]
#[serde(default = "default_monitoring_stack")]
pub monitoring_stack: MonitoringStack,
}
impl TenantMonitoringScore {
pub fn new(tenant_name: &str, monitoring_stack: MonitoringStack) -> Self {
Self {
tenant_id: harmony_types::id::Id::default(),
tenant_name: tenant_name.to_string(),
monitoring_stack,
}
}
}
impl<T: Topology + TenantManager> Score<T> for TenantMonitoringScore {
fn create_interpret(&self) -> Box<dyn Interpret<T>> {
Box::new(TenantMonitoringInterpret {
score: self.clone(),
})
}
fn name(&self) -> String {
format!("{} monitoring [TenantMonitoringScore]", self.tenant_name)
}
}
#[derive(Debug)]
pub struct TenantMonitoringInterpret {
pub score: TenantMonitoringScore,
}
#[async_trait::async_trait]
impl<T: Topology + TenantManager> Interpret<T> for TenantMonitoringInterpret {
async fn execute(
&self,
_inventory: &Inventory,
topology: &T,
) -> Result<Outcome, InterpretError> {
let tenant_config = topology.get_tenant_config().await.unwrap();
let tenant_ns = tenant_config.name.clone();
match self.score.monitoring_stack.version {
MonitoringApiVersion::V1Helm => {
debug!("Installing Helm monitoring for tenant {}", tenant_ns);
}
MonitoringApiVersion::V2CRD => {
debug!("Installing CRD monitoring for tenant {}", tenant_ns);
}
MonitoringApiVersion::V3RHOB => {
debug!("Installing RHOB monitoring for tenant {}", tenant_ns);
}
}
Ok(Outcome::success(format!(
"Installed monitoring stack for tenant {} with version {:?}",
self.score.tenant_name,
self.score.monitoring_stack.version
)))
}
fn get_name(&self) -> InterpretName {
InterpretName::Custom("TenantMonitoringInterpret")
}
fn get_version(&self) -> Version {
Version::from("1.0.0").unwrap()
}
fn get_status(&self) -> InterpretStatus {
InterpretStatus::SUCCESS
}
fn get_children(&self) -> Vec<harmony_types::id::Id> {
Vec::new()
}
}

9
book.toml Normal file

@@ -0,0 +1,9 @@
[book]
title = "Harmony"
description = "Infrastructure orchestration that treats your platform like first-class code"
src = "docs"
build-dir = "book"
authors = ["NationTech"]
[output.html]
mathjax-support = false

11
build/book.sh Executable file

@@ -0,0 +1,11 @@
#!/bin/sh
set -e
cd "$(dirname "$0")/.."
cargo install mdbook --locked
mdbook build
test -f book/index.html || (echo "ERROR: book/index.html not found" && exit 1)
test -f book/concepts.html || (echo "ERROR: book/concepts.html not found" && exit 1)
test -f book/guides/getting-started.html || (echo "ERROR: book/guides/getting-started.html not found" && exit 1)


@@ -1,6 +1,8 @@
#!/bin/sh
set -e
cd "$(dirname "$0")/.."
rustc --version
cargo check --all-targets --all-features --keep-going
cargo fmt --check

16
build/ci.sh Executable file

@@ -0,0 +1,16 @@
#!/bin/sh
set -e
cd "$(dirname "$0")/.."
BRANCH="${1:-main}"
echo "=== Running CI for branch: $BRANCH ==="
echo "--- Checking code ---"
./build/check.sh
echo "--- Building book ---"
./build/book.sh
echo "=== CI passed ==="


@@ -13,8 +13,8 @@ If you're new to Harmony, start here:
See how to use Harmony to solve real-world problems.
- [**PostgreSQL on Local K3D**](./use-cases/postgresql-on-local-k3d.md): Deploy a production-grade PostgreSQL cluster on a local K3D cluster. The fastest way to get started.
- [**OKD on Bare Metal**](./use-cases/okd-on-bare-metal.md): A detailed walkthrough of bootstrapping a high-availability OKD cluster from physical hardware.
- [**Deploy a Rust Web App**](./use-cases/deploy-rust-webapp.md): A quick guide to deploying a monitored, containerized web application to a Kubernetes cluster.
## 3. Component Catalogs
@@ -31,16 +31,7 @@ Ready to build your own components? These guides show you how.
- [**Writing a Score**](./guides/writing-a-score.md): Learn how to create your own `Score` and `Interpret` logic to define a new desired state.
- [**Writing a Topology**](./guides/writing-a-topology.md): Learn how to model a new environment (like AWS, GCP, or custom hardware) as a `Topology`.
- [**Adding Capabilities**](./guides/adding-capabilities.md): See how to add a `Capability` to your custom `Topology`.
- [**Coding Guide**](./coding-guide.md): Conventions and best practices for writing Harmony code.
## 5. Module Documentation
## 5. Architecture Decision Records
Deep dives into specific Harmony modules and features.
- [**Monitoring and Alerting**](./monitoring.md): Comprehensive guide to cluster, tenant, and application-level monitoring with support for OKD, KubePrometheus, RHOB, and more.
## 6. Architecture Decision Records
Important architectural decisions are documented in the `adr/` directory:
- [Full ADR Index](../adr/)
Harmony's design is documented through Architecture Decision Records (ADRs). See the [ADR Overview](./adr/README.md) for a complete index of all decisions.

53
docs/SUMMARY.md Normal file

@@ -0,0 +1,53 @@
# Summary
[Harmony Documentation](./README.md)
- [Core Concepts](./concepts.md)
- [Getting Started Guide](./guides/getting-started.md)
## Use Cases
- [PostgreSQL on Local K3D](./use-cases/postgresql-on-local-k3d.md)
- [OKD on Bare Metal](./use-cases/okd-on-bare-metal.md)
## Component Catalogs
- [Scores Catalog](./catalogs/scores.md)
- [Topologies Catalog](./catalogs/topologies.md)
- [Capabilities Catalog](./catalogs/capabilities.md)
## Developer Guides
- [Developer Guide](./guides/developer-guide.md)
- [Writing a Score](./guides/writing-a-score.md)
- [Writing a Topology](./guides/writing-a-topology.md)
- [Adding Capabilities](./guides/adding-capabilities.md)
## Configuration
- [Configuration](./concepts/configuration.md)
## Architecture Decision Records
- [ADR Overview](./adr/README.md)
- [000 · ADR Template](./adr/000-ADR-Template.md)
- [001 · Why Rust](./adr/001-rust.md)
- [002 · Hexagonal Architecture](./adr/002-hexagonal-architecture.md)
- [003 · Infrastructure Abstractions](./adr/003-infrastructure-abstractions.md)
- [004 · iPXE](./adr/004-ipxe.md)
- [005 · Interactive Project](./adr/005-interactive-project.md)
- [006 · Secret Management](./adr/006-secret-management.md)
- [007 · Default Runtime](./adr/007-default-runtime.md)
- [008 · Score Display Formatting](./adr/008-score-display-formatting.md)
- [009 · Helm and Kustomize Handling](./adr/009-helm-and-kustomize-handling.md)
- [010 · Monitoring and Alerting](./adr/010-monitoring-and-alerting.md)
- [011 · Multi-Tenant Cluster](./adr/011-multi-tenant-cluster.md)
- [012 · Project Delivery Automation](./adr/012-project-delivery-automation.md)
- [013 · Monitoring Notifications](./adr/013-monitoring-notifications.md)
- [015 · Higher Order Topologies](./adr/015-higher-order-topologies.md)
- [016 · Harmony Agent and Global Mesh](./adr/016-Harmony-Agent-And-Global-Mesh-For-Decentralized-Workload-Management.md)
- [017-1 · NATS Clusters Interconnection](./adr/017-1-Nats-Clusters-Interconnection-Topology.md)
- [018 · Template Hydration for Workload Deployment](./adr/018-Template-Hydration-For-Workload-Deployment.md)
- [019 · Network Bond Setup](./adr/019-Network-bond-setup.md)
- [020 · Interactive Configuration Crate](./adr/020-interactive-configuration-crate.md)
- [020-1 · Zitadel + OpenBao Secure Config Store](./adr/020-1-zitadel-openbao-secure-config-store.md)


@@ -2,7 +2,7 @@
## Status
Proposed
Rejected: see ADR 020 (./020-interactive-configuration-crate.md)
### TODO [#3](https://git.nationtech.io/NationTech/harmony/issues/3):


@@ -0,0 +1,233 @@
# ADR 020-1: Zitadel OIDC and OpenBao Integration for the Config Store
Author: Jean-Gabriel Gill-Couture
Date: 2026-03-18
## Status
Proposed
## Context
ADR 020 defines a unified `harmony_config` crate with a `ConfigStore` trait. The default team-oriented backend is OpenBao, which provides encrypted storage, versioned KV, audit logging, and fine-grained access control.
OpenBao requires authentication. The question is how developers authenticate without introducing new credentials to manage.
The goals are:
- **Zero new credentials.** Developers log in with their existing corporate identity (Google Workspace, GitHub, or Microsoft Entra ID / Azure AD).
- **Headless compatibility.** The flow must work over SSH, inside containers, and in CI — environments with no browser or localhost listener.
- **Minimal friction.** After a one-time login, authentication should be invisible for weeks of active use.
- **Centralized offboarding.** Revoking a user in the identity provider must immediately revoke their access to the config store.
## Decision
Developers authenticate to OpenBao through a two-step process: first, they obtain an OIDC token from Zitadel (`sso.nationtech.io`) using the OAuth 2.0 Device Authorization Grant (RFC 8628); then, they exchange that token for a short-lived OpenBao client token via OpenBao's JWT auth method.
### The authentication flow
#### Step 1: Trigger
The `ConfigManager` attempts to resolve a value via the `StoreSource`. The `StoreSource` checks for a cached OpenBao token in `~/.local/share/harmony/session.json`. If the token is missing or expired, authentication begins.
#### Step 2: Device Authorization Request
Harmony sends a `POST` to Zitadel's device authorization endpoint:
```
POST https://sso.nationtech.io/oauth/v2/device_authorization
Content-Type: application/x-www-form-urlencoded
client_id=<harmony_client_id>&scope=openid email profile offline_access
```
Zitadel responds with:
```json
{
"device_code": "dOcbPeysDhT26ZatRh9n7Q",
"user_code": "GQWC-FWFK",
"verification_uri": "https://sso.nationtech.io/device",
"verification_uri_complete": "https://sso.nationtech.io/device?user_code=GQWC-FWFK",
"expires_in": 300,
"interval": 5
}
```
#### Step 3: User prompt
Harmony prints the code and URL to the terminal:
```
[Harmony] To authenticate, open your browser to:
https://sso.nationtech.io/device
and enter code: GQWC-FWFK
Or visit: https://sso.nationtech.io/device?user_code=GQWC-FWFK
```
If a desktop environment is detected, Harmony also calls `open` / `xdg-open` to launch the browser automatically. The `verification_uri_complete` URL pre-fills the code, so the user only needs to click "Confirm" after logging in.
There is no localhost HTTP listener. The CLI does not need to bind a port or receive a callback. This is what makes the device flow work over SSH, in containers, and through corporate firewalls — unlike the `oc login` approach which spins up a temporary web server to catch a redirect.
#### Step 4: User login
The developer logs in through Zitadel's web UI using one of the configured identity providers:
- **Google Workspace** — for teams using Google as their corporate identity.
- **GitHub** — for open-source or GitHub-centric teams.
- **Microsoft Entra ID (Azure AD)** — for enterprise clients, particularly common in Quebec and the broader Canadian public sector.
Zitadel federates the login to the chosen provider. The developer authenticates with their existing corporate credentials. No new password is created.
#### Step 5: Polling
While the user is authenticating in the browser, Harmony polls Zitadel's token endpoint at the interval specified in the device authorization response (typically 5 seconds):
```
POST https://sso.nationtech.io/oauth/v2/token
Content-Type: application/x-www-form-urlencoded
grant_type=urn:ietf:params:oauth:grant-type:device_code
&device_code=dOcbPeysDhT26ZatRh9n7Q
&client_id=<harmony_client_id>
```
Before the user completes login, Zitadel responds with `authorization_pending`. Once the user consents, Zitadel returns:
```json
{
"access_token": "...",
"token_type": "Bearer",
"expires_in": 3600,
"refresh_token": "...",
"id_token": "eyJhbGciOiJSUzI1NiIs..."
}
```
The `offline_access` scope in the initial device authorization request is what causes Zitadel to issue a `refresh_token`.
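The polling step can be sketched as a small decision function. This is a minimal illustration, not Harmony's implementation: the `PollResponse` and `PollAction` types and the `next_action` function are hypothetical names, and the HTTP call itself is left out. The error codes and the 5-second back-off on `slow_down` follow RFC 8628, section 3.5.

```rust
use std::time::Duration;

/// One response from the token endpoint, reduced to the cases that
/// RFC 8628 section 3.5 distinguishes. Hypothetical type for illustration.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum PollResponse {
    /// HTTP 200 with a token set; only the `id_token` is kept here.
    Granted { id_token: String },
    /// `error = "authorization_pending"`: the user has not finished yet.
    AuthorizationPending,
    /// `error = "slow_down"`: the client is polling too fast.
    SlowDown,
    /// `error = "access_denied"`: the user refused the request.
    AccessDenied,
    /// `error = "expired_token"`: the device code is no longer valid.
    ExpiredToken,
}

/// What the polling loop should do after a response.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum PollAction {
    Done { id_token: String },
    Wait(Duration),
    Abort(&'static str),
}

/// Maps a response to the next step, given the current polling interval.
pub fn next_action(response: PollResponse, interval: Duration) -> PollAction {
    match response {
        PollResponse::Granted { id_token } => PollAction::Done { id_token },
        PollResponse::AuthorizationPending => PollAction::Wait(interval),
        // RFC 8628: on slow_down the interval grows by 5 seconds.
        PollResponse::SlowDown => PollAction::Wait(interval + Duration::from_secs(5)),
        PollResponse::AccessDenied => PollAction::Abort("user denied the request"),
        PollResponse::ExpiredToken => PollAction::Abort("device code expired; restart the flow"),
    }
}
```

A caller would loop on `next_action`, sleeping for the returned duration and carrying the enlarged interval forward after a `slow_down`.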
#### Step 6: OpenBao JWT exchange
Harmony sends the `id_token` (a JWT signed by Zitadel) to OpenBao's JWT auth method:
```
POST https://secrets.nationtech.io/v1/auth/jwt/login
Content-Type: application/json
{
"role": "harmony-developer",
"jwt": "eyJhbGciOiJSUzI1NiIs..."
}
```
OpenBao validates the JWT:
1. It fetches Zitadel's public keys from `https://sso.nationtech.io/oauth/v2/keys` (the JWKS endpoint).
2. It verifies the JWT signature.
3. It reads the claims (`email`, `groups`, and any custom claims mapped from the upstream identity provider, such as Azure AD tenant or Google Workspace org).
4. It evaluates the claims against the `bound_claims` and `bound_audiences` configured on the `harmony-developer` role.
5. If validation passes, OpenBao returns a client token:
```json
{
"auth": {
"client_token": "hvs.CAES...",
"policies": ["harmony-dev"],
"metadata": { "role": "harmony-developer" },
"lease_duration": 14400,
"renewable": true
}
}
```
Harmony caches the OpenBao token, the OIDC refresh token, and the token expiry timestamps to `~/.local/share/harmony/session.json` with `0600` file permissions.
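One way to write the session file with owner-only permissions is sketched below. It assumes a Unix target and uses only the standard library; `write_private_file` is a hypothetical helper name, and serializing the session to JSON is out of scope here.

```rust
use std::fs::{OpenOptions, Permissions};
use std::io::{self, Write};
use std::os::unix::fs::{OpenOptionsExt, PermissionsExt};
use std::path::Path;

/// Writes `contents` to `path` so that only the owner can read or write it.
/// Unix-only sketch; the function name is an assumption.
pub fn write_private_file(path: &Path, contents: &[u8]) -> io::Result<()> {
    let mut file = OpenOptions::new()
        .write(true)
        .create(true)
        .truncate(true)
        // The mode is applied only when the file is created.
        .mode(0o600)
        .open(path)?;
    // Tighten permissions explicitly in case the file already existed
    // with a looser mode.
    file.set_permissions(Permissions::from_mode(0o600))?;
    file.write_all(contents)?;
    file.sync_all()
}
```

Creating the file with `0600` from the start, rather than creating it and restricting it afterwards, avoids a window in which the tokens are readable by other users.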
### OpenBao storage structure
All configuration and secret state is stored in an OpenBao Versioned KV v2 engine.
Path taxonomy:
```
harmony/<organization>/<project>/<environment>/<key>
```
Examples:
```
harmony/nationtech/my-app/staging/PostgresConfig
harmony/nationtech/my-app/production/PostgresConfig
harmony/nationtech/my-app/local-shared/PostgresConfig
```
The `ConfigClass` (Standard vs. Secret) can influence OpenBao policy structure — for example, `Secret`-class paths could require stricter ACLs or additional audit backends — but the path taxonomy itself does not change. This is an operational concern configured in OpenBao policies, not a structural one enforced by path naming.
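The path taxonomy above can be sketched as a small builder. `store_path` is a hypothetical helper, and rejecting empty segments or segments containing `/` is an assumption made here to keep the four-level layout unambiguous; the ADR does not specify validation rules.

```rust
/// Builds `harmony/<organization>/<project>/<environment>/<key>`.
/// Returns `None` if any segment is empty or contains a `/`.
/// Hypothetical helper; the validation rule is an assumption.
pub fn store_path(
    organization: &str,
    project: &str,
    environment: &str,
    key: &str,
) -> Option<String> {
    let segments = [organization, project, environment, key];
    if segments.iter().any(|s| s.is_empty() || s.contains('/')) {
        return None;
    }
    Some(format!("harmony/{}", segments.join("/")))
}
```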
### Token lifecycle and silent refresh
The system manages three tokens with different lifetimes:
| Token | TTL | Max TTL | Purpose |
|---|---|---|---|
| OpenBao client token | 4 hours | 24 hours | Read/write config store |
| OIDC ID token | 1 hour | — | Exchange for OpenBao token |
| OIDC refresh token | 90 days absolute, 30 days inactivity | — | Obtain new ID tokens silently |
The refresh flow, from the developer's perspective:
1. **Same session (< 4 hours since last use).** The cached OpenBao token is still valid. No network call to Zitadel. Fastest path.
2. **Next day (OpenBao token expired, refresh token valid).** Harmony uses the OIDC `refresh_token` to request a new `id_token` from Zitadel's token endpoint (`grant_type=refresh_token`). It then exchanges the new `id_token` for a fresh OpenBao token. This happens silently. The developer sees no prompt.
3. **OpenBao token near max TTL (approaching 24 hours of cumulative renewals).** Instead of renewing, Harmony re-authenticates using the refresh token to get a completely fresh OpenBao token. Transparent to the user.
4. **After 30 days of inactivity.** The OIDC refresh token expires. Harmony falls back to the device flow (Step 2 above) and prompts the user to re-authenticate in the browser. This is the only scenario where a returning developer sees a login prompt.
5. **User offboarded.** An administrator revokes the user's account or group membership in Zitadel. The next time the refresh token is used, Zitadel rejects it. The device flow also fails because the user can no longer authenticate. Access is terminated without any action needed on the OpenBao side.
OpenBao token renewal uses the `/auth/token/renew-self` endpoint with the `X-Vault-Token` header. Harmony renews proactively at ~75% of the TTL to avoid race conditions.
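The refresh scenarios above reduce to a decision over a handful of timestamps. The sketch below is one possible reading, not Harmony's code: `Session`, `AuthAction`, and `next_auth_action` are hypothetical names, and `refresh_expiry` is assumed to be precomputed as the earlier of the absolute and inactivity deadlines.

```rust
use std::time::{Duration, SystemTime};

/// What the client should do before its next store access.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AuthAction {
    UseCachedToken,
    RenewOpenBaoToken,
    RefreshViaOidc,
    DeviceFlow,
}

/// Cached session state. Field names are assumptions for illustration.
pub struct Session {
    /// Start of the current OpenBao TTL window (issuance or last renewal).
    pub bao_issued_at: SystemTime,
    /// OpenBao token TTL (4 hours in this ADR).
    pub bao_ttl: Duration,
    /// First issuance plus max TTL (24 hours in this ADR).
    pub bao_max_expiry: SystemTime,
    /// Earlier of the refresh token's absolute and inactivity deadlines.
    pub refresh_expiry: SystemTime,
}

pub fn next_auth_action(session: &Session, now: SystemTime) -> AuthAction {
    let bao_expiry = session.bao_issued_at + session.bao_ttl;
    // Renew proactively at 75% of the TTL.
    let renew_at = session.bao_issued_at + session.bao_ttl * 3 / 4;

    if now < bao_expiry {
        if now < renew_at {
            return AuthAction::UseCachedToken;
        }
        // Renew only if a full TTL still fits under the max TTL.
        if now + session.bao_ttl <= session.bao_max_expiry {
            return AuthAction::RenewOpenBaoToken;
        }
    }
    // Token expired or close to max TTL: re-authenticate silently if possible.
    if now < session.refresh_expiry {
        AuthAction::RefreshViaOidc
    } else {
        AuthAction::DeviceFlow
    }
}
```

Keeping the decision pure, with `now` passed in, lets each of the five scenarios be unit tested without network access or a real clock.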
### OpenBao role configuration
The OpenBao JWT auth role for Harmony developers:
```bash
bao write auth/jwt/config \
oidc_discovery_url="https://sso.nationtech.io" \
bound_issuer="https://sso.nationtech.io"
bao write auth/jwt/role/harmony-developer \
role_type="jwt" \
bound_audiences="<harmony_client_id>" \
user_claim="email" \
groups_claim="urn:zitadel:iam:org:project:roles" \
policies="harmony-dev" \
ttl="4h" \
max_ttl="24h" \
token_type="service"
```
The `bound_audiences` claim ties the role to the specific Harmony Zitadel application. The `groups_claim` allows mapping Zitadel project roles to OpenBao policies for per-team or per-project access control.
### Self-hosted deployments
For organizations running their own infrastructure, the same architecture applies. The operator deploys Zitadel and OpenBao using Harmony's existing `ZitadelScore` and `OpenbaoScore`. The only configuration needed is three environment variables (or their equivalents in the bootstrap config):
- `HARMONY_SSO_URL` — the Zitadel instance URL.
- `HARMONY_SECRETS_URL` — the OpenBao instance URL.
- `HARMONY_SSO_CLIENT_ID` — the Zitadel application client ID.
None of these are secrets. They can be committed to an infrastructure repository or distributed via any convenient channel.
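Reading the three variables can be sketched as follows. `BootstrapConfig` and `load_bootstrap` are hypothetical names, and treating a blank value as missing is an assumption. The lookup is passed in as a closure so the function can be tested without touching the process environment.

```rust
/// The three non-secret bootstrap values. Hypothetical struct.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BootstrapConfig {
    pub sso_url: String,
    pub secrets_url: String,
    pub sso_client_id: String,
}

/// Loads the bootstrap values through `lookup`, which maps a variable
/// name to its value. Blank values are treated as missing (an assumption).
pub fn load_bootstrap<F>(lookup: F) -> Result<BootstrapConfig, String>
where
    F: Fn(&str) -> Option<String>,
{
    let required = |name: &str| {
        lookup(name)
            .filter(|value| !value.trim().is_empty())
            .ok_or_else(|| format!("missing environment variable {}", name))
    };
    Ok(BootstrapConfig {
        sso_url: required("HARMONY_SSO_URL")?,
        secrets_url: required("HARMONY_SECRETS_URL")?,
        sso_client_id: required("HARMONY_SSO_CLIENT_ID")?,
    })
}
```

In production code the lookup would typically be `|name| std::env::var(name).ok()`.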
## Consequences
### Positive
- Developers authenticate with existing corporate credentials. No new passwords, no static tokens to distribute.
- The device flow works in every environment: local terminal, SSH, containers, CI runners, corporate VPNs.
- Silent token refresh keeps developers authenticated for weeks without any manual intervention.
- User offboarding is a single action in Zitadel. No OpenBao token rotation or manual revocation required.
- Azure AD / Microsoft Entra ID support addresses the enterprise and public sector market.
### Negative
- The OAuth state machine (device code polling, token refresh, error handling) adds implementation complexity compared to a static token approach.
- Developers must have network access to `sso.nationtech.io` and `secrets.nationtech.io` to pull or push configuration state. True offline work falls back to the local file store, which does not sync with the team.
- The first login per machine requires a browser interaction. Fully headless first-run scenarios (e.g., a fresh CI runner with no pre-seeded tokens) must use `EnvSource` overrides or a service account JWT.


@@ -0,0 +1,177 @@
# ADR 020: Unified Configuration and Secret Management
Author: Jean-Gabriel Gill-Couture
Date: 2026-03-18
## Status
Proposed
## Context
Harmony's orchestration logic depends on runtime data that falls into two categories:
1. **Secrets** — credentials, tokens, private keys.
2. **Operational configuration** — deployment targets, host selections, port assignments, reboot decisions, and similar contextual choices.
Both categories share the same fundamental lifecycle: a value must be acquired before execution can proceed, it may come from several backends (environment variable, remote store, interactive prompt), and it must be shareable across a team without polluting the Git repository.
Treating these categories as separate subsystems forces developers to choose between a "config API" and a "secret API" at every call site. The only meaningful difference between the two is how the storage backend handles the data (plaintext vs. encrypted, audited vs. unaudited) and how the CLI displays it (visible vs. masked). That difference belongs in the backend, not in the application code.
Three concrete problems drive this change:
- **Async terminal corruption.** `inquire` prompts assume exclusive terminal ownership. Background tokio tasks emitting log output during a prompt corrupt the terminal state. This is inherent to Harmony's concurrent orchestration model.
- **Untestable code paths.** Any function containing an inline `inquire` call requires a real TTY to execute, so it cannot be unit tested; the affected tests are currently marked as ignored.
- **No backend integration.** Inline prompts cannot be answered from a remote store, an environment variable, or a CI pipeline. Every automated deployment that passes through a prompting code path requires a human operator at a terminal.
## Decision
A single workspace crate, `harmony_config`, provides all configuration and secret acquisition for Harmony. It replaces both `harmony_secret` and all inline `inquire` usage.
### Schema in Git, state in the store
The Rust type system serves as the configuration schema. Developers declare what configuration is needed by defining structs:
```rust
#[derive(Config, Serialize, Deserialize, JsonSchema, InteractiveParse)]
struct PostgresConfig {
pub host: String,
pub port: u16,
#[config(secret)]
pub password: String,
}
```
These structs live in Git and evolve with the code. When a branch introduces a new field, Git tracks that schema change. The actual values live in an external store — OpenBao by default. No `.env` files, no JSON config files, no YAML in the repository.
### Data classification
```rust
/// Tells the storage backend how to handle the data.
pub enum ConfigClass {
/// Plaintext storage is acceptable.
Standard,
/// Must be encrypted at rest, masked in UI, subject to audit logging.
Secret,
}
```
Classification is determined at the struct level. A struct with no `#[config(secret)]` fields has `ConfigClass::Standard`. A struct with one or more `#[config(secret)]` fields is elevated to `ConfigClass::Secret`. The struct is always stored as a single cohesive JSON blob; field-level splitting across backends is not a concern of the trait.
The `#[config(secret)]` attribute also instructs the `PromptSource` to mask terminal input for that field during interactive prompting.
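The struct-level elevation rule can be sketched without the proc macro. `FieldMeta`, `classify`, and `display_value` are hypothetical names standing in for whatever the derive macro collects and generates; the `ConfigClass` enum is redeclared here only to keep the sketch self-contained.

```rust
/// Same two classes as above, redeclared so the sketch compiles on its own.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConfigClass {
    Standard,
    Secret,
}

/// Per-field metadata a derive macro might collect. Hypothetical type.
pub struct FieldMeta {
    pub name: &'static str,
    /// True when the field carries `#[config(secret)]`.
    pub secret: bool,
}

/// A single secret field elevates the whole struct to `Secret`.
pub fn classify(fields: &[FieldMeta]) -> ConfigClass {
    if fields.iter().any(|field| field.secret) {
        ConfigClass::Secret
    } else {
        ConfigClass::Standard
    }
}

/// Masks secret values when echoing them back to a terminal.
pub fn display_value(field: &FieldMeta, value: &str) -> String {
    if field.secret {
        "********".to_string()
    } else {
        value.to_string()
    }
}
```

For the `PostgresConfig` example, `host` and `port` are plain fields and `password` is secret, so the struct as a whole classifies as `ConfigClass::Secret`.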
### The Config trait
```rust
pub trait Config: Serialize + DeserializeOwned + JsonSchema + InteractiveParseObj + Sized {
/// Stable lookup key. By default, the struct name.
const KEY: &'static str;
/// How the backend should treat this data.
const CLASS: ConfigClass;
}
```
A `#[derive(Config)]` proc macro generates the implementation. The macro inspects field attributes to determine `CLASS`.
### The ConfigStore trait
```rust
#[async_trait]
pub trait ConfigStore: Send + Sync {
async fn get(
&self,
class: ConfigClass,
namespace: &str,
key: &str,
) -> Result<Option<serde_json::Value>, ConfigError>;
async fn set(
&self,
class: ConfigClass,
namespace: &str,
key: &str,
value: &serde_json::Value,
) -> Result<(), ConfigError>;
}
```
The `class` parameter is a hint. The store implementation decides what to do with it. An OpenBao store may route `Secret` data to a different path prefix or apply stricter ACLs. A future store could split fields across backends — that is an implementation concern, not a trait concern.
### Resolution chain
The `ConfigManager` tries sources in priority order:
1. **`EnvSource`** — reads `HARMONY_CONFIG_{KEY}` as a JSON string. Override hatch for CI/CD pipelines and containerized environments.
2. **`StoreSource`** — wraps a `ConfigStore` implementation. For teams, this is the OpenBao backend authenticated via Zitadel OIDC (see ADR 020-1).
3. **`PromptSource`** — presents an `interactive-parse` prompt on the terminal. Acquires a process-wide async mutex before rendering to prevent log output corruption.
When `PromptSource` obtains a value, the `ConfigManager` persists it back to the `StoreSource` so that subsequent runs — by the same developer or any teammate — resolve without prompting.
Callers that do not include `PromptSource` in their source list never block on a TTY. Test code passes empty source lists and constructs config structs directly.
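The resolution chain can be sketched in a synchronous, string-only form. This is a simplification under stated assumptions: the real sources are described as async and typed, while `Source`, `EnvLike`, `MemoryStore`, `CannedPrompt`, and `resolve` here are hypothetical stand-ins that use in-memory maps instead of the environment, OpenBao, or a terminal.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

/// A place a raw config value can come from. Hypothetical trait.
pub trait Source {
    fn get(&self, key: &str) -> Option<String>;
    /// True for sources whose answers should be written back to the store.
    fn persist_to_store(&self) -> bool {
        false
    }
}

/// Stands in for `EnvSource`: looks up `HARMONY_CONFIG_{KEY}` in a map.
pub struct EnvLike(pub HashMap<String, String>);
impl Source for EnvLike {
    fn get(&self, key: &str) -> Option<String> {
        self.0.get(&format!("HARMONY_CONFIG_{}", key)).cloned()
    }
}

/// Stands in for `StoreSource`: an in-memory key-value store.
pub struct MemoryStore(pub RefCell<HashMap<String, String>>);
impl Source for MemoryStore {
    fn get(&self, key: &str) -> Option<String> {
        self.0.borrow().get(key).cloned()
    }
}

/// Stands in for `PromptSource`: always answers with a fixed value.
pub struct CannedPrompt(pub String);
impl Source for CannedPrompt {
    fn get(&self, _key: &str) -> Option<String> {
        Some(self.0.clone())
    }
    fn persist_to_store(&self) -> bool {
        true
    }
}

/// Tries each source in order; prompted values are saved to `store`.
pub fn resolve(key: &str, sources: &[&dyn Source], store: &MemoryStore) -> Option<String> {
    for source in sources {
        if let Some(value) = source.get(key) {
            if source.persist_to_store() {
                store.0.borrow_mut().insert(key.to_string(), value.clone());
            }
            return Some(value);
        }
    }
    None
}
```

Leaving `CannedPrompt` out of the source list yields `None` on a miss instead of blocking, which mirrors the behaviour described for callers that omit `PromptSource`.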
### Schema versioning
The Rust struct is the schema. When a developer renames a field, removes a field, or changes a type on a branch, the store may still contain data shaped for a previous version of the struct. If another team member who does not yet have that commit runs the code, `serde_json::from_value` will fail on the stale entry.
In the initial implementation, the resolution chain handles this gracefully: a deserialization failure is treated as a cache miss, and the `PromptSource` fires. The prompted value overwrites the stale entry in the store.
This is sufficient for small teams working on short-lived branches. It is not sufficient at scale, where silent re-prompting could mask real configuration drift.
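The "stale entry is a cache miss" rule can be illustrated with a toy decoder. This sketch does not use `serde_json`: `Endpoint`, `decode`, and `load_or_prompt` are hypothetical, and a `host:port` string stands in for the stored JSON blob so the example stays dependency-free.

```rust
/// Toy config shape standing in for a real `Config` struct.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Endpoint {
    pub host: String,
    pub port: u16,
}

/// Decodes `host:port`. Plays the role of deserializing a stored entry.
pub fn decode(raw: &str) -> Result<Endpoint, String> {
    let (host, port) = raw.rsplit_once(':').ok_or("expected host:port")?;
    let port = port.parse::<u16>().map_err(|e| e.to_string())?;
    Ok(Endpoint { host: host.to_string(), port })
}

/// Returns the value plus a flag saying whether the store entry must be
/// overwritten. A missing entry and an undecodable entry are both misses.
pub fn load_or_prompt<F>(stored: Option<&str>, prompt: F) -> (Endpoint, bool)
where
    F: FnOnce() -> Endpoint,
{
    match stored.map(decode) {
        Some(Ok(value)) => (value, false),
        Some(Err(_)) | None => (prompt(), true),
    }
}
```

Because the decode error is discarded, a genuinely corrupted entry and a merely outdated one look the same to the caller, which is the silent re-prompting risk noted above.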
A future iteration will introduce a compile-time schema migration mechanism, similar to how `sqlx` verifies queries against a live database at compile time. The mechanism will:
- Detect schema drift between the Rust struct and the stored JSON.
- Apply named, ordered migration functions to transform stored data forward.
- Reject ambiguous migrations at compile time rather than silently corrupting state.
Until that mechanism exists, teams should treat store entries as soft caches: the struct definition is always authoritative, and the store is best-effort.
## Rationale
**Why merge secrets and config into one crate?** Separate crates with nearly identical trait shapes (`Secret` vs `Config`, `SecretStore` vs `ConfigStore`) force developers to make a classification decision at every call site. A unified crate with a `ConfigClass` discriminator moves that decision to the struct definition, where it belongs.
**Why OpenBao as the default backend?** OpenBao is a fully open-source Vault fork under the Linux Foundation. It runs on-premises with no phone-home requirement — a hard constraint for private cloud and regulated environments. Harmony already deploys OpenBao for clients (`OpenbaoScore`), so no new infrastructure is introduced.
**Why not store values in Git (e.g., encrypted YAML)?** Git-tracked config files create merge conflicts, require re-encryption on team membership changes, and leak metadata (file names, key names) even when values are encrypted. Storing state in OpenBao avoids all of these issues and provides audit logging, access control, and versioned KV out of the box.
**Why keep `PromptSource`?** Removing interactive prompts entirely would break the zero-infrastructure bootstrapping path and eliminate human-confirmation safety gates for destructive operations (interface reconfiguration, node reboot). The problem was never that prompts exist — it is that they were unavoidable and untestable. Making `PromptSource` an explicit, opt-in entry in the source list restores control.
## Consequences
### Positive
- A single API surface for all runtime data acquisition.
- All currently-ignored tests become runnable without TTY access.
- Async terminal corruption is eliminated by the process-wide prompt mutex.
- The bootstrapping path requires no infrastructure for a first run; `PromptSource` alone is sufficient.
- The team path (OpenBao + Zitadel) reuses infrastructure Harmony already deploys.
- User offboarding is a single Zitadel action.
### Negative
- Migrating all inline `inquire` and `harmony_secret` call sites is a significant refactoring effort.
- Until the schema migration mechanism is built, store entries for renamed or removed fields become stale and must be re-prompted.
- The Zitadel device flow introduces a browser step on first login per machine.
## Implementation Plan
### Phase 1: Trait design and crate restructure
Refactor `harmony_config` to define the final `Config`, `ConfigClass`, and `ConfigStore` traits. Update the derive macro to support `#[config(secret)]` and generate the correct `CLASS` constant. Implement `EnvSource` and `PromptSource` against the new traits. Write comprehensive unit tests using mock stores.
### Phase 2: Absorb `harmony_secret`
Migrate the `OpenbaoSecretStore`, `InfisicalSecretStore`, and `LocalFileSecretStore` implementations from `harmony_secret` into `harmony_config` as `ConfigStore` backends. Update all call sites that use `SecretManager::get`, `SecretManager::get_or_prompt`, or `SecretManager::set` to use `harmony_config` equivalents.
### Phase 3: Migrate inline prompts
Replace all inline `inquire` call sites in the `harmony` crate (`infra/brocade.rs`, `infra/network_manager.rs`, `modules/okd/host_network.rs`, and others) with `harmony_config` structs and `get_or_prompt` calls. Un-ignore the affected tests.
### Phase 4: Zitadel and OpenBao integration
Implement the authentication flow described in ADR 020-1. Wire `StoreSource` to use Zitadel OIDC tokens for OpenBao access. Implement token caching and silent refresh.
### Phase 5: Remove `harmony_secret`
Delete the `harmony_secret` and `harmony_secret_derive` crates from the workspace. All functionality now lives in `harmony_config`.

63
docs/adr/README.md Normal file

@@ -0,0 +1,63 @@
# Architecture Decision Records
An Architecture Decision Record (ADR) documents a significant architectural decision made during the development of Harmony — along with its context, rationale, and consequences.
## Why We Use ADRs
As a platform engineering framework used by a team, Harmony accumulates technical decisions over time. ADRs help us:
- **Track rationale** — understand _why_ a decision was made, not just _what_ was decided
- **Onboard new contributors** — the "why" is preserved even when team membership changes
- **Avoid repeating past mistakes** — previous decisions and their context are searchable
- **Manage technical debt** — ADRs make it easier to revisit and revise past choices
An ADR captures a decision at a point in time. It is not a specification — it is a record of reasoning.
## ADR Format
Every ADR follows this structure:
| Section | Purpose |
|---------|---------|
| **Status** | Proposed / Pending / Accepted / Implemented / Deprecated |
| **Context** | The problem or background — the "why" behind this decision |
| **Decision** | The chosen solution or direction |
| **Rationale** | Reasoning behind the decision |
| **Consequences** | Both positive and negative outcomes |
| **Alternatives considered** | Other options that were evaluated |
| **Additional Notes** | Supplementary context, links, or open questions |
## ADR Index
| Number | Title | Status |
|--------|-------|--------|
| [000](./000-ADR-Template.md) | ADR Template | Reference |
| [001](./001-rust.md) | Why Rust | Accepted |
| [002](./002-hexagonal-architecture.md) | Hexagonal Architecture | Accepted |
| [003](./003-infrastructure-abstractions.md) | Infrastructure Abstractions | Accepted |
| [004](./004-ipxe.md) | iPXE | Accepted |
| [005](./005-interactive-project.md) | Interactive Project | Proposed |
| [006](./006-secret-management.md) | Secret Management | Accepted |
| [007](./007-default-runtime.md) | Default Runtime | Accepted |
| [008](./008-score-display-formatting.md) | Score Display Formatting | Proposed |
| [009](./009-helm-and-kustomize-handling.md) | Helm and Kustomize Handling | Accepted |
| [010](./010-monitoring-and-alerting.md) | Monitoring and Alerting | Accepted |
| [011](./011-multi-tenant-cluster.md) | Multi-Tenant Cluster | Accepted |
| [012](./012-project-delivery-automation.md) | Project Delivery Automation | Proposed |
| [013](./013-monitoring-notifications.md) | Monitoring Notifications | Accepted |
| [015](./015-higher-order-topologies.md) | Higher Order Topologies | Proposed |
| [016](./016-Harmony-Agent-And-Global-Mesh-For-Decentralized-Workload-Management.md) | Harmony Agent and Global Mesh | Proposed |
| [017-1](./017-1-Nats-Clusters-Interconnection-Topology.md) | NATS Clusters Interconnection Topology | Proposed |
| [018](./018-Template-Hydration-For-Workload-Deployment.md) | Template Hydration for Workload Deployment | Proposed |
| [019](./019-Network-bond-setup.md) | Network Bond Setup | Proposed |
| [020-1](./020-1-zitadel-openbao-secure-config-store.md) | Zitadel + OpenBao Secure Config Store | Accepted |
| [020](./020-interactive-configuration-crate.md) | Interactive Configuration Crate | Proposed |
## Contributing
When making a significant technical change:
1. **Check existing ADRs** — the decision may already be documented
2. **Create a new ADR** using the [template](./000-ADR-Template.md) if the change warrants architectural discussion
3. **Set status to Proposed** and open it for team review
4. Once accepted and implemented, update the status accordingly


@@ -84,7 +84,7 @@ Network services that run inside the cluster or as part of the topology.
- **OKDLoadBalancerScore**: Configures the high-availability load balancers for the OKD API and ingress.
- **OKDBootstrapLoadBalancerScore**: Configures the load balancer specifically for the bootstrap-time API endpoint.
- **K8sIngressScore**: Configures an Ingress controller or resource.
- [HighAvailabilityHostNetworkScore](../../harmony/src/modules/okd/host_network.rs): Configures network bonds on a host and the corresponding port-channels on the switch stack for high-availability.
- **HighAvailabilityHostNetworkScore**: Configures network bonds on a host and the corresponding port-channels on the switch stack for high-availability.
## Tenant Management


@@ -1,299 +0,0 @@
# Harmony Coding Guide
Harmony is an infrastructure automation framework. It is **code-first and code-only**: operators write Rust programs to declare and drive infrastructure, rather than YAML files or DSL configs. Good code here means a good operator experience.
### Concrete context
This guide uses the KVM module as a running example. The same style should carry over to the other modules and contexts Harmony manages, such as OPNSense and Kubernetes.
## Core Philosophy
### The Careful Craftsman Principle
Harmony is a powerful framework that does a lot. With that power comes responsibility. Every abstraction, every trait, every module must earn its place. Before adding anything, ask:
1. **Does this solve a real problem users have?** Not a theoretical problem, an actual one encountered in production.
2. **Is this the simplest solution that works?** Complexity is a cost that compounds over time.
3. **Will this make the next developer's life easier or harder?** Code is read far more often than written.
When in doubt, don't abstract. Wait for the pattern to emerge from real usage. A little duplication is better than the wrong abstraction.
### High-level functions over raw primitives
Callers should not need to know about underlying protocols, XML schemas, or API quirks. A function that deploys a VM should accept meaningful parameters like CPU count, memory, and network name — not XML strings.
```rust
// Bad: caller constructs XML and passes it to a thin wrapper
let xml = format!(r#"<domain type='kvm'>...</domain>"#, name, memory_kb, ...);
executor.create_vm(&xml).await?;
// Good: caller describes intent, the module handles representation
executor.define_vm(&VmConfig::builder("my-vm")
        .cpu(4)
        .memory_gb(8)
        .disk(DiskConfig::new(50))
        .network(NetworkRef::named("mylan"))
        .boot_order([BootDevice::Network, BootDevice::Disk])
        .build())
    .await?;
```
The module owns the XML, the virsh invocations, the API calls — not the caller.
### Use the right abstraction layer
Prefer native library bindings over shelling out to CLI tools. The `virt` crate provides direct libvirt bindings and should be used instead of spawning `virsh` subprocesses.
- CLI subprocess calls are fragile: stdout/stderr parsing, exit codes, quoting, PATH differences
- Native bindings give typed errors, no temp files, no shell escaping
- `virt::connect::Connect` opens a connection; `virt::domain::Domain` manages VMs; `virt::network::Network` manages virtual networks
### Keep functions small and well-named
Each function should do one thing. If a function is doing two conceptually separate things, split it. Function names should read like plain English: `ensure_network_active`, `define_vm`, `vm_is_running`.
### Prefer short modules over large files
Group related types and functions by concept. A module that handles one resource (e.g., network, domain, storage) is better than a single file for everything.
---
## Error Handling
### Use `thiserror` for all error types
Define error types with `thiserror::Error`. This removes the boilerplate of implementing `Display` and `std::error::Error` by hand, keeps error messages close to their variants, and makes types easy to extend.
```rust
// Bad: hand-rolled Display + std::error::Error
#[derive(Debug)]
pub enum KVMError {
    ConnectionError(String),
    VMNotFound(String),
}
impl std::fmt::Display for KVMError { ... }
impl std::error::Error for KVMError {}
// Good: derive Display via thiserror
#[derive(thiserror::Error, Debug)]
pub enum KVMError {
    #[error("connection failed: {0}")]
    ConnectionFailed(String),
    #[error("VM not found: {name}")]
    VmNotFound { name: String },
}
```
### Make bubbling errors easy with `?` and `From`
`?` works on any error type for which there is a `From` impl. Add `From` conversions from lower-level errors into your module's error type so callers can use `?` without boilerplate.
With `thiserror`, wrapping a foreign error is one line:
```rust
#[derive(thiserror::Error, Debug)]
pub enum KVMError {
    #[error("libvirt error: {0}")]
    Libvirt(#[from] virt::error::Error),
    #[error("IO error: {0}")]
    Io(#[from] std::io::Error),
}
```
This means a call that returns `virt::error::Error` can be `?`-propagated into a `Result<_, KVMError>` without any `.map_err(...)`.
### Typed errors over stringly-typed errors
Avoid `Box<dyn Error>` or `String` as error return types in library code. Callers need to distinguish errors programmatically — `KVMError::VmAlreadyExists` is actionable, `"VM already exists: foo"` as a `String` is not.
At binary entry points (e.g., `main`) it is acceptable to convert to `String` or `anyhow::Error` for display.
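To make the two points above concrete (`From` conversions for `?`, and typed errors that callers can match on), here is a minimal sketch. It uses only the standard library so that it stands alone, which means `Display` and `From` are written by hand. In Harmony these would come from the `thiserror` derive as recommended earlier. `VmSpecError` and `parse_vm_spec` are hypothetical names.

```rust
use std::fmt;
use std::num::ParseIntError;

/// Hypothetical error type, for illustration only.
#[derive(Debug)]
pub enum VmSpecError {
    MissingField(&'static str),
    InvalidMemory(ParseIntError),
}

impl fmt::Display for VmSpecError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            VmSpecError::MissingField(name) => write!(f, "missing field: {name}"),
            VmSpecError::InvalidMemory(e) => write!(f, "invalid memory size: {e}"),
        }
    }
}

impl std::error::Error for VmSpecError {}

// This conversion is what lets `?` turn a ParseIntError into a VmSpecError.
impl From<ParseIntError> for VmSpecError {
    fn from(e: ParseIntError) -> Self {
        VmSpecError::InvalidMemory(e)
    }
}

/// Parses a spec of the form "name:memory_gb", e.g. "bootstrap:8".
pub fn parse_vm_spec(spec: &str) -> Result<(String, u32), VmSpecError> {
    let (name, memory) = spec
        .split_once(':')
        .ok_or(VmSpecError::MissingField("memory_gb"))?;
    if name.is_empty() {
        return Err(VmSpecError::MissingField("name"));
    }
    let memory_gb = memory.parse::<u32>()?; // no .map_err(...) needed
    Ok((name.to_string(), memory_gb))
}
```

A caller can then branch on `VmSpecError::MissingField(_)` versus `VmSpecError::InvalidMemory(_)` instead of inspecting a message string.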
---
## Logging
### Use the `log` crate macros
All log output must go through the `log` crate. Never use `println!`, `eprintln!`, or `dbg!` in library code. This makes output compatible with any logging backend (env_logger, tracing, structured logging, etc.).
```rust
// Bad
println!("Creating VM: {}", name);
// Good
use log::{info, debug, warn};
info!("Creating VM: {name}");
debug!("VM XML:\n{xml}");
warn!("Network already active, skipping creation");
```
Use the right level:
| Level | When to use |
|---------|-------------|
| `error` | Unrecoverable failures (before returning Err) |
| `warn` | Recoverable issues, skipped steps |
| `info` | High-level progress events visible in normal operation |
| `debug` | Detailed operational info useful for debugging |
| `trace` | Very granular, per-iteration or per-call data |
Log before significant operations and after unexpected conditions. Do not log inside tight loops at `info` level.
---
## Types and Builders
### Derive `Serialize` on all public domain types
All public structs and enums that represent configuration or state should derive `serde::Serialize`. Add `Deserialize` when round-trip serialization is needed.
### Builder pattern for complex configs
When a type has more than three fields or optional fields, provide a builder. The builder pattern allows named, incremental construction without positional arguments.
```rust
let config = VmConfig::builder("bootstrap")
    .cpu(4)
    .memory_gb(8)
    .disk(DiskConfig::new(50).labeled("os"))
    .disk(DiskConfig::new(100).labeled("data"))
    .network(NetworkRef::named("harmonylan"))
    .boot_order([BootDevice::Network, BootDevice::Disk])
    .build();
```
### Avoid `pub` fields on config structs
Expose data through methods or the builder, not raw field access. This preserves the ability to validate, rename, or change representation without breaking callers.
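One possible shape for such a builder is sketched below, reduced to three fields. The fields are private, reads go through accessor methods, and validation happens once in `build()`. Returning a `Result` from `build()` and the `VmConfigError` type are assumptions of this sketch. The usage example above shows `build()` returning the config directly.

```rust
/// Hypothetical validation error, for illustration only.
#[derive(Debug, PartialEq)]
pub enum VmConfigError {
    EmptyName,
    ZeroCpu,
    ZeroMemory,
}

#[derive(Debug)]
pub struct VmConfig {
    // Private fields: callers cannot bypass validation or depend on layout.
    name: String,
    cpu: u32,
    memory_gb: u32,
}

impl VmConfig {
    pub fn builder(name: &str) -> VmConfigBuilder {
        VmConfigBuilder { name: name.to_string(), cpu: 1, memory_gb: 1 }
    }
    pub fn name(&self) -> &str { &self.name }
    pub fn cpu(&self) -> u32 { self.cpu }
    pub fn memory_gb(&self) -> u32 { self.memory_gb }
}

pub struct VmConfigBuilder {
    name: String,
    cpu: u32,
    memory_gb: u32,
}

impl VmConfigBuilder {
    pub fn cpu(mut self, cpu: u32) -> Self { self.cpu = cpu; self }
    pub fn memory_gb(mut self, gb: u32) -> Self { self.memory_gb = gb; self }

    /// All invariants are checked in one place, at construction time.
    pub fn build(self) -> Result<VmConfig, VmConfigError> {
        if self.name.is_empty() { return Err(VmConfigError::EmptyName); }
        if self.cpu == 0 { return Err(VmConfigError::ZeroCpu); }
        if self.memory_gb == 0 { return Err(VmConfigError::ZeroMemory); }
        Ok(VmConfig { name: self.name, cpu: self.cpu, memory_gb: self.memory_gb })
    }
}
```

Because no field is `pub`, the internal representation (for example, storing memory in KiB) could change later without breaking callers.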
---
## Async
### Use `tokio` for all async runtime needs
All async code runs on tokio. Use `tokio::spawn`, `tokio::time`, etc. Use `#[async_trait]` for traits with async methods.
### No blocking in async context
Never call blocking I/O (file I/O, network, process spawn) directly in an async function. Use `tokio::fs`, `tokio::process`, or `tokio::task::spawn_blocking` as appropriate.
---
## Module Structure
### Follow the `Score` / `Interpret` pattern
Modules that represent deployable infrastructure should implement `Score<T: Topology>` and `Interpret<T>`:
- `Score` is the serializable, clonable configuration declaring *what* to deploy
- `Interpret` does the actual work when `execute()` is called
```rust
pub struct KvmScore {
    network: NetworkConfig,
    vms: Vec<VmConfig>,
}
impl<T: Topology + KvmHost> Score<T> for KvmScore {
    fn create_interpret(&self) -> Box<dyn Interpret<T>> {
        Box::new(KvmInterpret::new(self.clone()))
    }
    fn name(&self) -> String { "KvmScore".to_string() }
}
```
### Flatten the public API in `mod.rs`
Internal submodules are implementation detail. Re-export what callers need at the module root:
```rust
// modules/kvm/mod.rs
mod connection;
mod domain;
mod network;
mod error;
mod xml;
pub use connection::KvmConnection;
pub use domain::{VmConfig, VmConfigBuilder, VmStatus, DiskConfig, BootDevice};
pub use error::KvmError;
pub use network::NetworkConfig;
```
---
## Commit Style
Follow [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/):
```
feat(kvm): add network isolation support
fix(kvm): correct memory unit conversion for libvirt
refactor(kvm): replace virsh subprocess calls with virt crate bindings
docs: add coding guide
```
Keep pull requests small and single-purpose (under ~200 lines excluding generated code). Do not mix refactoring, bug fixes, and new features in one PR.
---
## When to Add Abstractions
Harmony provides powerful abstraction mechanisms: traits, generics, the Score/Interpret pattern, and capabilities. Use them judiciously.
### Add an abstraction when:
- **You have three or more concrete implementations** doing the same thing. Two is often coincidence; three is a pattern.
- **The abstraction provides compile-time safety** that prevents real bugs (e.g., capability bounds on topologies).
- **The abstraction hides genuine complexity** that callers shouldn't need to understand (e.g., XML schema generation for libvirt).
### Don't add an abstraction when:
- **It's just to avoid a few lines of boilerplate**. Copy-paste is sometimes better than a trait hierarchy.
- **You're anticipating future flexibility** that isn't needed today. YAGNI (You Aren't Gonna Need It).
- **The abstraction makes the code harder to understand** for someone unfamiliar with the codebase.
- **You're wrapping a single implementation**. A trait with one implementation is usually over-engineering.
### Signs you've over-abstracted:
- You need to explain the type system to a competent Rust developer for them to understand how to add a simple feature.
- Adding a new concrete type requires changes in multiple trait definitions.
- The word "factory" or "manager" appears in your type names.
- You have more trait definitions than concrete implementations.
### The Rule of Three for Traits
Before creating a new trait, ensure you have:
1. A clear, real use case (not hypothetical)
2. At least one concrete implementation
3. A plan for how callers will use it
Only generalize when the pattern is proven. The monitoring module is a good example: we had multiple alert senders (OKD, KubePrometheus, RHOB) before we introduced the `AlertSender` and `AlertReceiver<S>` traits. The traits emerged from real needs, not design sessions.
---
## Documentation
### Document the "why", not the "what"
Code should be self-explanatory for the "what". Comments and documentation should explain intent, rationale, and gotchas.
```rust
// Bad: restates the code
// Returns the number of VMs
fn vm_count(&self) -> usize { self.vms.len() }
// Good: explains the why
// Returns 0 if connection is lost, rather than erroring,
// because monitoring code uses this for health checks
fn vm_count(&self) -> usize { self.vms.len() }
```
### Keep examples in the `examples/` directory
Working code beats documentation. Every major feature should have a runnable example that demonstrates real usage.


@@ -28,6 +28,11 @@ Harmony's design is based on a few key concepts. Understanding them is the key t
- **What it is:** An **Inventory** is the physical material (the "what") used in a cluster. This is most relevant for bare-metal or on-premise topologies.
- **Example:** A list of nodes with their roles (control plane, worker), CPU, RAM, and network interfaces. For the `K8sAnywhereTopology`, the inventory might be empty or autoloaded, as the infrastructure is more abstract.
### 6. Configuration & Secrets
- **What it is:** Configuration represents the runtime data required to deploy your `Scores`. This includes both non-sensitive state (like cluster hostnames, deployment profiles) and sensitive secrets (like API keys, database passwords).
- **How it works:** See the [Configuration Concept Guide](./concepts/configuration.md) to understand Harmony's unified approach to managing schema in Git and state in OpenBao.
---
### How They Work Together (The Compile-Time Check)


@@ -0,0 +1,107 @@
# Configuration and Secrets
Harmony treats configuration and secrets as a single concern. Developers use one crate, `harmony_config`, to declare, store, and retrieve all runtime data — whether it is a public hostname or a database password.
## The mental model: schema in Git, state in the store
### Schema
In Harmony, the Rust code is the configuration schema. You declare what your module needs by defining a struct:
```rust
#[derive(Config, Serialize, Deserialize, JsonSchema, InteractiveParse)]
struct PostgresConfig {
    pub host: String,
    pub port: u16,
    #[config(secret)]
    pub password: String,
}
```
This struct is tracked in Git. When a branch adds a new field, Git tracks that the branch requires a new value. When a branch removes a field, the old value in the store becomes irrelevant. The struct is always authoritative.
### State
The actual values live in a config store — by default, OpenBao. No `.env` files, no JSON, no YAML in the repository.
When you run your code, Harmony reads the struct (schema) and resolves values from the store (state):
- If the store has the value, it is injected seamlessly.
- If the store does not have it, Harmony prompts you in the terminal. Your answer is pushed back to the store automatically.
- When a teammate runs the same code, they are not prompted — you already provided the value.
### How branch switching works
Because the schema is just Rust code tracked in Git, branch switching works naturally:
1. You check out `feat/redis`. The code now requires `RedisConfig`.
2. You run `cargo run`. Harmony detects that `RedisConfig` has no value in the store. It prompts you.
3. You provide the values. Harmony pushes them to OpenBao.
4. Your teammate checks out `feat/redis` and runs `cargo run`. No prompt — the values are already in the store.
5. You switch back to `main`. `RedisConfig` does not exist in that branch's code. The store entry is ignored.
## Secrets vs. standard configuration
From your application code, there is no difference. You always call `harmony_config::get_or_prompt::<T>()`.
The difference is in the struct definition:
```rust
// Standard config — stored in plaintext, displayed during prompting.
#[derive(Config)]
struct ClusterConfig {
    pub api_url: String,
    pub namespace: String,
}
// Contains a secret field — the entire struct is stored encrypted,
// and the password field is masked during terminal prompting.
#[derive(Config)]
struct DatabaseConfig {
    pub host: String,
    #[config(secret)]
    pub password: String,
}
```
If a struct contains any `#[config(secret)]` field, Harmony elevates the entire struct to `ConfigClass::Secret`. The storage backend decides what that means in practice — in the case of OpenBao, it may route the data to a path with stricter ACLs or audit policies.
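The elevation rule can be pictured as a fold over per-field metadata: one secret field is enough to mark the whole struct as secret. The sketch below hand-writes what a derive macro might generate. `FieldMeta`, `class_of`, and the `Standard` variant are hypothetical names. Only `ConfigClass::Secret` and a `CLASS` constant are mentioned in the documentation itself.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConfigClass {
    Standard, // assumed name for the non-secret class
    Secret,
}

/// Hypothetical per-field metadata a derive macro could collect.
pub struct FieldMeta {
    pub name: &'static str,
    pub secret: bool,
}

/// A struct is `Secret` as soon as any one of its fields is marked secret.
pub const fn class_of(fields: &[FieldMeta]) -> ConfigClass {
    let mut i = 0;
    while i < fields.len() {
        if fields[i].secret {
            return ConfigClass::Secret;
        }
        i += 1;
    }
    ConfigClass::Standard
}

pub trait Config {
    const CLASS: ConfigClass;
}

pub struct ClusterConfig;
impl Config for ClusterConfig {
    const CLASS: ConfigClass = class_of(&[
        FieldMeta { name: "api_url", secret: false },
        FieldMeta { name: "namespace", secret: false },
    ]);
}

pub struct DatabaseConfig;
impl Config for DatabaseConfig {
    const CLASS: ConfigClass = class_of(&[
        FieldMeta { name: "host", secret: false },
        FieldMeta { name: "password", secret: true },
    ]);
}
```

Computing the class in a `const fn` means a store backend can read `T::CLASS` at compile time and pick a storage path without inspecting any values.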
## Authentication and team sharing
Harmony uses Zitadel (hosted at `sso.nationtech.io`) for identity and OpenBao (hosted at `secrets.nationtech.io`) for storage.
**First run on a new machine:**
1. Harmony detects that you are not logged in.
2. It prints a short code and URL to your terminal, and opens your browser if possible.
3. You log in with your corporate identity (Google, GitHub, or Microsoft Entra ID / Azure AD).
4. Harmony receives an OIDC token, exchanges it for an OpenBao token, and caches the session locally.
**Subsequent runs:**
- Harmony silently refreshes your tokens in the background. You do not need to log in again for up to 90 days of active use.
- If you are inactive for 30 days, or if an administrator revokes your access in Zitadel, you will be prompted to re-authenticate.
**Offboarding:**
Revoking a user in Zitadel immediately invalidates their ability to refresh tokens or obtain new ones. No manual secret rotation is required.
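The session rules above can be read as three independent conditions, any one of which forces a new login. The sketch below encodes that reading with the 90-day and 30-day limits from the text. The `Session` type and `needs_login` function are hypothetical. In practice these limits would be enforced by Zitadel when a refresh is attempted, not by client-side arithmetic.

```rust
use std::time::Duration;

const fn days(n: u64) -> Duration {
    Duration::from_secs(n * 24 * 60 * 60)
}

const MAX_SESSION_AGE: Duration = days(90);
const MAX_IDLE: Duration = days(30);

/// Hypothetical view of a cached login, for illustration only.
pub struct Session {
    /// Time since the last interactive login.
    pub age: Duration,
    /// Time since the session was last used.
    pub idle: Duration,
    /// Whether an administrator revoked access.
    pub revoked: bool,
}

/// Returns true when the user must go through the browser flow again.
pub fn needs_login(s: &Session) -> bool {
    s.revoked || s.idle >= MAX_IDLE || s.age >= MAX_SESSION_AGE
}
```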
## Resolution chain
When Harmony resolves a config value, it tries sources in order:
1. **Environment variable** (`HARMONY_CONFIG_{KEY}`) — highest priority. Use this in CI/CD to override any value without touching the store.
2. **Config store** (OpenBao for teams, local file for solo/offline use) — the primary source for shared team state.
3. **Interactive prompt** — last resort. Prompts the developer and persists the answer back to the store.
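The chain above can be sketched as a single function that tries each source in turn and writes a prompted answer back to the store. This is a simplified model, not the `harmony_config` API. Plain `HashMap`s stand in for the environment and the store, a closure stands in for the terminal prompt, and the key is appended verbatim to `HARMONY_CONFIG_` (the exact key format is an assumption).

```rust
use std::collections::HashMap;

/// Which source produced the value.
#[derive(Debug, PartialEq)]
pub enum Origin {
    Env,
    Store,
    Prompt,
}

/// Tries the environment, then the store, then the prompt.
/// A prompted answer is persisted so the next run hits the store.
pub fn resolve(
    key: &str,
    env: &HashMap<String, String>,
    store: &mut HashMap<String, String>,
    prompt: impl FnOnce(&str) -> String,
) -> (String, Origin) {
    let env_key = format!("HARMONY_CONFIG_{key}");
    if let Some(value) = env.get(&env_key) {
        return (value.clone(), Origin::Env);
    }
    if let Some(value) = store.get(key) {
        return (value.clone(), Origin::Store);
    }
    let answer = prompt(key);
    store.insert(key.to_string(), answer.clone());
    (answer, Origin::Prompt)
}
```

Note that an environment override is returned without touching the store, which matches the stated goal of overriding values in CI/CD without modifying shared state.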
## Schema versioning
The Rust struct is the single source of truth for what configuration looks like. If a developer renames or removes a field on a branch, the store may still contain data shaped for the old version of the struct. When another developer who does not have that change runs the code, deserialization will fail.
In the current implementation, this is handled gracefully: a deserialization failure is treated as a miss, and Harmony re-prompts. The new answer overwrites the stale entry.
A compile-time migration mechanism is planned for a future release to handle this more rigorously at scale.
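The "deserialization failure is a miss" behavior can be sketched as follows. A toy `host:port` parser stands in for real deserialization, a `HashMap` stands in for the store, and `load_or_prompt` is a hypothetical name. The point is only that a missing entry and an unreadable entry take the same path.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub struct RedisConfig {
    pub host: String,
    pub port: u16,
}

/// Stand-in for real deserialization: parses "host:port".
fn parse(raw: &str) -> Option<RedisConfig> {
    let (host, port) = raw.split_once(':')?;
    Some(RedisConfig { host: host.to_string(), port: port.parse::<u16>().ok()? })
}

/// Stand-in for real serialization.
fn render(cfg: &RedisConfig) -> String {
    format!("{}:{}", cfg.host, cfg.port)
}

pub fn load_or_prompt(
    store: &mut HashMap<String, String>,
    key: &str,
    prompt: impl FnOnce() -> RedisConfig,
) -> RedisConfig {
    // A missing entry and an unparseable entry are both treated as a miss.
    if let Some(cfg) = store.get(key).and_then(|raw| parse(raw)) {
        return cfg;
    }
    let fresh = prompt();
    // The new answer overwrites whatever stale data was stored under the key.
    store.insert(key.to_string(), render(&fresh));
    fresh
}
```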
## Offline and local development
If you are working offline or evaluating Harmony without a team OpenBao instance, the `StoreSource` falls back to a local file store at `~/.local/share/harmony/config/`. The developer experience is identical — prompting, caching, and resolution all work the same way. The only difference is that the state is local to your machine and not shared with teammates.


@@ -0,0 +1,135 @@
# Adding Capabilities
`Capabilities` are traits that a `Topology` implements to expose functionality to Scores. They are the "how" — the specific APIs and features that let a Score translate intent into infrastructure actions.
## How Capabilities Work
When a Score declares it needs certain Capabilities:
```rust
impl<T: Topology + K8sclient + HelmCommand> Score<T> for MyScore {
    // ...
}
```
The compiler verifies that the target `Topology` implements both `K8sclient` and `HelmCommand`. If it doesn't, compilation fails. This is the compile-time safety check that prevents invalid configurations from reaching production.
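The same mechanism can be reproduced in a few lines of plain Rust. In the sketch below, `KubeApi` and `Helm` are simplified stand-ins for `K8sclient` and `HelmCommand`, and `deploy` plays the role of a Score. None of these names are Harmony's real API.

```rust
pub trait Topology {
    fn name(&self) -> &str;
}
/// Stand-in for a Kubernetes client capability.
pub trait KubeApi {
    fn apply(&self, manifest: &str) -> String;
}
/// Stand-in for a Helm capability.
pub trait Helm {
    fn install(&self, chart: &str) -> String;
}

pub struct LocalCluster;
impl Topology for LocalCluster {
    fn name(&self) -> &str { "local-k3d" }
}
impl KubeApi for LocalCluster {
    fn apply(&self, manifest: &str) -> String { format!("applied {manifest}") }
}
impl Helm for LocalCluster {
    fn install(&self, chart: &str) -> String { format!("installed chart {chart}") }
}

/// Implements `Topology` only, so it lacks the capabilities `deploy` needs.
pub struct SwitchOnly;
impl Topology for SwitchOnly {
    fn name(&self) -> &str { "switch-only" }
}

/// Only callable with a target that provides both capabilities.
pub fn deploy<T: Topology + KubeApi + Helm>(target: &T) -> Vec<String> {
    vec![
        format!("target {}", target.name()),
        target.install("cloudnative-pg"),
        target.apply("postgres-cluster.yaml"),
    ]
}

// Uncommenting the next line fails to compile, because `SwitchOnly`
// implements neither `KubeApi` nor `Helm`:
// fn broken() { deploy(&SwitchOnly); }
```

The mismatch is reported by the compiler at the call site, before any infrastructure is touched.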
## Built-in Capabilities
Harmony provides a set of standard Capabilities:
| Capability | What it provides |
|------------|------------------|
| `K8sclient` | A Kubernetes API client |
| `HelmCommand` | A configured `helm` CLI invocation |
| `TlsRouter` | TLS certificate management |
| `NetworkManager` | Host network configuration |
| `SwitchClient` | Network switch configuration |
| `CertificateManagement` | Certificate issuance via cert-manager |
## Implementing a Capability
A Capability is a trait that you implement on your Topology:
```rust
use std::sync::Arc;
use async_trait::async_trait;
use harmony_k8s::K8sClient;
use harmony::topology::K8sclient;
pub struct MyTopology {
    kubeconfig: Option<String>,
}
#[async_trait]
impl K8sclient for MyTopology {
    async fn k8s_client(&self) -> Result<Arc<K8sClient>, String> {
        let client = match &self.kubeconfig {
            Some(path) => K8sClient::from_kubeconfig(path).await?,
            None => K8sClient::try_default().await?,
        };
        Ok(Arc::new(client))
    }
}
```
## Adding a Custom Capability
For specialized infrastructure needs, add your own Capability as a trait:
```rust
use async_trait::async_trait;
use crate::executors::ExecutorError;
/// A capability for configuring network switches
#[async_trait]
pub trait SwitchClient: Send + Sync {
    async fn configure_port(
        &self,
        switch: &str,
        port: &str,
        vlan: u16,
    ) -> Result<(), ExecutorError>;
    async fn configure_port_channel(
        &self,
        switch: &str,
        name: &str,
        ports: &[&str],
    ) -> Result<(), ExecutorError>;
}
```
Then implement it on your Topology:
```rust
use std::sync::Arc;
use async_trait::async_trait;
use harmony_infra::brocade::BrocadeClient;
pub struct MyTopology {
    switch_client: Arc<dyn SwitchClient>,
}
// The trait is declared with #[async_trait], so the impl needs it too.
#[async_trait]
impl SwitchClient for MyTopology {
    async fn configure_port(&self, switch: &str, port: &str, vlan: u16) -> Result<(), ExecutorError> {
        self.switch_client.configure_port(switch, port, vlan).await
    }
    async fn configure_port_channel(&self, switch: &str, name: &str, ports: &[&str]) -> Result<(), ExecutorError> {
        self.switch_client.configure_port_channel(switch, name, ports).await
    }
}
```
Now Scores that need `SwitchClient` can run on `MyTopology`.
## Capability Composition
Topologies often compose multiple Capabilities to support complex Scores:
```rust
pub struct HAClusterTopology {
    pub kubeconfig: Option<String>,
    pub router: Arc<dyn Router>,
    pub load_balancer: Arc<dyn LoadBalancer>,
    pub switch_client: Arc<dyn SwitchClient>,
    pub dhcp_server: Arc<dyn DhcpServer>,
    pub dns_server: Arc<dyn DnsServer>,
    // ...
}
impl K8sclient for HAClusterTopology { ... }
impl HelmCommand for HAClusterTopology { ... }
impl SwitchClient for HAClusterTopology { ... }
impl DhcpServer for HAClusterTopology { ... }
impl DnsServer for HAClusterTopology { ... }
impl Router for HAClusterTopology { ... }
impl LoadBalancer for HAClusterTopology { ... }
```
A Score that needs all of these can run on `HAClusterTopology` because the Topology provides all of them.
## Best Practices
- **Keep Capabilities focused** — one Capability per concern (Kubernetes client, Helm, switch config)
- **Return meaningful errors** — use specific error types so Scores can handle failures appropriately
- **Make Capabilities optional where sensible** — not every Topology needs every Capability; use `Option<T>` or a separate trait for optional features
- **Document preconditions** — if a Capability requires the infrastructure to be in a specific state, document it in the trait doc comments
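For the "optional where sensible" practice, one way to model an optional feature is a trait method with a default that returns `None`, which a Topology overrides only when it has the feature. The sketch below is an assumption about how this could look. The `Metrics` trait and both topologies are hypothetical. Note that, unlike a trait bound, this is a runtime check.

```rust
/// Hypothetical optional capability.
pub trait Metrics {
    fn scrape_url(&self) -> String;
}

pub trait Topology {
    fn name(&self) -> &str;
    /// Optional capability: absent unless a topology overrides this.
    fn metrics(&self) -> Option<&dyn Metrics> {
        None
    }
}

pub struct Prometheus {
    pub host: String,
}
impl Metrics for Prometheus {
    fn scrape_url(&self) -> String {
        format!("http://{}/metrics", self.host)
    }
}

/// A topology without monitoring: relies on the default `None`.
pub struct LabTopology;
impl Topology for LabTopology {
    fn name(&self) -> &str { "lab" }
}

/// A topology with monitoring: exposes its Prometheus instance.
pub struct ProdTopology {
    pub prometheus: Prometheus,
}
impl Topology for ProdTopology {
    fn name(&self) -> &str { "prod" }
    fn metrics(&self) -> Option<&dyn Metrics> {
        Some(&self.prometheus)
    }
}

/// Callers degrade gracefully when the capability is missing.
pub fn describe_monitoring(t: &dyn Topology) -> String {
    match t.metrics() {
        Some(m) => format!("{}: scraping {}", t.name(), m.scrape_url()),
        None => format!("{}: monitoring skipped", t.name()),
    }
}
```

When a Score cannot work at all without the feature, a trait bound is the better fit, since the compiler then rejects unsupported topologies.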


@@ -0,0 +1,40 @@
# Developer Guide
This section covers how to extend Harmony by building your own `Score`, `Topology`, and `Capability` implementations.
## Writing a Score
A `Score` is a declarative description of desired state. To create your own:
1. Define a struct that represents your desired state
2. Implement the `Score<T>` trait, where `T` is your target `Topology`
3. Implement the `Interpret<T>` trait to define how the Score translates to infrastructure actions
See the [Writing a Score](./writing-a-score.md) guide for a step-by-step walkthrough.
## Writing a Topology
A `Topology` models your infrastructure environment. To create your own:
1. Define a struct that holds your infrastructure configuration
2. Implement the `Topology` trait
3. Implement the `Capability` traits your Score needs
See the [Writing a Topology](./writing-a-topology.md) guide for details.
## Adding Capabilities
`Capabilities` are the specific APIs or features a `Topology` exposes. They are the bridge between Scores and the actual infrastructure.
See the [Adding Capabilities](./adding-capabilities.md) guide for details on implementing and exposing Capabilities.
## Core Traits Reference
| Trait | Purpose |
|-------|---------|
| `Score<T>` | Declares desired state ("what") |
| `Topology` | Represents infrastructure ("where") |
| `Interpret<T>` | Execution logic ("how") |
| `Capability` | A feature exposed by a Topology |
See [Core Concepts](../concepts.md) for the conceptual foundation.


@@ -1,42 +1,230 @@
# Getting Started Guide
Welcome to Harmony! This guide will walk you through installing the Harmony framework, setting up a new project, and deploying your first application.
This guide walks you through deploying your first application with Harmony — a PostgreSQL cluster on a local Kubernetes cluster (K3D). By the end, you'll understand the core workflow: compile a Score, run it through the Harmony CLI, and verify the result.
We will build and deploy the "Rust Web App" example, which automatically:
## What you'll deploy
1. Provisions a local K3D (Kubernetes in Docker) cluster.
2. Deploys a sample Rust web application.
3. Sets up monitoring for the application.
A fully functional PostgreSQL cluster running in a local K3D cluster, managed by the CloudNativePG operator. This demonstrates the full Harmony pattern:
1. Provision a local Kubernetes cluster (K3D)
2. Install the required operator (CloudNativePG)
3. Create a PostgreSQL cluster
4. Expose it as a Kubernetes Service
## Prerequisites
Before you begin, you'll need a few tools installed on your system:
Before you begin, install the following tools:
- **Rust & Cargo:** [Install Rust](https://www.rust-lang.org/tools/install)
- **Docker:** [Install Docker](https://docs.docker.com/get-docker/) (Required for the K3D local cluster)
- **kubectl:** [Install kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/) (For inspecting the cluster)
- **Rust & Cargo:** [Install Rust](https://rust-lang.org/tools/install) (edition 2024)
- **Docker:** [Install Docker](https://docs.docker.com/get-docker/) (required for the local K3D cluster)
- **kubectl:** [Install kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/) (optional, for inspecting the cluster)
## 1. Install Harmony
First, clone the Harmony repository and build the project. This gives you the `harmony` CLI and all the core libraries.
## Step 1: Clone and build
```bash
# Clone the main repository
# Clone the repository
git clone https://git.nationtech.io/nationtech/harmony
cd harmony
# Build the project (this may take a few minutes)
# Build the project (this may take a few minutes on first run)
cargo build --release
```
...
## Step 2: Run the PostgreSQL example
## Next Steps
```bash
cargo run -p example-postgresql
```
Congratulations, you've just deployed an application using true infrastructure-as-code!
Harmony will output its progress as it:
From here, you can:
1. **Creates a K3D cluster** named `harmony-postgres-example` (first run only)
2. **Installs the CloudNativePG operator** into the cluster
3. **Creates a PostgreSQL cluster** with 1 instance and 1 GiB of storage
4. **Prints connection details** for your new database
- [Explore the Catalogs](../catalogs/README.md): See what other [Scores](../catalogs/scores.md) and [Topologies](../catalogs/topologies.md) are available.
- [Read the Use Cases](../use-cases/README.md): Check out the [OKD on Bare Metal](./use-cases/okd-on-bare-metal.md) guide for a more advanced scenario.
- [Write your own Score](../guides/writing-a-score.md): Dive into the [Developer Guide](./guides/developer-guide.md) to start building your own components.
Expected output (abbreviated):
```
[+] Cluster created
[+] Installing CloudNativePG operator
[+] Creating PostgreSQL cluster
[+] PostgreSQL cluster is ready
Namespace: harmony-postgres-example
Service: harmony-postgres-example-rw
Username: postgres
Password: <stored in secret harmony-postgres-example-db-user>
```
## Step 3: Verify the deployment
Check that the PostgreSQL pods are running:
```bash
kubectl get pods -n harmony-postgres-example
```
You should see something like:
```
NAME READY STATUS RESTARTS AGE
harmony-postgres-example-1 1/1 Running 0 2m
```
Get the database password:
```bash
kubectl get secret -n harmony-postgres-example harmony-postgres-example-db-user -o jsonpath='{.data.password}' | base64 -d
```
## Step 4: Connect to the database
Forward the PostgreSQL port to your local machine:
```bash
kubectl port-forward -n harmony-postgres-example svc/harmony-postgres-example-rw 5432:5432
```
In another terminal, connect with `psql`:
```bash
psql -h localhost -p 5432 -U postgres
# Enter the password from Step 3 when prompted
```
Try a simple query:
```sql
SELECT version();
```
## Step 5: Clean up
To delete the PostgreSQL cluster and the local K3D cluster:
```bash
k3d cluster delete harmony-postgres-example
```
Alternatively, just delete the PostgreSQL cluster without removing K3D:
```bash
kubectl delete namespace harmony-postgres-example
```
## How it works
The example code (`examples/postgresql/src/main.rs`) is straightforward:
```rust
use harmony::{
    inventory::Inventory,
    modules::postgresql::{PostgreSQLScore, capability::PostgreSQLConfig},
    topology::K8sAnywhereTopology,
};
#[tokio::main]
async fn main() {
    let postgres = PostgreSQLScore {
        config: PostgreSQLConfig {
            cluster_name: "harmony-postgres-example".to_string(),
            namespace: "harmony-postgres-example".to_string(),
            ..Default::default()
        },
    };
    harmony_cli::run(
        Inventory::autoload(),
        K8sAnywhereTopology::from_env(),
        vec![Box::new(postgres)],
        None,
    )
    .await
    .unwrap();
}
```
- **`Inventory::autoload()`** discovers the local environment (or uses an existing inventory)
- **`K8sAnywhereTopology::from_env()`** connects to K3D if `HARMONY_AUTOINSTALL=true` (the default), or to any Kubernetes cluster via `KUBECONFIG`
- **`harmony_cli::run(...)`** executes the Score against the Topology, managing the full lifecycle
## Connecting to an existing cluster
By default, Harmony provisions a local K3D cluster. To use an existing Kubernetes cluster instead:
```bash
export KUBECONFIG=/path/to/your/kubeconfig
export HARMONY_USE_LOCAL_K3D=false
export HARMONY_AUTOINSTALL=false
cargo run -p example-postgresql
```
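The choice between a local K3D cluster and an existing cluster can be pictured as a small decision function. The sketch below is only an illustration under stated assumptions, not Harmony's actual implementation: `ClusterTarget`, `select_target`, and the precedence between the two variables are hypothetical names and rules chosen for this example.

```rust
/// Where the example would deploy. Hypothetical type, for illustration only.
#[derive(Debug, PartialEq)]
pub enum ClusterTarget {
    /// Provision (or reuse) a local K3D cluster.
    LocalK3d,
    /// Use the cluster described by this kubeconfig path.
    Existing(String),
}

/// Interpret a boolean-like environment value; an unset value falls back to `default`.
fn flag(value: Option<&str>, default: bool) -> bool {
    match value {
        Some(v) => matches!(v.trim().to_ascii_lowercase().as_str(), "1" | "true" | "yes"),
        None => default,
    }
}

/// Assumed precedence: an existing cluster is used only when local K3D is
/// explicitly disabled and a kubeconfig path is provided.
pub fn select_target(
    use_local_k3d: Option<&str>,
    kubeconfig: Option<&str>,
) -> Result<ClusterTarget, String> {
    if flag(use_local_k3d, true) {
        return Ok(ClusterTarget::LocalK3d);
    }
    match kubeconfig {
        Some(path) if !path.is_empty() => Ok(ClusterTarget::Existing(path.to_string())),
        _ => Err("HARMONY_USE_LOCAL_K3D=false requires KUBECONFIG to be set".to_string()),
    }
}
```

Passing the values as arguments, rather than reading the environment inside the function, keeps the decision easy to test.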
## Troubleshooting
### Docker is not running
```
Error: could not create cluster: docker is not running
```
Start Docker and try again.
### K3D cluster creation fails
```
Error: failed to create k3d cluster
```
Ensure you have at least 2 CPU cores and 4 GiB of RAM available for Docker.
### `kubectl` cannot connect to the cluster
```
error: unable to connect to a kubernetes cluster
```
After Harmony creates the cluster, it writes the kubeconfig to `~/.kube/config` or to the path in `KUBECONFIG`. Verify:
```bash
kubectl cluster-info --context k3d-harmony-postgres-example
```
### Port forward fails
```
error: unable to forward port
```
Make sure no other process is using port 5432, or use a different local port:
```bash
kubectl port-forward -n harmony-postgres-example svc/harmony-postgres-example-rw 15432:5432
psql -h localhost -p 15432 -U postgres
```
## Next steps
- [Explore the Scores Catalog](../catalogs/scores.md): See what other Scores are available
- [Explore the Topologies Catalog](../catalogs/topologies.md): See what infrastructure Topologies are supported
- [Read the Core Concepts](../concepts.md): Understand the Score / Topology / Interpret pattern in depth
- [OKD on Bare Metal](../use-cases/okd-on-bare-metal.md): See a complete bare-metal deployment example
## Advanced examples
Once you're comfortable with the basics, these examples demonstrate more advanced use cases. Note that some require specific infrastructure (existing Kubernetes clusters, bare-metal hardware, or multi-cluster environments):
| Example | Description | Prerequisites |
|---------|-------------|---------------|
| `monitoring` | Deploy Prometheus alerting with Discord webhooks | Existing K8s cluster |
| `ntfy` | Deploy ntfy notification server | Existing K8s cluster |
| `tenant` | Create a multi-tenant namespace with quotas | Existing K8s cluster |
| `cert_manager` | Provision TLS certificates | Existing K8s cluster |
| `validate_ceph_cluster_health` | Check Ceph cluster health | Existing Rook/Ceph cluster |
| `okd_pxe` / `okd_installation` | Provision OKD on bare metal | HAClusterTopology, bare-metal hardware |
To run any example:
```bash
cargo run -p example-<example_name>
```


@@ -0,0 +1,164 @@
# Writing a Score
A `Score` declares _what_ you want to achieve. It is decoupled from _how_ it is achieved — that logic lives in an `Interpret`.
## The Pattern
A Score consists of two parts:
1. **A struct** — holds the configuration for your desired state
2. **A `Score<T>` implementation** — returns an `Interpret` that knows how to execute
An `Interpret` contains the actual execution logic and connects your Score to the capabilities exposed by a `Topology`.
## Example: A Simple Score
Here's a simplified version of `NtfyScore` from the `ntfy` module:
```rust
use async_trait::async_trait;
use harmony::{
interpret::{Interpret, InterpretError, Outcome},
inventory::Inventory,
score::Score,
topology::{HelmCommand, K8sclient, Topology},
};
/// MyScore declares "I want to install the ntfy server"
#[derive(Debug, Clone)]
pub struct MyScore {
pub namespace: String,
pub host: String,
}
impl<T: Topology + HelmCommand + K8sclient> Score<T> for MyScore {
fn create_interpret(&self) -> Box<dyn Interpret<T>> {
Box::new(MyInterpret { score: self.clone() })
}
fn name(&self) -> String {
"ntfy [MyScore]".into()
}
}
/// MyInterpret knows _how_ to install ntfy using the Topology's capabilities
#[derive(Debug)]
pub struct MyInterpret {
pub score: MyScore,
}
#[async_trait]
impl<T: Topology + HelmCommand + K8sclient> Interpret<T> for MyInterpret {
async fn execute(
&self,
inventory: &Inventory,
topology: &T,
) -> Result<Outcome, InterpretError> {
// 1. Get a Kubernetes client from the Topology
let client = topology.k8s_client().await?;
// 2. Use Helm to install the ntfy chart
// (via topology's HelmCommand capability)
// 3. Wait for the deployment to be ready
client
.wait_until_deployment_ready("ntfy", Some(&self.score.namespace), None)
.await?;
Ok(Outcome::success("ntfy installed".to_string()))
}
}
```
## The Compile-Time Safety Check
The generic `Score<T>` trait is bounded by `T: Topology`, and each implementation can add Capability bounds on top of it. The compiler therefore enforces that your Score only runs on Topologies that expose the capabilities your Interpret needs:
```rust
// This only compiles if K8sAnywhereTopology (or any T)
// implements HelmCommand and K8sclient
impl<T: Topology + HelmCommand + K8sclient> Score<T> for MyScore { ... }
```
If you try to run this Score against a Topology that doesn't expose `HelmCommand`, you get a compile error — before any code runs.
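The mechanism can be reproduced without Harmony at all. The following is a minimal sketch using simplified, synchronous stand-ins for the real traits; `ClusterTopology`, `BareHostTopology`, `ChartScore`, `helm_binary`, and `deploy` are hypothetical names introduced only for this illustration.

```rust
/// Simplified stand-in for Harmony's `Topology` trait (the real one is async and richer).
pub trait Topology {
    fn name(&self) -> &str;
}

/// A capability: something a Topology may or may not be able to do.
pub trait HelmCommand {
    fn helm_binary(&self) -> String;
}

/// Simplified stand-in for `Score<T>`.
pub trait Score<T: Topology> {
    fn run(&self, topology: &T) -> String;
}

/// A topology that exposes the Helm capability.
pub struct ClusterTopology;
impl Topology for ClusterTopology {
    fn name(&self) -> &str {
        "ClusterTopology"
    }
}
impl HelmCommand for ClusterTopology {
    fn helm_binary(&self) -> String {
        "helm".to_string()
    }
}

/// A topology without Helm, for example a plain host.
pub struct BareHostTopology;
impl Topology for BareHostTopology {
    fn name(&self) -> &str {
        "BareHostTopology"
    }
}

pub struct ChartScore {
    pub chart: String,
}

// The bound `T: Topology + HelmCommand` is the whole safety mechanism.
impl<T: Topology + HelmCommand> Score<T> for ChartScore {
    fn run(&self, topology: &T) -> String {
        format!("{} install {} on {}", topology.helm_binary(), self.chart, topology.name())
    }
}

pub fn deploy<T: Topology, S: Score<T>>(score: &S, topology: &T) -> String {
    score.run(topology)
}

// `deploy(&ChartScore { .. }, &BareHostTopology)` does not compile:
// the bound `BareHostTopology: HelmCommand` is not satisfied.
```

Calling `deploy` with `ClusterTopology` compiles and runs; swapping in `BareHostTopology` is rejected at build time, which is the behavior the section describes.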
## Using Your Score
Once defined, your Score integrates with the Harmony CLI:
```rust
use harmony::{
inventory::Inventory,
topology::K8sAnywhereTopology,
};
#[tokio::main]
async fn main() {
let my_score = MyScore {
namespace: "monitoring".to_string(),
host: "ntfy.example.com".to_string(),
};
harmony_cli::run(
Inventory::autoload(),
K8sAnywhereTopology::from_env(),
vec![Box::new(my_score)],
None,
)
.await
.unwrap();
}
```
## Key Patterns
### Composing Scores
Scores can include other Scores via features:
```rust
// `app` is the application value defined earlier; the Score gets its own binding
let app_score = ApplicationScore {
    features: vec![
        Box::new(PackagingDeployment { application: app.clone() }),
        Box::new(Monitoring { application: app.clone(), alert_receiver: vec![] }),
    ],
    application: app,
};
```
### Reusing Interpret Logic
Many Scores delegate to shared `Interpret` implementations. For example, `HelmChartScore` provides a reusable Interpret for any Helm-based deployment. Your Score can wrap it:
```rust
impl<T: Topology + HelmCommand> Score<T> for MyScore {
fn create_interpret(&self) -> Box<dyn Interpret<T>> {
Box::new(HelmChartInterpret { /* your config */ })
}
}
```
### Accessing Topology Capabilities
Your Interpret accesses infrastructure through Capabilities exposed by the Topology:
```rust
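// Capability methods come from Capability traits (`K8sclient`, `HelmCommand`), not from `Topology` itself
let k8s_client = topology.k8s_client().await?;
let helm = topology.get_helm_command();
// They are callable only when the Interpret declares the matching bound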
impl<T: Topology + K8sclient> Interpret<T> for MyInterpret {
async fn execute(...) {
let client = topology.k8s_client().await?;
// use client...
}
}
```
## Best Practices
- **Keep Scores focused** — one Score per concern (deployment, monitoring, networking)
- **Use `..Default::default()`** for optional fields so callers only need to specify what they care about
- **Return `Outcome`** — use `Outcome::success`, `Outcome::failure`, or `Outcome::success_with_details` to communicate results clearly
- **Handle errors gracefully** — return meaningful `InterpretError` messages that help operators debug issues
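As a small illustration of the `..Default::default()` practice, here is a sketch with a hypothetical configuration struct. `CacheScore`, its fields, and the default values are assumptions made for this example and are not part of Harmony.

```rust
/// Hypothetical Score configuration, used only for this illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct CacheScore {
    pub namespace: String,
    pub replicas: u32,
    pub storage_gib: u32,
}

impl Default for CacheScore {
    fn default() -> Self {
        Self {
            namespace: "default".to_string(),
            replicas: 1,
            storage_gib: 1,
        }
    }
}

/// Callers override only the fields they care about.
pub fn staging_cache() -> CacheScore {
    CacheScore {
        namespace: "staging".to_string(),
        ..Default::default()
    }
}
```

Adding a new optional field later only requires updating `Default`, so existing callers keep compiling.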


@@ -0,0 +1,176 @@
# Writing a Topology
A `Topology` models your infrastructure environment and exposes `Capability` traits that Scores use to interact with it. Where a Score declares _what_ you want, a Topology exposes _what_ it can do.
## The Minimum Implementation
At minimum, a Topology needs:
```rust
use async_trait::async_trait;
use harmony::{
topology::{PreparationError, PreparationOutcome, Topology},
};
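#[derive(Debug, Clone)]
pub struct MyTopology {
    pub name: String,
    // Path to a kubeconfig file; read by the capability implementations in this guide
    pub kubeconfig: Option<String>,
}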
#[async_trait]
impl Topology for MyTopology {
fn name(&self) -> &str {
"MyTopology"
}
async fn ensure_ready(&self) -> Result<PreparationOutcome, PreparationError> {
// Verify the infrastructure is accessible and ready
Ok(PreparationOutcome::Success { details: "ready".to_string() })
}
}
```
## Implementing Capabilities
Scores express dependencies on Capabilities through trait bounds. For example, if your Topology should support Scores that deploy Helm charts, implement `HelmCommand`:
```rust
use std::process::Command;
use harmony::topology::HelmCommand;
impl HelmCommand for MyTopology {
fn get_helm_command(&self) -> Command {
let mut cmd = Command::new("helm");
if let Some(kubeconfig) = &self.kubeconfig {
cmd.arg("--kubeconfig").arg(kubeconfig);
}
cmd
}
}
```
For Scores that need a Kubernetes client, implement `K8sclient`:
```rust
use std::sync::Arc;
use harmony_k8s::K8sClient;
use harmony::topology::K8sclient;
#[async_trait]
impl K8sclient for MyTopology {
async fn k8s_client(&self) -> Result<Arc<K8sClient>, String> {
let client = if let Some(kubeconfig) = &self.kubeconfig {
K8sClient::from_kubeconfig(kubeconfig).await?
} else {
K8sClient::try_default().await?
};
Ok(Arc::new(client))
}
}
```
## Loading Topology from Environment
For flexibility, implement `from_env()` to read configuration from environment variables:
```rust
impl MyTopology {
pub fn from_env() -> Self {
Self {
name: std::env::var("MY_TOPOLOGY_NAME")
.unwrap_or_else(|_| "default".to_string()),
kubeconfig: std::env::var("KUBECONFIG").ok(),
}
}
}
```
This pattern lets operators switch between environments without recompiling:
```bash
export KUBECONFIG=/path/to/prod-cluster.kubeconfig
cargo run --example my_example
```
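Because environment variables are process-wide, tests that set them can interfere with each other when run in parallel. One possible way around this, sketched below, is to route `from_env()` through a constructor that accepts any key lookup. The `from_lookup` helper is a hypothetical addition for this example, and the struct is a stand-in for the `MyTopology` used in this guide.

```rust
/// Stand-in for the `MyTopology` struct used in this guide.
#[derive(Debug, Clone, PartialEq)]
pub struct MyTopology {
    pub name: String,
    pub kubeconfig: Option<String>,
}

impl MyTopology {
    /// Build from any key -> value lookup, so tests never touch the real environment.
    pub fn from_lookup<F>(lookup: F) -> Self
    where
        F: Fn(&str) -> Option<String>,
    {
        Self {
            name: lookup("MY_TOPOLOGY_NAME").unwrap_or_else(|| "default".to_string()),
            kubeconfig: lookup("KUBECONFIG"),
        }
    }

    /// Production entry point: delegate to the process environment.
    pub fn from_env() -> Self {
        Self::from_lookup(|key| std::env::var(key).ok())
    }
}
```

A test can then pass a closure that returns fixed values, while production code keeps calling `from_env()` unchanged.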
## Complete Example: K8sAnywhereTopology
The `K8sAnywhereTopology` is the most commonly used Topology and handles both local (K3D) and remote Kubernetes clusters:
```rust
pub struct K8sAnywhereTopology {
pub k8s_state: Arc<OnceCell<K8sState>>,
pub tenant_manager: Arc<OnceCell<TenantManager>>,
pub config: Arc<K8sAnywhereConfig>,
}
#[async_trait]
impl Topology for K8sAnywhereTopology {
fn name(&self) -> &str {
"K8sAnywhereTopology"
}
async fn ensure_ready(&self) -> Result<PreparationOutcome, PreparationError> {
// 1. If autoinstall is enabled and no cluster exists, provision K3D
// 2. Verify kubectl connectivity
// 3. Optionally wait for cluster operators to be ready
Ok(PreparationOutcome::Success { details: "cluster ready".to_string() })
}
}
```
## Key Patterns
### Lazy Initialization
Use `OnceCell` for expensive resources like Kubernetes clients:
```rust
pub struct K8sAnywhereTopology {
k8s_state: Arc<OnceCell<K8sState>>,
}
```
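The idea behind lazy initialization can be shown with the standard library alone. This sketch uses `std::sync::OnceLock` as a synchronous stand-in for whichever `OnceCell` the codebase uses; `Client`, `LazyTopology`, the endpoint value, and the connection counter are hypothetical and exist only to make the behavior observable.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Stand-in for an expensive resource such as a Kubernetes client.
#[derive(Debug)]
pub struct Client {
    pub endpoint: String,
}

pub struct LazyTopology {
    client: OnceLock<Client>,
    /// Counts how many times the client was actually constructed.
    connects: AtomicUsize,
}

impl LazyTopology {
    pub fn new() -> Self {
        Self {
            client: OnceLock::new(),
            connects: AtomicUsize::new(0),
        }
    }

    /// Build the client on first use; later calls return the cached value.
    pub fn client(&self) -> &Client {
        self.client.get_or_init(|| {
            self.connects.fetch_add(1, Ordering::SeqCst);
            Client {
                endpoint: "https://127.0.0.1:6443".to_string(),
            }
        })
    }

    pub fn connect_count(&self) -> usize {
        self.connects.load(Ordering::SeqCst)
    }
}
```

Nothing is built when the topology is created, and repeated calls to `client()` reuse the first value.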
### Multi-Target Topologies
For Scores that span multiple clusters (like NATS supercluster), implement `MultiTargetTopology`:
```rust
pub trait MultiTargetTopology: Topology {
fn current_target(&self) -> &str;
fn set_target(&mut self, target: &str);
}
```
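To make the trait concrete, here is a minimal sketch of an implementation. It uses a simplified stand-in for `Topology`, and `SuperclusterTopology` with its target names is hypothetical. Ignoring unknown targets is a simplification made for this example.

```rust
/// Simplified stand-in for Harmony's `Topology` trait.
pub trait Topology {
    fn name(&self) -> &str;
}

/// Same shape as the trait shown in this guide.
pub trait MultiTargetTopology: Topology {
    fn current_target(&self) -> &str;
    fn set_target(&mut self, target: &str);
}

/// Hypothetical topology that tracks which of several clusters is active.
pub struct SuperclusterTopology {
    targets: Vec<String>,
    active: usize,
}

impl SuperclusterTopology {
    pub fn new(targets: &[&str]) -> Self {
        assert!(!targets.is_empty(), "at least one target is required");
        Self {
            targets: targets.iter().map(|t| t.to_string()).collect(),
            active: 0,
        }
    }
}

impl Topology for SuperclusterTopology {
    fn name(&self) -> &str {
        "SuperclusterTopology"
    }
}

impl MultiTargetTopology for SuperclusterTopology {
    fn current_target(&self) -> &str {
        &self.targets[self.active]
    }

    fn set_target(&mut self, target: &str) {
        // Unknown targets are ignored in this sketch; a real implementation
        // would more likely report an error.
        if let Some(index) = self.targets.iter().position(|t| t.as_str() == target) {
            self.active = index;
        }
    }
}
```

A Score that spans clusters could then call `set_target` before each per-cluster step and read `current_target` when reporting progress.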
### Composing Topologies
Complex topologies combine multiple infrastructure components:
```rust
pub struct HAClusterTopology {
pub router: Arc<dyn Router>,
pub load_balancer: Arc<dyn LoadBalancer>,
pub firewall: Arc<dyn Firewall>,
pub dhcp_server: Arc<dyn DhcpServer>,
pub dns_server: Arc<dyn DnsServer>,
pub kubeconfig: Option<String>,
// ...
}
```
## Testing Your Topology
Test Topologies in isolation by implementing them against mock infrastructure:
```rust
#[cfg(test)]
mod tests {
use super::*;
#[tokio::test]
async fn test_topology_ensure_ready() {
let topo = MyTopology::from_env();
let result = topo.ensure_ready().await;
assert!(result.is_ok());
}
}
```


@@ -1,443 +0,0 @@
# Monitoring and Alerting in Harmony
Harmony provides a unified, type-safe approach to monitoring and alerting across Kubernetes, OpenShift, and bare-metal infrastructure. This guide explains the architecture and how to use it at different levels of abstraction.
## Overview
Harmony's monitoring module supports three distinct use cases:
| Level | Who Uses It | What It Provides |
|-------|-------------|------------------|
| **Cluster** | Cluster administrators | Full control over monitoring stack, cluster-wide alerts, external scrape targets |
| **Tenant** | Platform teams | Namespace-scoped monitoring in multi-tenant environments |
| **Application** | Application developers | Zero-config monitoring that "just works" |
Each level builds on the same underlying abstractions, ensuring consistency while providing appropriate complexity for each audience.
## Core Concepts
### AlertSender
An `AlertSender` represents the system that evaluates alert rules and sends notifications. Harmony supports multiple monitoring stacks:
| Sender | Description | Use When |
|--------|-------------|----------|
| `OpenshiftClusterAlertSender` | OKD/OpenShift built-in monitoring | Running on OKD/OpenShift |
| `KubePrometheus` | kube-prometheus-stack via Helm | Standard Kubernetes, need full stack |
| `Prometheus` | Standalone Prometheus | Custom Prometheus deployment |
| `RedHatClusterObservability` | RHOB operator | Red Hat managed clusters |
| `Grafana` | Grafana-managed alerting | Grafana as primary alerting layer |
### AlertReceiver
An `AlertReceiver` defines where alerts are sent (Discord, Slack, email, webhook, etc.). Receivers are parameterized by sender type because each monitoring stack has different configuration formats.
```rust
pub trait AlertReceiver<S: AlertSender> {
fn build(&self) -> Result<ReceiverInstallPlan, InterpretError>;
fn name(&self) -> String;
}
```
Built-in receivers:
- `DiscordReceiver` - Discord webhooks
- `WebhookReceiver` - Generic HTTP webhooks
### AlertRule
An `AlertRule` defines a Prometheus alert expression. Rules are also parameterized by sender to handle different CRD formats.
```rust
pub trait AlertRule<S: AlertSender> {
fn build_rule(&self) -> Result<serde_json::Value, InterpretError>;
fn name(&self) -> String;
}
```
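Parameterizing a trait by sender means a single receiver type can carry one implementation per monitoring stack. The sketch below shows the pattern with simplified stand-ins: the marker types reuse names from this guide, but the `render` method, the `install` helper, and the output strings are placeholders invented for illustration and do not reflect the real configuration formats.

```rust
/// Marker types standing in for two monitoring stacks.
pub struct KubePrometheus;
pub struct OpenshiftClusterAlertSender;

/// Simplified receiver trait: one implementation per supported sender.
pub trait AlertReceiver<S> {
    fn render(&self) -> String;
}

pub struct DiscordReceiver {
    pub name: String,
    pub url: String,
}

impl AlertReceiver<KubePrometheus> for DiscordReceiver {
    fn render(&self) -> String {
        // Placeholder output, not the real kube-prometheus format.
        format!("kube-prometheus receiver {} -> {}", self.name, self.url)
    }
}

impl AlertReceiver<OpenshiftClusterAlertSender> for DiscordReceiver {
    fn render(&self) -> String {
        // Placeholder output, not the real OpenShift format.
        format!("openshift receiver {} -> {}", self.name, self.url)
    }
}

/// A sender-specific installer only accepts receivers built for that sender.
pub fn install<S>(receivers: &[Box<dyn AlertReceiver<S>>]) -> Vec<String> {
    receivers.iter().map(|r| r.render()).collect()
}
```

A receiver type that lacks an implementation for a given sender cannot be boxed into that sender's list, so the mismatch surfaces at compile time.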
### Observability Capability
Topologies implement `Observability<S>` to indicate they support a specific alert sender:
```rust
impl Observability<OpenshiftClusterAlertSender> for K8sAnywhereTopology {
async fn install_receivers(&self, sender, inventory, receivers) { ... }
async fn install_rules(&self, sender, inventory, rules) { ... }
// ...
}
```
This provides **compile-time verification**: if you try to use `OpenshiftClusterAlertScore` with a topology that doesn't implement `Observability<OpenshiftClusterAlertSender>`, the code won't compile.
---
## Level 1: Cluster Monitoring
Cluster monitoring is for administrators who need full control over the monitoring infrastructure. This includes:
- Installing/managing the monitoring stack
- Configuring cluster-wide alert receivers
- Defining cluster-level alert rules
- Adding external scrape targets (e.g., bare-metal servers, firewalls)
### Example: OKD Cluster Alerts
```rust
use harmony::{
modules::monitoring::{
alert_channel::discord_alert_channel::DiscordReceiver,
alert_rule::{alerts::k8s::pvc::high_pvc_fill_rate_over_two_days, prometheus_alert_rule::AlertManagerRuleGroup},
okd::openshift_cluster_alerting_score::OpenshiftClusterAlertScore,
scrape_target::prometheus_node_exporter::PrometheusNodeExporter,
},
topology::{K8sAnywhereTopology, monitoring::{AlertMatcher, AlertRoute, MatchOp}},
};
let severity_matcher = AlertMatcher {
label: "severity".to_string(),
operator: MatchOp::Eq,
value: "critical".to_string(),
};
let rule_group = AlertManagerRuleGroup::new(
"cluster-rules",
vec![high_pvc_fill_rate_over_two_days()],
);
let external_exporter = PrometheusNodeExporter {
job_name: "firewall".to_string(),
metrics_path: "/metrics".to_string(),
listen_address: ip!("192.168.1.1"),
port: 9100,
..Default::default()
};
harmony_cli::run(
Inventory::autoload(),
K8sAnywhereTopology::from_env(),
vec![Box::new(OpenshiftClusterAlertScore {
sender: OpenshiftClusterAlertSender,
receivers: vec![Box::new(DiscordReceiver {
name: "critical-alerts".to_string(),
url: hurl!("https://discord.com/api/webhooks/..."),
route: AlertRoute {
matchers: vec![severity_matcher],
..AlertRoute::default("critical-alerts".to_string())
},
})],
rules: vec![Box::new(rule_group)],
scrape_targets: Some(vec![Box::new(external_exporter)]),
})],
None,
).await?;
```
### What This Does
1. **Enables cluster monitoring** - Activates OKD's built-in Prometheus
2. **Enables user workload monitoring** - Allows namespace-scoped rules
3. **Configures Alertmanager** - Adds Discord receiver with route matching
4. **Deploys alert rules** - Creates `AlertingRule` CRD with PVC fill rate alert
5. **Adds external scrape target** - Configures Prometheus to scrape the firewall
### Compile-Time Safety
The `OpenshiftClusterAlertScore` requires:
```rust
impl<T: Topology + Observability<OpenshiftClusterAlertSender>> Score<T>
for OpenshiftClusterAlertScore
```
If `K8sAnywhereTopology` didn't implement `Observability<OpenshiftClusterAlertSender>`, this code would fail to compile. You cannot accidentally deploy OKD alerts to a cluster that doesn't support them.
---
## Level 2: Tenant Monitoring
In multi-tenant clusters, teams are often confined to specific namespaces. Tenant monitoring adapts to this constraint:
- Resources are deployed in the tenant's namespace
- Cannot modify cluster-level monitoring configuration
- The topology determines namespace context at runtime
### How It Works
The topology's `Observability` implementation handles tenant scoping:
```rust
impl Observability<KubePrometheus> for K8sAnywhereTopology {
async fn install_rules(&self, sender, inventory, rules) {
// Topology knows if it's tenant-scoped
let namespace = self.get_tenant_config().await
.map(|t| t.name)
.unwrap_or_else(|| "monitoring".to_string());
// Rules are installed in the appropriate namespace
for rule in rules.unwrap_or_default() {
let score = KubePrometheusRuleScore {
sender: sender.clone(),
rule,
namespace: namespace.clone(), // Tenant namespace
};
score.create_interpret().execute(inventory, self).await?;
}
}
}
```
### Tenant vs Cluster Resources
| Resource | Cluster-Level | Tenant-Level |
|----------|---------------|--------------|
| Alertmanager config | Global receivers | Namespaced receivers (where supported) |
| PrometheusRules | Cluster-wide alerts | Namespace alerts only |
| ServiceMonitors | Any namespace | Own namespace only |
| External scrape targets | Can add | Cannot add (cluster config) |
### Runtime Validation
Tenant constraints are validated at runtime via Kubernetes RBAC. If a tenant-scoped deployment attempts cluster-level operations, it fails with a clear permission error from the Kubernetes API.
This cannot be fully compile-time because tenant context is determined by who's running the code and what permissions they have—information only available at runtime.
---
## Level 3: Application Monitoring
Application monitoring provides zero-config, opinionated monitoring for developers. Just add the `Monitoring` feature to your application and it works.
### Example
```rust
use std::sync::Arc;
use harmony::modules::{
    application::{Application, ApplicationFeature},
    monitoring::alert_channel::webhook_receiver::WebhookReceiver,
};
// Define your application; the Arc lets the feature and the caller share it
let my_app = Arc::new(MyApplication::new());
// Add monitoring as a feature
let monitoring = Monitoring {
    application: my_app.clone(),
    alert_receiver: vec![], // Uses defaults
};
// Install with the application
my_app.add_feature(monitoring);
```
### What Application Monitoring Provides
1. **Automatic ServiceMonitor** - Creates a ServiceMonitor for your application's pods
2. **Ntfy Notification Channel** - Auto-installs and configures Ntfy for push notifications
3. **Tenant Awareness** - Automatically scopes to the correct namespace
4. **Sensible Defaults** - Pre-configured alert routes and receivers
### Under the Hood
```rust
impl<T: Topology + Observability<Prometheus> + TenantManager>
ApplicationFeature<T> for Monitoring
{
async fn ensure_installed(&self, topology: &T) -> Result<...> {
// 1. Get tenant namespace (or use app name)
let namespace = topology.get_tenant_config().await
.map(|ns| ns.name.clone())
.unwrap_or_else(|| self.application.name());
// 2. Create ServiceMonitor for the app
let app_service_monitor = ServiceMonitor {
metadata: ObjectMeta {
name: Some(self.application.name()),
namespace: Some(namespace.clone()),
..Default::default()
},
spec: ServiceMonitorSpec::default(),
};
// 3. Install Ntfy for notifications
let ntfy = NtfyScore { namespace, host };
ntfy.interpret(&Inventory::empty(), topology).await?;
// 4. Wire up webhook receiver to Ntfy
let ntfy_receiver = WebhookReceiver { ... };
// 5. Execute monitoring score
alerting_score.interpret(&Inventory::empty(), topology).await?;
}
}
```
---
## Pre-Built Alert Rules
Harmony provides a library of common alert rules in `modules/monitoring/alert_rule/alerts/`:
### Kubernetes Alerts (`alerts/k8s/`)
```rust
use harmony::modules::monitoring::alert_rule::alerts::k8s::{
pod::pod_failed,
pvc::high_pvc_fill_rate_over_two_days,
memory_usage::alert_high_memory_usage,
};
let rules = AlertManagerRuleGroup::new("k8s-rules", vec![
pod_failed(),
high_pvc_fill_rate_over_two_days(),
alert_high_memory_usage(),
]);
```
Available rules:
- `pod_failed()` - Pod in failed state
- `alert_container_restarting()` - Container restart loop
- `alert_pod_not_ready()` - Pod not ready for extended period
- `high_pvc_fill_rate_over_two_days()` - PVC will fill within 2 days
- `alert_high_memory_usage()` - Memory usage above threshold
- `alert_high_cpu_usage()` - CPU usage above threshold
### Infrastructure Alerts (`alerts/infra/`)
```rust
use harmony::modules::monitoring::alert_rule::alerts::infra::opnsense::high_http_error_rate;
let rules = AlertManagerRuleGroup::new("infra-rules", vec![
high_http_error_rate(),
]);
```
### Creating Custom Rules
```rust
use harmony::modules::monitoring::alert_rule::prometheus_alert_rule::PrometheusAlertRule;
pub fn my_custom_alert() -> PrometheusAlertRule {
PrometheusAlertRule::new("MyServiceDown", "up{job=\"my-service\"} == 0")
.for_duration("5m")
.label("severity", "critical")
.annotation("summary", "My service is down")
.annotation("description", "The my-service job has been down for more than 5 minutes")
}
```
---
## Alert Receivers
### Discord Webhook
```rust
use harmony::modules::monitoring::alert_channel::discord_alert_channel::DiscordReceiver;
use harmony::topology::monitoring::{AlertRoute, AlertMatcher, MatchOp};
let discord = DiscordReceiver {
name: "ops-alerts".to_string(),
url: hurl!("https://discord.com/api/webhooks/123456/abcdef"),
route: AlertRoute {
receiver: "ops-alerts".to_string(),
matchers: vec![AlertMatcher {
label: "severity".to_string(),
operator: MatchOp::Eq,
value: "critical".to_string(),
}],
group_by: vec!["alertname".to_string()],
repeat_interval: Some("30m".to_string()),
continue_matching: false,
children: vec![],
},
};
```
### Generic Webhook
```rust
use harmony::modules::monitoring::alert_channel::webhook_receiver::WebhookReceiver;
let webhook = WebhookReceiver {
name: "custom-webhook".to_string(),
url: hurl!("https://api.example.com/alerts"),
route: AlertRoute::default("custom-webhook".to_string()),
};
```
---
## Adding a New Monitoring Stack
To add support for a new monitoring stack:
1. **Create the sender type** in `modules/monitoring/my_sender/mod.rs`:
```rust
#[derive(Debug, Clone)]
pub struct MySender;
impl AlertSender for MySender {
fn name(&self) -> String { "MySender".to_string() }
}
```
2. **Define CRD types** in `modules/monitoring/my_sender/crd/`:
```rust
#[derive(CustomResource, Debug, Serialize, Deserialize, Clone)]
#[kube(group = "monitoring.example.com", version = "v1", kind = "MyAlertRule")]
pub struct MyAlertRuleSpec { ... }
```
3. **Implement Observability** in `domain/topology/k8s_anywhere/observability/my_sender.rs`:
```rust
impl Observability<MySender> for K8sAnywhereTopology {
async fn install_receivers(&self, sender, inventory, receivers) { ... }
async fn install_rules(&self, sender, inventory, rules) { ... }
// ...
}
```
4. **Implement receiver conversions** for existing receivers:
```rust
impl AlertReceiver<MySender> for DiscordReceiver {
fn build(&self) -> Result<ReceiverInstallPlan, InterpretError> {
// Convert DiscordReceiver to MySender's format
}
}
```
5. **Create score types**:
```rust
pub struct MySenderAlertScore {
pub sender: MySender,
pub receivers: Vec<Box<dyn AlertReceiver<MySender>>>,
pub rules: Vec<Box<dyn AlertRule<MySender>>>,
}
```
---
## Architecture Principles
### Type Safety Over Flexibility
Each monitoring stack has distinct CRDs and configuration formats. Rather than a unified "MonitoringStack" type that loses stack-specific features, we use generic traits that provide type safety while allowing each stack to express its unique configuration.
### Compile-Time Capability Verification
The `Observability<S>` bound ensures you can't deploy OKD alerts to a KubePrometheus cluster. The compiler catches platform mismatches before deployment.
### Explicit Over Implicit
Monitoring stacks are chosen explicitly (`OpenshiftClusterAlertSender` vs `KubePrometheus`). There's no "auto-detection" that could lead to surprising behavior.
### Three Levels, One Foundation
Cluster, tenant, and application monitoring all use the same traits (`AlertSender`, `AlertReceiver`, `AlertRule`). The difference is in how scores are constructed and how topologies interpret them.
---
## Related Documentation
- [ADR-020: Monitoring and Alerting Architecture](../adr/020-monitoring-alerting-architecture.md)
- [ADR-013: Monitoring Notifications (ntfy)](../adr/013-monitoring-notifications.md)
- [ADR-011: Multi-Tenant Cluster Architecture](../adr/011-multi-tenant-cluster.md)
- [Coding Guide](coding-guide.md)
- [Core Concepts](concepts.md)

17
docs/use-cases/README.md Normal file

@@ -0,0 +1,17 @@
# Use Cases
Real-world scenarios demonstrating Harmony in action.
## Available Use Cases
### [PostgreSQL on Local K3D](./postgresql-on-local-k3d.md)
Deploy a fully functional PostgreSQL cluster on a local K3D cluster in under 10 minutes. The quickest way to see Harmony in action.
### [OKD on Bare Metal](./okd-on-bare-metal.md)
A complete walkthrough of bootstrapping a high-availability OKD cluster from physical hardware. Covers inventory discovery, bootstrap, control plane, and worker provisioning.
---
_These use cases are community-tested scenarios. For questions or contributions, open an issue on the [Harmony repository](https://git.nationtech.io/NationTech/harmony/issues)._


@@ -0,0 +1,159 @@
# Use Case: OKD on Bare Metal
Provision a production-grade OKD (OpenShift Kubernetes Distribution) cluster from physical hardware using Harmony. This use case covers the full lifecycle: hardware discovery, bootstrap, control plane, workers, and post-install validation.
## What you'll have at the end
A highly-available OKD cluster with:
- 3 control plane nodes
- 2+ worker nodes
- Network bonding configured on nodes and switches
- Load balancer routing API and ingress traffic
- DNS and DHCP services for the cluster
- Post-install health validation
## Target hardware model
This setup assumes a typical lab environment:
```
┌─────────────────────────────────────────────────────────┐
│ Network 192.168.x.0/24 (flat, DHCP + PXE capable) │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ cp0 │ │ cp1 │ │ cp2 │ (control) │
│ └──────────┘ └──────────┘ └──────────┘ │
│ ┌──────────┐ ┌──────────┐ │
│ │ wk0 │ │ wk1 │ ... (workers) │
│ └──────────┘ └──────────┘ │
│ ┌──────────┐ │
│ │ bootstrap│ (temporary, can be repurposed) │
│ └──────────┘ │
│ │
│ ┌──────────┐ ┌──────────┐ │
│ │ firewall │ │ switch │ (OPNsense + Brocade) │
│ └──────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────┘
```
## Required infrastructure
Harmony models this as an `HAClusterTopology`, which requires these capabilities:
| Capability | Implementation |
|------------|---------------|
| **Router** | OPNsense firewall |
| **Load Balancer** | OPNsense HAProxy |
| **Firewall** | OPNsense |
| **DHCP Server** | OPNsense |
| **TFTP Server** | OPNsense |
| **HTTP Server** | OPNsense |
| **DNS Server** | OPNsense |
| **Node Exporter** | Prometheus node_exporter on OPNsense |
| **Switch Client** | Brocade SNMP |
See `examples/okd_installation/` for a reference topology implementation.
## The Provisioning Pipeline
Harmony orchestrates OKD installation in ordered stages:
### Stage 1: Inventory Discovery (`OKDSetup01InventoryScore`)
Harmony boots all nodes via PXE into a CentOS Stream live environment, runs an inventory agent on each, and collects:
- MAC addresses and NIC details
- IP addresses assigned by DHCP
- Hardware profile (CPU, RAM, storage)
This is the "discovery-first" approach: no pre-configuration required on nodes.
### Stage 2: Bootstrap Node (`OKDSetup02BootstrapScore`)
The user selects one discovered node to serve as the bootstrap node. Harmony:
- Renders per-MAC iPXE boot configuration with OKD 4.19 SCOS live assets + ignition
- Reboots the bootstrap node via SSH
- Waits for the bootstrap process to complete (API server becomes available)
### Stage 3: Control Plane (`OKDSetup03ControlPlaneScore`)
With bootstrap complete, Harmony provisions the control plane nodes:
- Renders per-MAC iPXE for each control plane node
- Reboots via SSH and waits for node to join the cluster
- Applies network bond configuration via NMState MachineConfig where relevant
### Stage 4: Network Bonding (`OKDSetupPersistNetworkBondScore`)
Configures LACP bonds on nodes and the corresponding port-channels on the switch stack for high availability.
### Stage 5: Worker Nodes (`OKDSetup04WorkersScore`)
Provisions worker nodes similarly to control plane, joining them to the cluster.
### Stage 6: Sanity Check (`OKDSetup05SanityCheckScore`)
Validates:
- API server is reachable
- Ingress controller is operational
- Cluster operators are healthy
- SDN (software-defined networking) is functional
### Stage 7: Installation Report (`OKDSetup06InstallationReportScore`)
Produces a machine-readable JSON report and human-readable summary of the installation.
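The ordered stages above behave like a sequential pipeline that stops at the first failure. The sketch below illustrates that control flow only. The `Stage` variants mirror the stage names in this document, while `PIPELINE` and `run_pipeline` are hypothetical and are not the API of `OKDInstallationPipeline`.

```rust
/// Stage names mirror the Scores listed in this document; the runner is hypothetical.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Stage {
    InventoryDiscovery,
    Bootstrap,
    ControlPlane,
    NetworkBonding,
    Workers,
    SanityCheck,
    InstallationReport,
}

pub const PIPELINE: [Stage; 7] = [
    Stage::InventoryDiscovery,
    Stage::Bootstrap,
    Stage::ControlPlane,
    Stage::NetworkBonding,
    Stage::Workers,
    Stage::SanityCheck,
    Stage::InstallationReport,
];

/// Run stages in order and stop at the first failure.
/// Returns the completed stages, or the failed stage together with its error.
pub fn run_pipeline<F>(mut run_stage: F) -> Result<Vec<Stage>, (Stage, String)>
where
    F: FnMut(Stage) -> Result<(), String>,
{
    let mut completed = Vec::new();
    for stage in PIPELINE.iter().copied() {
        run_stage(stage).map_err(|e| (stage, e))?;
        completed.push(stage);
    }
    Ok(completed)
}
```

Returning the failed stage alongside the error tells an operator where to resume, since earlier stages such as discovery do not need to be repeated.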
## Network notes
**During discovery:** Ports must be in access mode (no LACP). DHCP succeeds; iPXE loads CentOS Stream live with Kickstart and starts the inventory endpoint.
**During provisioning:** After SCOS is on disk and Ignition/MachineConfig can be applied, bonds are set persistently. This avoids the PXE/DHCP recovery race condition that occurs if bonding is configured too early.
**PXE limitation:** The generic discovery path cannot use bonded networks for PXE boot because the DHCP recovery process conflicts with bond formation.
## Configuration knobs
When using `OKDInstallationPipeline`, configure these domains:
| Parameter | Example | Description |
|-----------|---------|-------------|
| `public_domain` | `apps.example.com` | Wildcard domain for application ingress |
| `internal_domain` | `cluster.local` | Internal cluster DNS domain |
## Running the example
See `examples/okd_installation/` for a complete reference. The topology must be configured with your infrastructure details:
```bash
# Configure the example with your hardware/network specifics
# See examples/okd_installation/src/topology.rs
cargo run -p example-okd_installation
```
This example requires:
- Physical hardware configured as described above
- OPNsense firewall with SSH access
- Brocade switch with SNMP access
- All nodes connected to the same Layer 2 network
## Post-install
After the cluster is bootstrapped, `~/.kube/config` is updated with the cluster credentials. Verify:
```bash
kubectl get nodes
kubectl get pods -n openshift-monitoring
oc get routes -n openshift-console
```
## Next steps
- Enable monitoring with `PrometheusAlertScore` or `OpenshiftClusterAlertScore`
- Configure TLS certificates with `CertManagerHelmScore`
- Add storage with Rook Ceph
- Scale workers with `OKDSetup04WorkersScore`
## Further reading
- [OKD Installation Module](../../harmony/src/modules/okd/installation.rs) — source of truth for pipeline stages
- [HAClusterTopology](../../harmony/src/domain/topology/ha_cluster.rs) — infrastructure capability model
- [Scores Catalog](../catalogs/scores.md) — all available Scores including OKD-specific ones


@@ -0,0 +1,115 @@
# Use Case: PostgreSQL on Local K3D
Deploy a production-grade PostgreSQL cluster on a local Kubernetes cluster (K3D) using Harmony. This is the fastest way to get started with Harmony and requires no external infrastructure.
## What you'll have at the end
A fully operational PostgreSQL cluster with:
- 1 primary instance with 1 GiB of storage
- CloudNativePG operator managing the cluster lifecycle
- Automatic failover support once more than one instance is configured (the foundation for high availability)
- Exposed as a Kubernetes Service for easy connection
## Prerequisites
- A Rust toolchain that supports the 2024 edition (Rust 1.85 or later)
- Docker running locally
- ~5 minutes
## The Score
The entire deployment is expressed in ~20 lines of Rust:
```rust
use harmony::{
    inventory::Inventory,
    modules::postgresql::{PostgreSQLScore, capability::PostgreSQLConfig},
    topology::K8sAnywhereTopology,
};

#[tokio::main]
async fn main() {
    let postgres = PostgreSQLScore {
        config: PostgreSQLConfig {
            cluster_name: "harmony-postgres-example".to_string(),
            namespace: "harmony-postgres-example".to_string(),
            ..Default::default()
        },
    };

    harmony_cli::run(
        Inventory::autoload(),
        K8sAnywhereTopology::from_env(),
        vec![Box::new(postgres)],
        None,
    )
    .await
    .unwrap();
}
```
## What Harmony does
When you run this, Harmony:
1. **Connects to K8sAnywhereTopology** — this auto-provisions a K3D cluster if none exists
2. **Installs the CloudNativePG operator** — one-time setup that enables PostgreSQL cluster management in Kubernetes
3. **Creates a PostgreSQL cluster** — Harmony translates the Score into a `Cluster` CRD and applies it
4. **Exposes the database** — creates a Kubernetes Service for the PostgreSQL primary
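Step 3 can be pictured with a small, self-contained sketch. It is not Harmony's implementation: `ClusterSpec` and `render_cluster_manifest` are hypothetical names, and only the settings this guide mentions (name, namespace, instance count, storage size) are rendered. The `postgresql.cnpg.io/v1` API group and the `Cluster` kind are the ones CloudNativePG defines.

```rust
/// Hypothetical stand-in for the settings carried by the Score's config.
struct ClusterSpec {
    cluster_name: String,
    namespace: String,
    instances: u32,
    storage_size: String,
}

/// Render a minimal CloudNativePG `Cluster` manifest as YAML text.
/// Assumption: the real translation produces a typed resource, not a string.
fn render_cluster_manifest(spec: &ClusterSpec) -> String {
    let lines = [
        "apiVersion: postgresql.cnpg.io/v1".to_string(),
        "kind: Cluster".to_string(),
        "metadata:".to_string(),
        format!("  name: {}", spec.cluster_name),
        format!("  namespace: {}", spec.namespace),
        "spec:".to_string(),
        format!("  instances: {}", spec.instances),
        "  storage:".to_string(),
        format!("    size: {}", spec.storage_size),
    ];
    lines.join("\n")
}

fn main() {
    let spec = ClusterSpec {
        cluster_name: "harmony-postgres-example".to_string(),
        namespace: "harmony-postgres-example".to_string(),
        instances: 1,
        storage_size: "1Gi".to_string(),
    };
    println!("{}", render_cluster_manifest(&spec));
}
```

Comparing this output with `kubectl get cluster -n harmony-postgres-example -o yaml` is one way to see which fields the operator fills in on its own.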
## Running it
```bash
cargo run -p example-postgresql
```
## Verifying the deployment
```bash
# Check pods
kubectl get pods -n harmony-postgres-example
# Get the password
PASSWORD=$(kubectl get secret -n harmony-postgres-example \
harmony-postgres-example-db-user \
-o jsonpath='{.data.password}' | base64 -d)
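# Connect via port-forward (runs in the background; stop it with `kill %1` when done)
kubectl port-forward -n harmony-postgres-example svc/harmony-postgres-example-rw 5432:5432 &
# psql reads the password from PGPASSWORD; the -W flag only forces a prompt and takes no argument
PGPASSWORD="$PASSWORD" psql -h localhost -p 5432 -U postgres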
```
## Customizing the deployment
The `PostgreSQLConfig` struct supports:
| Field | Default | Description |
|-------|---------|-------------|
| `cluster_name` | — | Name of the PostgreSQL cluster |
| `namespace` | — | Kubernetes namespace to deploy to |
| `instances` | `1` | Number of instances |
| `storage_size` | `1Gi` | Persistent storage size per instance |
Example with custom settings:
```rust
let postgres = PostgreSQLScore {
    config: PostgreSQLConfig {
        cluster_name: "my-prod-db".to_string(),
        namespace: "database".to_string(),
        instances: 3,
        storage_size: "10Gi".to_string().into(),
        ..Default::default()
    },
};
```
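The `..Default::default()` line is Rust's struct update syntax: fields you name are overridden, and every other field comes from the `Default` implementation. The sketch below shows the mechanism in isolation. `ExampleConfig` is a hypothetical struct that only mirrors the table above; its field types and the placeholder default for `cluster_name` are assumptions, not Harmony's definitions.

```rust
/// Hypothetical mirror of the fields listed in the table above.
#[derive(Debug, Clone, PartialEq)]
struct ExampleConfig {
    cluster_name: String,
    namespace: String,
    instances: u32,
    storage_size: String,
}

impl Default for ExampleConfig {
    fn default() -> Self {
        Self {
            // Placeholder value: the table lists no default for the name.
            cluster_name: "example".to_string(),
            namespace: "default".to_string(),
            instances: 1,
            storage_size: "1Gi".to_string(),
        }
    }
}

/// Override two fields and take the remaining ones from `Default`.
fn custom_config() -> ExampleConfig {
    ExampleConfig {
        cluster_name: "my-prod-db".to_string(),
        instances: 3,
        ..Default::default()
    }
}

fn main() {
    let cfg = custom_config();
    println!(
        "{} in namespace {}: {} instance(s), {} each",
        cfg.cluster_name, cfg.namespace, cfg.instances, cfg.storage_size
    );
}
```

Because unspecified fields silently keep their defaults, it is worth spelling out every field that matters for a production deployment instead of relying on the fallback.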
## Extending the pattern
This pattern extends to any Kubernetes-native workload:
- Add **monitoring** by including a `Monitoring` feature alongside your Score
- Add **TLS certificates** by including a `CertificateScore`
- Add **tenant isolation** by wrapping in a `TenantScore`
See [Scores Catalog](../catalogs/scores.md) for the full list.

examples/README.md Normal file

@@ -0,0 +1,127 @@
# Examples
This directory contains runnable examples demonstrating Harmony's capabilities. Each example is a self-contained program that can be run with `cargo run -p example-<name>`.
## Quick Reference
| Example | Description | Local K3D | Existing Cluster | Hardware Needed |
|---------|-------------|:---------:|:----------------:|:---------------:|
| `postgresql` | Deploy a PostgreSQL cluster | ✅ | ✅ | — |
| `ntfy` | Deploy ntfy notification server | ✅ | ✅ | — |
| `tenant` | Create a multi-tenant namespace | ✅ | ✅ | — |
| `cert_manager` | Provision TLS certificates | ✅ | ✅ | — |
| `node_health` | Check Kubernetes node health | ✅ | ✅ | — |
| `monitoring` | Deploy Prometheus alerting | ✅ | ✅ | — |
| `monitoring_with_tenant` | Monitoring + tenant isolation | ✅ | ✅ | — |
| `operatorhub_catalog` | Install OperatorHub catalog | ✅ | ✅ | — |
| `validate_ceph_cluster_health` | Verify Ceph cluster health | — | ✅ | Rook/Ceph |
| `remove_rook_osd` | Remove a Rook OSD | — | ✅ | Rook/Ceph |
| `brocade_snmp_server` | Configure Brocade switch SNMP | — | ✅ | Brocade switch |
| `opnsense_node_exporter` | Node exporter on OPNsense | — | ✅ | OPNsense firewall |
| `okd_pxe` | PXE boot configuration for OKD | — | — | ✅ |
| `okd_installation` | Full OKD bare-metal install | — | — | ✅ |
| `okd_cluster_alerts` | OKD cluster monitoring alerts | — | ✅ | OKD cluster |
| `multisite_postgres` | Multi-site PostgreSQL failover | — | ✅ | Multi-cluster |
| `nats` | Deploy NATS messaging | — | ✅ | Multi-cluster |
| `nats-supercluster` | NATS supercluster across sites | — | ✅ | Multi-cluster |
| `lamp` | LAMP stack deployment | ✅ | ✅ | — |
| `openbao` | Deploy OpenBao vault | ✅ | ✅ | — |
| `zitadel` | Deploy Zitadel identity provider | ✅ | ✅ | — |
| `try_rust_webapp` | Rust webapp with packaging | ✅ | ✅ | Submodule |
| `rust` | Rust webapp with full monitoring | ✅ | ✅ | — |
| `rhob_application_monitoring` | RHOB monitoring setup | ✅ | ✅ | — |
| `sttest` | Full OKD stack test | — | — | ✅ |
| `application_monitoring_with_tenant` | App monitoring + tenant | — | ✅ | OKD cluster |
| `kube-rs` | Direct kube-rs client usage | ✅ | ✅ | — |
| `k8s_drain_node` | Drain a Kubernetes node | ✅ | ✅ | — |
| `k8s_write_file_on_node` | Write files to K8s nodes | ✅ | ✅ | — |
| `harmony_inventory_builder` | Discover hosts via subnet scan | ✅ | — | — |
| `cli` | CLI tool with inventory discovery | ✅ | — | — |
| `tui` | Terminal UI demonstration | ✅ | — | — |
## Status Legend
| Symbol | Meaning |
|--------|---------|
| ✅ | Works out-of-the-box |
| — | Not applicable or requires specific setup |
## By Category
### Data Services
- **`postgresql`** — Deploy a PostgreSQL cluster via CloudNativePG
- **`multisite_postgres`** — Multi-site PostgreSQL with failover
- **`public_postgres`** — Public-facing PostgreSQL (⚠️ uses NationTech DNS)
### Kubernetes Utilities
- **`node_health`** — Check node health in a cluster
- **`k8s_drain_node`** — Drain and reboot a node
- **`k8s_write_file_on_node`** — Write files to nodes
- **`validate_ceph_cluster_health`** — Verify Ceph/Rook cluster health
- **`remove_rook_osd`** — Remove an OSD from Rook/Ceph
- **`kube-rs`** — Direct Kubernetes client usage demo
### Monitoring & Alerting
- **`monitoring`** — Deploy Prometheus alerting with Discord webhooks
- **`monitoring_with_tenant`** — Monitoring with tenant isolation
- **`ntfy`** — Deploy ntfy notification server
- **`okd_cluster_alerts`** — OKD-specific cluster alerts
### Application Deployment
- **`try_rust_webapp`** — Deploy a Rust webapp with packaging (⚠️ requires `tryrust.org` submodule)
- **`rust`** — Rust webapp with full monitoring features
- **`rhob_application_monitoring`** — Red Hat Observability Stack monitoring
- **`lamp`** — LAMP stack deployment (⚠️ uses NationTech DNS)
- **`application_monitoring_with_tenant`** — App monitoring with tenant isolation
### Infrastructure & Bare Metal
- **`okd_installation`** — Full OKD cluster from scratch
- **`okd_pxe`** — PXE boot configuration for OKD
- **`sttest`** — Full OKD stack test with specific hardware
- **`brocade_snmp_server`** — Configure Brocade switch via SNMP
- **`opnsense_node_exporter`** — Node exporter on OPNsense firewall
### Multi-Cluster
- **`nats`** — NATS deployment on a cluster
- **`nats-supercluster`** — NATS supercluster across multiple sites
- **`multisite_postgres`** — PostgreSQL with multi-site failover
### Identity & Secrets
- **`openbao`** — Deploy OpenBao vault (⚠️ uses NationTech DNS)
- **`zitadel`** — Deploy Zitadel identity provider (⚠️ uses NationTech DNS)
### Cluster Services
- **`cert_manager`** — Provision TLS certificates
- **`tenant`** — Create a multi-tenant namespace
- **`operatorhub_catalog`** — Install OperatorHub catalog sources
### Development & Testing
- **`cli`** — CLI tool with inventory discovery
- **`tui`** — Terminal UI demonstration
- **`harmony_inventory_builder`** — Host discovery via subnet scan
## Running Examples
```bash
# Optional: pre-build the workspace (release profile)
cargo build --release
# Run any example (add --release to reuse the release build)
cargo run -p example-postgresql
cargo run -p example-ntfy
cargo run -p example-tenant
```
For examples that need an existing Kubernetes cluster:
```bash
export KUBECONFIG=/path/to/your/kubeconfig
export HARMONY_USE_LOCAL_K3D=false
export HARMONY_AUTOINSTALL=false
cargo run -p example-monitoring
```
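How Harmony itself reads `HARMONY_USE_LOCAL_K3D` and `HARMONY_AUTOINSTALL` is not shown in this README. As a rough illustration of boolean environment flags in general, here is a minimal sketch; `parse_flag` and `env_flag` are hypothetical helpers, and the accepted spellings and the `true` defaults are assumptions.

```rust
use std::env;

/// Interpret an optional raw value as a boolean flag, falling back to `default`
/// when the value is missing or not recognized.
fn parse_flag(raw: Option<&str>, default: bool) -> bool {
    match raw.map(|v| v.trim().to_ascii_lowercase()) {
        Some(v) if v == "true" || v == "1" => true,
        Some(v) if v == "false" || v == "0" => false,
        _ => default,
    }
}

/// Read a flag from the process environment.
fn env_flag(name: &str, default: bool) -> bool {
    parse_flag(env::var(name).ok().as_deref(), default)
}

fn main() {
    // Assumption for this sketch only: both flags default to `true` when unset.
    let use_local_k3d = env_flag("HARMONY_USE_LOCAL_K3D", true);
    let autoinstall = env_flag("HARMONY_AUTOINSTALL", true);
    println!("use_local_k3d={use_local_k3d} autoinstall={autoinstall}");
}
```

Keeping the parsing in a pure function such as `parse_flag` makes it testable without touching the process environment.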
## Notes on Private Infrastructure
Some examples use NationTech-hosted infrastructure by default (DNS domains like `*.nationtech.io`, `*.harmony.mcd`). These are not suitable for public use without modification. See the [Getting Started Guide](../docs/guides/getting-started.md) for the recommended public examples.


@@ -7,7 +7,7 @@ use harmony::{
monitoring::alert_channel::webhook_receiver::WebhookReceiver,
tenant::TenantScore,
},
topology::{K8sAnywhereTopology, monitoring::AlertRoute, tenant::TenantConfig},
topology::{K8sAnywhereTopology, tenant::TenantConfig},
};
use harmony_types::id::Id;
use harmony_types::net::Url;
@@ -33,14 +33,9 @@ async fn main() {
service_port: 3000,
});
let receiver_name = "sample-webhook-receiver".to_string();
let webhook_receiver = WebhookReceiver {
name: receiver_name.clone(),
name: "sample-webhook-receiver".to_string(),
url: Url::Url(url::Url::parse("https://webhook-doesnt-exist.com").unwrap()),
route: AlertRoute {
..AlertRoute::default(receiver_name)
},
};
let app = ApplicationScore {


@@ -1,45 +1,37 @@
use std::{
collections::HashMap,
sync::{Arc, Mutex},
};
use std::collections::HashMap;
use harmony::{
inventory::Inventory,
modules::monitoring::{
alert_channel::discord_alert_channel::DiscordReceiver,
alert_rule::{
alerts::{
infra::dell_server::{
alert_global_storage_status_critical,
alert_global_storage_status_non_recoverable,
global_storage_status_degraded_non_critical,
modules::{
monitoring::{
alert_channel::discord_alert_channel::DiscordWebhook,
alert_rule::prometheus_alert_rule::AlertManagerRuleGroup,
kube_prometheus::{
helm_prometheus_alert_score::HelmPrometheusAlertingScore,
types::{
HTTPScheme, MatchExpression, Operator, Selector, ServiceMonitor,
ServiceMonitorEndpoint,
},
k8s::pvc::high_pvc_fill_rate_over_two_days,
},
prometheus_alert_rule::AlertManagerRuleGroup,
},
kube_prometheus::{
helm::config::KubePrometheusConfig,
kube_prometheus_alerting_score::KubePrometheusAlertingScore,
types::{
HTTPScheme, MatchExpression, Operator, Selector, ServiceMonitor,
ServiceMonitorEndpoint,
prometheus::alerts::{
infra::dell_server::{
alert_global_storage_status_critical, alert_global_storage_status_non_recoverable,
global_storage_status_degraded_non_critical,
},
k8s::pvc::high_pvc_fill_rate_over_two_days,
},
},
topology::{K8sAnywhereTopology, monitoring::AlertRoute},
topology::K8sAnywhereTopology,
};
use harmony_types::{k8s_name::K8sName, net::Url};
#[tokio::main]
async fn main() {
let receiver_name = "test-discord".to_string();
let discord_receiver = DiscordReceiver {
name: receiver_name.clone(),
let discord_receiver = DiscordWebhook {
name: K8sName("test-discord".to_string()),
url: Url::Url(url::Url::parse("https://discord.doesnt.exist.com").unwrap()),
route: AlertRoute {
..AlertRoute::default(receiver_name)
},
selectors: vec![],
};
let high_pvc_fill_rate_over_two_days_alert = high_pvc_fill_rate_over_two_days();
@@ -78,15 +70,10 @@ async fn main() {
endpoints: vec![service_monitor_endpoint],
..Default::default()
};
let config = Arc::new(Mutex::new(KubePrometheusConfig::new()));
let alerting_score = KubePrometheusAlertingScore {
let alerting_score = HelmPrometheusAlertingScore {
receivers: vec![Box::new(discord_receiver)],
rules: vec![Box::new(additional_rules), Box::new(additional_rules2)],
service_monitors: vec![service_monitor],
scrape_targets: None,
config,
};
harmony_cli::run(


@@ -1,32 +1,24 @@
use std::{
collections::HashMap,
str::FromStr,
sync::{Arc, Mutex},
};
use std::{collections::HashMap, str::FromStr};
use harmony::{
inventory::Inventory,
modules::{
monitoring::{
alert_channel::discord_alert_channel::DiscordReceiver,
alert_rule::{
alerts::k8s::pvc::high_pvc_fill_rate_over_two_days,
prometheus_alert_rule::AlertManagerRuleGroup,
},
alert_channel::discord_alert_channel::DiscordWebhook,
alert_rule::prometheus_alert_rule::AlertManagerRuleGroup,
kube_prometheus::{
helm::config::KubePrometheusConfig,
kube_prometheus_alerting_score::KubePrometheusAlertingScore,
helm_prometheus_alert_score::HelmPrometheusAlertingScore,
types::{
HTTPScheme, MatchExpression, Operator, Selector, ServiceMonitor,
ServiceMonitorEndpoint,
},
},
},
prometheus::alerts::k8s::pvc::high_pvc_fill_rate_over_two_days,
tenant::TenantScore,
},
topology::{
K8sAnywhereTopology,
monitoring::AlertRoute,
tenant::{ResourceLimits, TenantConfig, TenantNetworkPolicy},
},
};
@@ -50,13 +42,10 @@ async fn main() {
},
};
let receiver_name = "test-discord".to_string();
let discord_receiver = DiscordReceiver {
name: receiver_name.clone(),
let discord_receiver = DiscordWebhook {
name: K8sName("test-discord".to_string()),
url: Url::Url(url::Url::parse("https://discord.doesnt.exist.com").unwrap()),
route: AlertRoute {
..AlertRoute::default(receiver_name)
},
selectors: vec![],
};
let high_pvc_fill_rate_over_two_days_alert = high_pvc_fill_rate_over_two_days();
@@ -85,14 +74,10 @@ async fn main() {
..Default::default()
};
let config = Arc::new(Mutex::new(KubePrometheusConfig::new()));
let alerting_score = KubePrometheusAlertingScore {
let alerting_score = HelmPrometheusAlertingScore {
receivers: vec![Box::new(discord_receiver)],
rules: vec![Box::new(additional_rules)],
service_monitors: vec![service_monitor],
scrape_targets: None,
config,
};
harmony_cli::run(


@@ -14,7 +14,6 @@ async fn main() {
..Default::default() // Use harmony defaults, they are based on CNPG's default values :
// "default" namespace, 1 instance, 1Gi storage
},
hostname: "postgrestest.sto1.nationtech.io".to_string(),
};
harmony_cli::run(


@@ -1,72 +1,36 @@
use std::collections::HashMap;
use harmony::{
inventory::Inventory,
modules::monitoring::{
alert_channel::discord_alert_channel::DiscordReceiver,
alert_rule::{
alerts::infra::opnsense::high_http_error_rate,
prometheus_alert_rule::AlertManagerRuleGroup,
},
cluster_alerting::ClusterAlertingScore,
scrape_target::prometheus_node_exporter::PrometheusNodeExporter,
},
topology::{
K8sAnywhereTopology,
monitoring::{AlertMatcher, AlertRoute, MatchOp},
alert_channel::discord_alert_channel::DiscordWebhook,
okd::cluster_monitoring::OpenshiftClusterAlertScore,
},
topology::K8sAnywhereTopology,
};
use harmony_macros::{hurl, ip};
use harmony_macros::hurl;
use harmony_types::k8s_name::K8sName;
#[tokio::main]
async fn main() {
let critical_receiver = DiscordReceiver {
name: "critical-alerts".to_string(),
url: hurl!("https://discord.example.com/webhook/critical"),
route: AlertRoute {
matchers: vec![AlertMatcher {
label: "severity".to_string(),
operator: MatchOp::Eq,
value: "critical".to_string(),
}],
..AlertRoute::default("critical-alerts".to_string())
},
};
let warning_receiver = DiscordReceiver {
name: "warning-alerts".to_string(),
url: hurl!("https://discord.example.com/webhook/warning"),
route: AlertRoute {
matchers: vec![AlertMatcher {
label: "severity".to_string(),
operator: MatchOp::Eq,
value: "warning".to_string(),
}],
repeat_interval: Some("30m".to_string()),
..AlertRoute::default("warning-alerts".to_string())
},
};
let additional_rules =
AlertManagerRuleGroup::new("infra-alerts", vec![high_http_error_rate()]);
let firewall_scraper = PrometheusNodeExporter {
job_name: "firewall".to_string(),
metrics_path: "/metrics".to_string(),
listen_address: ip!("192.168.1.1"),
port: 9100,
..Default::default()
};
let alerting_score = ClusterAlertingScore::new()
.critical_receiver(Box::new(critical_receiver))
.warning_receiver(Box::new(warning_receiver))
.additional_rule(Box::new(additional_rules))
.scrape_target(Box::new(firewall_scraper));
let mut sel = HashMap::new();
sel.insert(
"openshift_io_alert_source".to_string(),
"platform".to_string(),
);
let mut sel2 = HashMap::new();
sel2.insert("openshift_io_alert_source".to_string(), "".to_string());
let selectors = vec![sel, sel2];
harmony_cli::run(
Inventory::autoload(),
K8sAnywhereTopology::from_env(),
vec![Box::new(alerting_score)],
vec![Box::new(OpenshiftClusterAlertScore {
receivers: vec![Box::new(DiscordWebhook {
name: K8sName("wills-discord-webhook-example".to_string()),
url: hurl!("https://something.io"),
selectors: selectors,
})],
})],
None,
)
.await


@@ -6,7 +6,10 @@ use harmony::{
data::{FileContent, FilePath},
modules::{
inventory::HarmonyDiscoveryStrategy,
okd::{installation::OKDInstallationPipeline, ipxe::OKDIpxeScore},
okd::{
installation::OKDInstallationPipeline, ipxe::OKDIpxeScore,
load_balancer::OKDLoadBalancerScore,
},
},
score::Score,
topology::HAClusterTopology,
@@ -32,6 +35,7 @@ async fn main() {
scores
.append(&mut OKDInstallationPipeline::get_all_scores(HarmonyDiscoveryStrategy::MDNS).await);
scores.push(Box::new(OKDLoadBalancerScore::new(&topology)));
harmony_cli::run(inventory, topology, scores, None)
.await
.unwrap();


@@ -15,7 +15,6 @@ async fn main() {
..Default::default() // Use harmony defaults, they are based on CNPG's default values :
// 1 instance, 1Gi storage
},
hostname: "postgrestest.sto1.nationtech.io".to_string(),
};
let test_connection = PostgreSQLConnectionScore {


@@ -6,9 +6,9 @@ use harmony::{
application::{
ApplicationScore, RustWebFramework, RustWebapp, features::rhob_monitoring::Monitoring,
},
monitoring::alert_channel::discord_alert_channel::DiscordReceiver,
monitoring::alert_channel::discord_alert_channel::DiscordWebhook,
},
topology::{K8sAnywhereTopology, monitoring::AlertRoute},
topology::K8sAnywhereTopology,
};
use harmony_types::{k8s_name::K8sName, net::Url};
@@ -22,21 +22,18 @@ async fn main() {
service_port: 3000,
});
let receiver_name = "test-discord".to_string();
let discord_receiver = DiscordReceiver {
name: receiver_name.clone(),
let discord_receiver = DiscordWebhook {
name: K8sName("test-discord".to_string()),
url: Url::Url(url::Url::parse("https://discord.doesnt.exist.com").unwrap()),
route: AlertRoute {
..AlertRoute::default(receiver_name)
},
selectors: vec![],
};
let app = ApplicationScore {
features: vec![
// Box::new(Monitoring {
// application: application.clone(),
// alert_receiver: vec![Box::new(discord_receiver)],
// }),
Box::new(Monitoring {
application: application.clone(),
alert_receiver: vec![Box::new(discord_receiver)],
}),
// TODO add backups, multisite ha, etc
],
application,


@@ -8,13 +8,13 @@ use harmony::{
features::{Monitoring, PackagingDeployment},
},
monitoring::alert_channel::{
discord_alert_channel::DiscordReceiver, webhook_receiver::WebhookReceiver,
discord_alert_channel::DiscordWebhook, webhook_receiver::WebhookReceiver,
},
},
topology::{K8sAnywhereTopology, monitoring::AlertRoute},
topology::K8sAnywhereTopology,
};
use harmony_macros::hurl;
use harmony_types::{k8s_name::K8sName, net::Url};
use harmony_types::k8s_name::K8sName;
#[tokio::main]
async fn main() {
@@ -26,23 +26,15 @@ async fn main() {
service_port: 3000,
});
let receiver_name = "test-discord".to_string();
let discord_receiver = DiscordReceiver {
name: receiver_name.clone(),
url: Url::Url(url::Url::parse("https://discord.doesnt.exist.com").unwrap()),
route: AlertRoute {
..AlertRoute::default(receiver_name)
},
let discord_receiver = DiscordWebhook {
name: K8sName("test-discord".to_string()),
url: hurl!("https://discord.doesnt.exist.com"),
selectors: vec![],
};
let receiver_name = "sample-webhook-receiver".to_string();
let webhook_receiver = WebhookReceiver {
name: receiver_name.clone(),
name: "sample-webhook-receiver".to_string(),
url: hurl!("https://webhook-doesnt-exist.com"),
route: AlertRoute {
..AlertRoute::default(receiver_name)
},
};
let app = ApplicationScore {
@@ -50,10 +42,10 @@ async fn main() {
Box::new(PackagingDeployment {
application: application.clone(),
}),
// Box::new(Monitoring {
// application: application.clone(),
// alert_receiver: vec![Box::new(discord_receiver), Box::new(webhook_receiver)],
// }),
Box::new(Monitoring {
application: application.clone(),
alert_receiver: vec![Box::new(discord_receiver), Box::new(webhook_receiver)],
}),
// TODO add backups, multisite ha, etc
],
application,


@@ -1,8 +1,11 @@
use harmony::{
inventory::Inventory,
modules::application::{
ApplicationScore, RustWebFramework, RustWebapp,
features::{Monitoring, PackagingDeployment},
modules::{
application::{
ApplicationScore, RustWebFramework, RustWebapp,
features::{Monitoring, PackagingDeployment},
},
monitoring::alert_channel::discord_alert_channel::DiscordWebhook,
},
topology::K8sAnywhereTopology,
};
@@ -27,14 +30,14 @@ async fn main() {
Box::new(PackagingDeployment {
application: application.clone(),
}),
// Box::new(Monitoring {
// application: application.clone(),
// alert_receiver: vec![Box::new(DiscordWebhook {
// name: K8sName("test-discord".to_string()),
// url: hurl!("https://discord.doesnt.exist.com"),
// selectors: vec![],
// })],
// }),
Box::new(Monitoring {
application: application.clone(),
alert_receiver: vec![Box::new(DiscordWebhook {
name: K8sName("test-discord".to_string()),
url: hurl!("https://discord.doesnt.exist.com"),
selectors: vec![],
})],
}),
],
application,
};


@@ -44,6 +44,7 @@ fn build_large_score() -> LoadBalancerScore {
],
listening_port: SocketAddr::V4(SocketAddrV4::new(ipv4!("192.168.0.0"), 49387)),
health_check: Some(HealthCheck::HTTP(
Some(1993),
"/some_long_ass_path_to_see_how_it_is_displayed_but_it_has_to_be_even_longer"
.to_string(),
HttpMethod::GET,


@@ -46,14 +46,6 @@ impl std::fmt::Debug for K8sClient {
}
impl K8sClient {
pub fn inner_client(&self) -> &Client {
&self.client
}
pub fn inner_client_clone(&self) -> Client {
self.client.clone()
}
/// Create a client, reading `DRY_RUN` from the environment.
pub fn new(client: Client) -> Self {
Self {


@@ -2,13 +2,14 @@ use std::collections::HashMap;
use k8s_openapi::api::{
apps::v1::Deployment,
core::v1::{Node, ServiceAccount},
core::v1::{Namespace, Node, ServiceAccount},
};
use k8s_openapi::apiextensions_apiserver::pkg::apis::apiextensions::v1::CustomResourceDefinition;
use kube::api::ApiResource;
use kube::{
Error, Resource,
api::{Api, DynamicObject, GroupVersionKind, ListParams, ObjectList},
core::ErrorResponse,
runtime::conditions,
runtime::wait::await_condition,
};
@@ -313,4 +314,65 @@ impl K8sClient {
) -> Result<ObjectList<Node>, Error> {
self.list_resources(None, list_params).await
}
pub async fn namespace_exists(&self, name: &str) -> Result<bool, Error> {
let api: Api<Namespace> = Api::all(self.client.clone());
match api.get_opt(name).await? {
Some(_) => Ok(true),
None => Ok(false),
}
}
pub async fn create_namespace(&self, name: &str) -> Result<Namespace, Error> {
let namespace = Namespace {
metadata: k8s_openapi::apimachinery::pkg::apis::meta::v1::ObjectMeta {
name: Some(name.to_string()),
..Default::default()
},
..Default::default()
};
let api: Api<Namespace> = Api::all(self.client.clone());
api.create(&kube::api::PostParams::default(), &namespace)
.await
}
pub async fn wait_for_namespace(
&self,
name: &str,
timeout: Option<Duration>,
) -> Result<(), Error> {
let api: Api<Namespace> = Api::all(self.client.clone());
let timeout = timeout.unwrap_or(Duration::from_secs(60));
let start = std::time::Instant::now();
loop {
if start.elapsed() > timeout {
return Err(Error::Api(ErrorResponse {
status: "Timeout".to_string(),
message: format!("Namespace '{}' not ready within timeout", name),
reason: "Timeout".to_string(),
code: 408,
}));
}
match api.get_opt(name).await? {
Some(ns) => {
if let Some(status) = ns.status {
if status.phase == Some("Active".to_string()) {
return Ok(());
}
}
}
None => {
return Err(Error::Api(ErrorResponse {
status: "NotFound".to_string(),
message: format!("Namespace '{}' not found", name),
reason: "NotFound".to_string(),
code: 404,
}));
}
}
tokio::time::sleep(Duration::from_millis(500)).await;
}
}
}


@@ -42,7 +42,7 @@ impl Default for DrainOptions {
Self {
delete_emptydir_data: false,
ignore_daemonsets: true,
timeout: Duration::from_secs(1),
timeout: Duration::from_secs(120),
}
}
}


@@ -1,4 +1,4 @@
use std::{collections::BTreeMap, process::Command, sync::Arc};
use std::{collections::BTreeMap, process::Command, sync::Arc, time::Duration};
use async_trait::async_trait;
use base64::{Engine, engine::general_purpose};
@@ -8,7 +8,7 @@ use k8s_openapi::api::{
core::v1::{Pod, Secret},
rbac::v1::{ClusterRoleBinding, RoleRef, Subject},
};
use kube::api::{GroupVersionKind, ObjectMeta};
use kube::api::{DynamicObject, GroupVersionKind, ObjectMeta};
use log::{debug, info, trace, warn};
use serde::Serialize;
use tokio::sync::OnceCell;
@@ -29,7 +29,28 @@ use crate::{
score_cert_management::CertificateManagementScore,
},
k3d::K3DInstallationScore,
k8s::ingress::{K8sIngressScore, PathType},
monitoring::{
grafana::{grafana::Grafana, helm::helm_grafana::grafana_helm_chart_score},
kube_prometheus::crd::{
crd_alertmanager_config::CRDPrometheus,
crd_grafana::{
Grafana as GrafanaCRD, GrafanaCom, GrafanaDashboard,
GrafanaDashboardDatasource, GrafanaDashboardSpec, GrafanaDatasource,
GrafanaDatasourceConfig, GrafanaDatasourceJsonData,
GrafanaDatasourceSecureJsonData, GrafanaDatasourceSpec, GrafanaSpec,
},
crd_prometheuses::LabelSelector,
prometheus_operator::prometheus_operator_helm_chart_score,
rhob_alertmanager_config::RHOBObservability,
service_monitor::ServiceMonitor,
},
},
okd::{crd::ingresses_config::Ingress as IngressResource, route::OKDTlsPassthroughScore},
prometheus::{
k8s_prometheus_alerting_score::K8sPrometheusCRDAlertingScore,
prometheus::PrometheusMonitoring, rhob_alerting_score::RHOBAlertingScore,
},
},
score::Score,
topology::{TlsRoute, TlsRouter, ingress::Ingress},
@@ -38,6 +59,7 @@ use crate::{
use super::super::{
DeploymentTarget, HelmCommand, K8sclient, MultiTargetTopology, PreparationError,
PreparationOutcome, Topology,
oberservability::monitoring::AlertReceiver,
tenant::{
TenantConfig, TenantManager,
k8s::K8sTenantManager,
@@ -87,6 +109,13 @@ impl K8sclient for K8sAnywhereTopology {
#[async_trait]
impl TlsRouter for K8sAnywhereTopology {
async fn get_public_domain(&self) -> Result<String, String> {
match &self.config.public_domain {
Some(public_domain) => Ok(public_domain.to_string()),
None => Err("Public domain not available".to_string()),
}
}
async fn get_internal_domain(&self) -> Result<Option<String>, String> {
match self.get_k8s_distribution().await.map_err(|e| {
format!(
@@ -144,6 +173,216 @@ impl TlsRouter for K8sAnywhereTopology {
}
}
#[async_trait]
impl Grafana for K8sAnywhereTopology {
async fn ensure_grafana_operator(
&self,
inventory: &Inventory,
) -> Result<PreparationOutcome, PreparationError> {
debug!("ensure grafana operator");
let client = self.k8s_client().await.unwrap();
let grafana_gvk = GroupVersionKind {
group: "grafana.integreatly.org".to_string(),
version: "v1beta1".to_string(),
kind: "Grafana".to_string(),
};
let name = "grafanas.grafana.integreatly.org";
let ns = "grafana";
let grafana_crd = client
.get_resource_json_value(name, Some(ns), &grafana_gvk)
.await;
match grafana_crd {
Ok(_) => {
return Ok(PreparationOutcome::Success {
details: "Found grafana CRDs in cluster".to_string(),
});
}
Err(_) => {
return self
.install_grafana_operator(inventory, Some("grafana"))
.await;
}
};
}
async fn install_grafana(&self) -> Result<PreparationOutcome, PreparationError> {
let ns = "grafana";
let mut label = BTreeMap::new();
label.insert("dashboards".to_string(), "grafana".to_string());
let label_selector = LabelSelector {
match_labels: label.clone(),
match_expressions: vec![],
};
let client = self.k8s_client().await?;
let grafana = self.build_grafana(ns, &label);
client.apply(&grafana, Some(ns)).await?;
// TODO: replace this fixed timeout with a proper readiness check
client
.wait_until_deployment_ready(
"grafana-grafana-deployment",
Some("grafana"),
Some(Duration::from_secs(30)),
)
.await?;
let sa_name = "grafana-grafana-sa";
let token_secret_name = "grafana-sa-token-secret";
let sa_token_secret = self.build_sa_token_secret(token_secret_name, sa_name, ns);
client.apply(&sa_token_secret, Some(ns)).await?;
let secret_gvk = GroupVersionKind {
group: "".to_string(),
version: "v1".to_string(),
kind: "Secret".to_string(),
};
let secret = client
.get_resource_json_value(token_secret_name, Some(ns), &secret_gvk)
.await?;
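// Pass the raw token: `build_grafana_datasource` adds the "Bearer " prefix itself,
// so prefixing it here as well would produce "Bearer Bearer <token>".
let token = self.extract_and_normalize_token(&secret).unwrap();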
debug!("creating grafana clusterrole binding");
let clusterrolebinding =
self.build_cluster_rolebinding(sa_name, "cluster-monitoring-view", ns);
client.apply(&clusterrolebinding, Some(ns)).await?;
debug!("creating grafana datasource crd");
let thanos_url = format!(
"https://{}",
self.get_domain("thanos-querier-openshift-monitoring")
.await
.unwrap()
);
let thanos_openshift_datasource = self.build_grafana_datasource(
"thanos-openshift-monitoring",
ns,
&label_selector,
&thanos_url,
&token,
);
client.apply(&thanos_openshift_datasource, Some(ns)).await?;
debug!("creating grafana dashboard crd");
let dashboard = self.build_grafana_dashboard(ns, &label_selector);
client.apply(&dashboard, Some(ns)).await?;
debug!("creating grafana ingress");
let grafana_ingress = self.build_grafana_ingress(ns).await;
grafana_ingress
.interpret(&Inventory::empty(), self)
.await
.map_err(|e| PreparationError::new(e.to_string()))?;
Ok(PreparationOutcome::Success {
details: "Installed grafana components".to_string(),
})
}
}
#[async_trait]
impl PrometheusMonitoring<CRDPrometheus> for K8sAnywhereTopology {
async fn install_prometheus(
&self,
sender: &CRDPrometheus,
_inventory: &Inventory,
_receivers: Option<Vec<Box<dyn AlertReceiver<CRDPrometheus>>>>,
) -> Result<PreparationOutcome, PreparationError> {
let client = self.k8s_client().await?;
for monitor in sender.service_monitor.iter() {
client
.apply(monitor, Some(&sender.namespace))
.await
.map_err(|e| PreparationError::new(e.to_string()))?;
}
Ok(PreparationOutcome::Success {
details: "successfully installed prometheus components".to_string(),
})
}
async fn ensure_prometheus_operator(
&self,
sender: &CRDPrometheus,
_inventory: &Inventory,
) -> Result<PreparationOutcome, PreparationError> {
let po_result = self.ensure_prometheus_operator(sender).await?;
match po_result {
PreparationOutcome::Success { details: _ } => {
debug!("Detected prometheus crds operator present in cluster.");
return Ok(po_result);
}
PreparationOutcome::Noop => {
debug!("Skipping Prometheus CR installation due to missing operator.");
return Ok(po_result);
}
}
}
}
#[async_trait]
impl PrometheusMonitoring<RHOBObservability> for K8sAnywhereTopology {
async fn install_prometheus(
&self,
sender: &RHOBObservability,
inventory: &Inventory,
receivers: Option<Vec<Box<dyn AlertReceiver<RHOBObservability>>>>,
) -> Result<PreparationOutcome, PreparationError> {
let po_result = self.ensure_cluster_observability_operator(sender).await?;
if po_result == PreparationOutcome::Noop {
debug!("Skipping Prometheus CR installation due to missing operator.");
return Ok(po_result);
}
let result = self
.get_cluster_observability_operator_prometheus_application_score(
sender.clone(),
receivers,
)
.await
.interpret(inventory, self)
.await;
match result {
Ok(outcome) => match outcome.status {
InterpretStatus::SUCCESS => Ok(PreparationOutcome::Success {
details: outcome.message,
}),
InterpretStatus::NOOP => Ok(PreparationOutcome::Noop),
_ => Err(PreparationError::new(outcome.message)),
},
Err(err) => Err(PreparationError::new(err.to_string())),
}
}
async fn ensure_prometheus_operator(
&self,
sender: &RHOBObservability,
inventory: &Inventory,
) -> Result<PreparationOutcome, PreparationError> {
todo!()
}
}
impl Serialize for K8sAnywhereTopology {
fn serialize<S>(&self, _serializer: S) -> Result<S::Ok, S::Error>
where
@@ -348,6 +587,23 @@ impl K8sAnywhereTopology {
}
}
fn extract_and_normalize_token(&self, secret: &DynamicObject) -> Option<String> {
let token_b64 = secret
.data
.get("token")
.or_else(|| secret.data.get("data").and_then(|d| d.get("token")))
.and_then(|v| v.as_str())?;
let bytes = general_purpose::STANDARD.decode(token_b64).ok()?;
let s = String::from_utf8(bytes).ok()?;
let cleaned = s
.trim_matches(|c: char| c.is_whitespace() || c == '\0')
.to_string();
Some(cleaned)
}
pub async fn get_k8s_distribution(&self) -> Result<KubernetesDistribution, PreparationError> {
self.k8s_client()
.await?
@@ -407,6 +663,141 @@ impl K8sAnywhereTopology {
}
}
fn build_grafana_datasource(
&self,
name: &str,
ns: &str,
label_selector: &LabelSelector,
url: &str,
token: &str,
) -> GrafanaDatasource {
let mut json_data = BTreeMap::new();
json_data.insert("timeInterval".to_string(), "5s".to_string());
GrafanaDatasource {
metadata: ObjectMeta {
name: Some(name.to_string()),
namespace: Some(ns.to_string()),
..Default::default()
},
spec: GrafanaDatasourceSpec {
instance_selector: label_selector.clone(),
allow_cross_namespace_import: Some(true),
values_from: None,
datasource: GrafanaDatasourceConfig {
access: "proxy".to_string(),
name: name.to_string(),
r#type: "prometheus".to_string(),
url: url.to_string(),
database: None,
json_data: Some(GrafanaDatasourceJsonData {
time_interval: Some("60s".to_string()),
http_header_name1: Some("Authorization".to_string()),
tls_skip_verify: Some(true),
oauth_pass_thru: Some(true),
}),
secure_json_data: Some(GrafanaDatasourceSecureJsonData {
http_header_value1: Some(format!("Bearer {token}")),
}),
is_default: Some(false),
editable: Some(true),
},
},
}
}
fn build_grafana_dashboard(
&self,
ns: &str,
label_selector: &LabelSelector,
) -> GrafanaDashboard {
GrafanaDashboard {
metadata: ObjectMeta {
name: Some(format!("grafana-dashboard-{}", ns)),
namespace: Some(ns.to_string()),
..Default::default()
},
spec: GrafanaDashboardSpec {
resync_period: Some("30s".to_string()),
instance_selector: label_selector.clone(),
datasources: Some(vec![GrafanaDashboardDatasource {
input_name: "DS_PROMETHEUS".to_string(),
datasource_name: "thanos-openshift-monitoring".to_string(),
}]),
json: None,
grafana_com: Some(GrafanaCom {
id: 17406,
revision: None,
}),
},
}
}
fn build_grafana(&self, ns: &str, labels: &BTreeMap<String, String>) -> GrafanaCRD {
GrafanaCRD {
metadata: ObjectMeta {
name: Some(format!("grafana-{}", ns)),
namespace: Some(ns.to_string()),
labels: Some(labels.clone()),
..Default::default()
},
spec: GrafanaSpec {
config: None,
admin_user: None,
admin_password: None,
ingress: None,
persistence: None,
resources: None,
},
}
}
async fn build_grafana_ingress(&self, ns: &str) -> K8sIngressScore {
let domain = self.get_domain(&format!("grafana-{}", ns)).await.unwrap();
let name = format!("{}-grafana", ns);
let backend_service = format!("grafana-{}-service", ns);
K8sIngressScore {
name: fqdn::fqdn!(&name),
host: fqdn::fqdn!(&domain),
backend_service: fqdn::fqdn!(&backend_service),
port: 3000,
path: Some("/".to_string()),
path_type: Some(PathType::Prefix),
namespace: Some(fqdn::fqdn!(&ns)),
ingress_class_name: Some("openshift-default".to_string()),
}
}
async fn get_cluster_observability_operator_prometheus_application_score(
&self,
sender: RHOBObservability,
receivers: Option<Vec<Box<dyn AlertReceiver<RHOBObservability>>>>,
) -> RHOBAlertingScore {
RHOBAlertingScore {
sender,
receivers: receivers.unwrap_or_default(),
service_monitors: vec![],
prometheus_rules: vec![],
}
}
async fn get_k8s_prometheus_application_score(
&self,
sender: CRDPrometheus,
receivers: Option<Vec<Box<dyn AlertReceiver<CRDPrometheus>>>>,
service_monitors: Option<Vec<ServiceMonitor>>,
) -> K8sPrometheusCRDAlertingScore {
K8sPrometheusCRDAlertingScore {
sender,
receivers: receivers.unwrap_or_default(),
service_monitors: service_monitors.unwrap_or_default(),
prometheus_rules: vec![],
}
}
async fn openshift_ingress_operator_available(&self) -> Result<(), PreparationError> {
let client = self.k8s_client().await?;
let gvk = GroupVersionKind {
@@ -572,6 +963,137 @@ impl K8sAnywhereTopology {
)),
}
}
async fn ensure_cluster_observability_operator(
&self,
sender: &RHOBObservability,
) -> Result<PreparationOutcome, PreparationError> {
let status = Command::new("sh")
.args(["-c", "kubectl get crd -A | grep -i rhobs"])
.status()
.map_err(|e| PreparationError::new(format!("could not connect to cluster: {}", e)))?;
if !status.success() {
if let Some(Some(k8s_state)) = self.k8s_state.get() {
match k8s_state.source {
K8sSource::LocalK3d => {
// TODO: install the cluster observability operator on LocalK3d once supported.
warn!(
"Installing observability operator is not supported on LocalK3d source"
);
return Ok(PreparationOutcome::Noop);
}
K8sSource::Kubeconfig => {
debug!(
"unable to install cluster observability operator, contact cluster admin"
);
return Ok(PreparationOutcome::Noop);
}
}
} else {
warn!(
"Unable to detect k8s_state. Skipping Cluster Observability Operator install."
);
return Ok(PreparationOutcome::Noop);
}
}
debug!("Cluster Observability Operator is already present, skipping install");
Ok(PreparationOutcome::Success {
details: "cluster observability operator present in cluster".into(),
})
}
async fn ensure_prometheus_operator(
&self,
sender: &CRDPrometheus,
) -> Result<PreparationOutcome, PreparationError> {
let status = Command::new("sh")
.args(["-c", "kubectl get crd -A | grep -i prometheuses"])
.status()
.map_err(|e| PreparationError::new(format!("could not connect to cluster: {}", e)))?;
if !status.success() {
if let Some(Some(k8s_state)) = self.k8s_state.get() {
match k8s_state.source {
K8sSource::LocalK3d => {
debug!("installing prometheus operator");
let op_score =
prometheus_operator_helm_chart_score(sender.namespace.clone());
let result = op_score.interpret(&Inventory::empty(), self).await;
return match result {
Ok(outcome) => match outcome.status {
InterpretStatus::SUCCESS => Ok(PreparationOutcome::Success {
details: "installed prometheus operator".into(),
}),
InterpretStatus::NOOP => Ok(PreparationOutcome::Noop),
_ => Err(PreparationError::new(
"failed to install prometheus operator (unknown error)".into(),
)),
},
Err(err) => Err(PreparationError::new(err.to_string())),
};
}
K8sSource::Kubeconfig => {
debug!("unable to install prometheus operator, contact cluster admin");
return Ok(PreparationOutcome::Noop);
}
}
} else {
warn!("Unable to detect k8s_state. Skipping Prometheus Operator install.");
return Ok(PreparationOutcome::Noop);
}
}
debug!("Prometheus operator is already present, skipping install");
Ok(PreparationOutcome::Success {
details: "prometheus operator present in cluster".into(),
})
}
async fn install_grafana_operator(
&self,
inventory: &Inventory,
ns: Option<&str>,
) -> Result<PreparationOutcome, PreparationError> {
let namespace = ns.unwrap_or("grafana");
info!("installing grafana operator in ns {namespace}");
let tenant = self.get_k8s_tenant_manager()?.get_tenant_config().await;
let namespace_scope = tenant.is_some();
// Propagate installation failures instead of silently discarding the result.
grafana_helm_chart_score(namespace, namespace_scope)
.interpret(inventory, self)
.await
.map_err(|e| PreparationError::new(e.to_string()))?;
// Report the resolved namespace; `ns.unwrap()` would panic when `ns` is `None`.
Ok(PreparationOutcome::Success {
details: format!("Successfully installed grafana operator in ns {namespace}"),
})
}
}
#[derive(Clone, Debug)]
@@ -609,6 +1131,7 @@ pub struct K8sAnywhereConfig {
///
/// If the context name is not found, it will fail to initialize.
pub k8s_context: Option<String>,
public_domain: Option<String>,
}
impl K8sAnywhereConfig {
@@ -636,6 +1159,7 @@ impl K8sAnywhereConfig {
let mut kubeconfig: Option<String> = None;
let mut k8s_context: Option<String> = None;
let mut public_domain: Option<String> = None;
for part in env_var_value.split(',') {
let kv: Vec<&str> = part.splitn(2, '=').collect();
@@ -643,6 +1167,7 @@ impl K8sAnywhereConfig {
match kv[0].trim() {
"kubeconfig" => kubeconfig = Some(kv[1].trim().to_string()),
"context" => k8s_context = Some(kv[1].trim().to_string()),
"public_domain" => public_domain = Some(kv[1].trim().to_string()),
_ => {}
}
}
@@ -660,6 +1185,7 @@ impl K8sAnywhereConfig {
K8sAnywhereConfig {
kubeconfig,
k8s_context,
public_domain,
use_system_kubeconfig,
autoinstall: false,
use_local_k3d: false,
@@ -702,6 +1228,7 @@ impl K8sAnywhereConfig {
use_local_k3d: std::env::var("HARMONY_USE_LOCAL_K3D")
.map_or_else(|_| true, |v| v.parse().ok().unwrap_or(true)),
k8s_context: std::env::var("HARMONY_K8S_CONTEXT").ok(),
public_domain: std::env::var("HARMONY_PUBLIC_DOMAIN").ok(),
}
}
}
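
The hunks above extend the comma-separated `key=value` parsing in `K8sAnywhereConfig` with a `public_domain` key. A minimal standalone sketch of that parsing follows. `ParsedConfig` and `parse_config` are hypothetical names used only for illustration, and the `kv.len() != 2` guard is an assumption, since the corresponding lines fall outside the hunks shown.

```rust
/// Hypothetical, simplified stand-in for the fields parsed in the diff.
#[derive(Debug, Default, PartialEq)]
struct ParsedConfig {
    kubeconfig: Option<String>,
    k8s_context: Option<String>,
    public_domain: Option<String>,
}

/// Parses a value such as "kubeconfig=/path,context=dev,public_domain=example.com".
/// Unknown keys and parts without '=' are ignored (assumed behaviour).
fn parse_config(value: &str) -> ParsedConfig {
    let mut cfg = ParsedConfig::default();
    for part in value.split(',') {
        // Split on the first '=' only, so values may themselves contain '='.
        let kv: Vec<&str> = part.splitn(2, '=').collect();
        if kv.len() != 2 {
            continue;
        }
        let parsed = Some(kv[1].trim().to_string());
        match kv[0].trim() {
            "kubeconfig" => cfg.kubeconfig = parsed,
            "context" => cfg.k8s_context = parsed,
            "public_domain" => cfg.public_domain = parsed,
            _ => {}
        }
    }
    cfg
}
```

Calling `parse_config("context=dev,public_domain=example.com")` would fill `k8s_context` and `public_domain` and leave `kubeconfig` as `None`.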


@@ -1,5 +1,4 @@
mod k8s_anywhere;
pub mod nats;
pub mod observability;
mod postgres;
pub use k8s_anywhere::*;


@@ -1,147 +0,0 @@
use async_trait::async_trait;
use crate::{
inventory::Inventory,
modules::monitoring::grafana::{
grafana::Grafana,
k8s::{
score_ensure_grafana_ready::GrafanaK8sEnsureReadyScore,
score_grafana_alert_receiver::GrafanaK8sReceiverScore,
score_grafana_datasource::GrafanaK8sDatasourceScore,
score_grafana_rule::GrafanaK8sRuleScore, score_install_grafana::GrafanaK8sInstallScore,
},
},
score::Score,
topology::{
K8sAnywhereTopology, PreparationError, PreparationOutcome,
monitoring::{AlertReceiver, AlertRule, Observability, ScrapeTarget},
},
};
#[async_trait]
impl Observability<Grafana> for K8sAnywhereTopology {
async fn install_alert_sender(
&self,
sender: &Grafana,
inventory: &Inventory,
) -> Result<PreparationOutcome, PreparationError> {
let score = GrafanaK8sInstallScore {
sender: sender.clone(),
};
score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(format!("Grafana not installed {}", e)))?;
Ok(PreparationOutcome::Success {
details: "Successfully installed grafana alert sender".to_string(),
})
}
async fn install_receivers(
&self,
sender: &Grafana,
inventory: &Inventory,
receivers: Option<Vec<Box<dyn AlertReceiver<Grafana>>>>,
) -> Result<PreparationOutcome, PreparationError> {
let receivers = match receivers {
Some(r) if !r.is_empty() => r,
_ => return Ok(PreparationOutcome::Noop),
};
for receiver in receivers {
let score = GrafanaK8sReceiverScore {
receiver,
sender: sender.clone(),
};
score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(format!("Failed to install receiver: {}", e)))?;
}
Ok(PreparationOutcome::Success {
details: "All alert receivers installed successfully".to_string(),
})
}
async fn install_rules(
&self,
sender: &Grafana,
inventory: &Inventory,
rules: Option<Vec<Box<dyn AlertRule<Grafana>>>>,
) -> Result<PreparationOutcome, PreparationError> {
let rules = match rules {
Some(r) if !r.is_empty() => r,
_ => return Ok(PreparationOutcome::Noop),
};
for rule in rules {
let score = GrafanaK8sRuleScore {
sender: sender.clone(),
rule,
};
score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(format!("Failed to install rule: {}", e)))?;
}
Ok(PreparationOutcome::Success {
details: "All alert rules installed successfully".to_string(),
})
}
async fn add_scrape_targets(
&self,
sender: &Grafana,
inventory: &Inventory,
scrape_targets: Option<Vec<Box<dyn ScrapeTarget<Grafana>>>>,
) -> Result<PreparationOutcome, PreparationError> {
let scrape_targets = match scrape_targets {
Some(r) if !r.is_empty() => r,
_ => return Ok(PreparationOutcome::Noop),
};
for scrape_target in scrape_targets {
let score = GrafanaK8sDatasourceScore {
scrape_target,
sender: sender.clone(),
};
score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(format!("Failed to add DataSource: {}", e)))?;
}
Ok(PreparationOutcome::Success {
details: "All datasources installed successfully".to_string(),
})
}
async fn ensure_monitoring_installed(
&self,
sender: &Grafana,
inventory: &Inventory,
) -> Result<PreparationOutcome, PreparationError> {
let score = GrafanaK8sEnsureReadyScore {
sender: sender.clone(),
};
score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(format!("Grafana not ready {}", e)))?;
Ok(PreparationOutcome::Success {
details: "Grafana Ready".to_string(),
})
}
}
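
Several methods in the removed `Observability` implementation share one shape: return `Noop` when the optional list is absent or empty, otherwise run one score per item and stop at the first error. A minimal generic sketch of that shape follows. `Outcome` and `install_all` are hypothetical simplifications, not types from the crate, and errors are plain `String`s here.

```rust
/// Hypothetical, simplified stand-in for `PreparationOutcome`.
#[derive(Debug, PartialEq)]
enum Outcome {
    Success { details: String },
    Noop,
}

/// Applies `install` to every item, stopping at the first error.
/// Returns `Noop` when there is nothing to install.
fn install_all<T>(
    items: Option<Vec<T>>,
    mut install: impl FnMut(T) -> Result<(), String>,
) -> Result<Outcome, String> {
    // Same guard as in the diff: `None` and an empty list are both a no-op.
    let items = match items {
        Some(v) if !v.is_empty() => v,
        _ => return Ok(Outcome::Noop),
    };
    let count = items.len();
    for item in items {
        install(item)?;
    }
    Ok(Outcome::Success {
        details: format!("installed {count} item(s)"),
    })
}
```

The `if let Some(...)` form used in the OpenShift implementation behaves differently: it reports `Success` for an empty list, whereas this guard reports `Noop`.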


@@ -1,142 +0,0 @@
use async_trait::async_trait;
use crate::{
inventory::Inventory,
modules::monitoring::kube_prometheus::{
KubePrometheus, helm::kube_prometheus_helm_chart::kube_prometheus_helm_chart_score,
score_kube_prometheus_alert_receivers::KubePrometheusReceiverScore,
score_kube_prometheus_ensure_ready::KubePrometheusEnsureReadyScore,
score_kube_prometheus_rule::KubePrometheusRuleScore,
score_kube_prometheus_scrape_target::KubePrometheusScrapeTargetScore,
},
score::Score,
topology::{
K8sAnywhereTopology, PreparationError, PreparationOutcome,
monitoring::{AlertReceiver, AlertRule, Observability, ScrapeTarget},
},
};
#[async_trait]
impl Observability<KubePrometheus> for K8sAnywhereTopology {
async fn install_alert_sender(
&self,
sender: &KubePrometheus,
inventory: &Inventory,
) -> Result<PreparationOutcome, PreparationError> {
kube_prometheus_helm_chart_score(sender.config.clone())
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(e.to_string()))?;
Ok(PreparationOutcome::Success {
details: "Successfully installed kubeprometheus alert sender".to_string(),
})
}
async fn install_receivers(
&self,
sender: &KubePrometheus,
inventory: &Inventory,
receivers: Option<Vec<Box<dyn AlertReceiver<KubePrometheus>>>>,
) -> Result<PreparationOutcome, PreparationError> {
let receivers = match receivers {
Some(r) if !r.is_empty() => r,
_ => return Ok(PreparationOutcome::Noop),
};
for receiver in receivers {
let score = KubePrometheusReceiverScore {
receiver,
sender: sender.clone(),
};
score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(format!("Failed to install receiver: {}", e)))?;
}
Ok(PreparationOutcome::Success {
details: "All alert receivers installed successfully".to_string(),
})
}
async fn install_rules(
&self,
sender: &KubePrometheus,
inventory: &Inventory,
rules: Option<Vec<Box<dyn AlertRule<KubePrometheus>>>>,
) -> Result<PreparationOutcome, PreparationError> {
let rules = match rules {
Some(r) if !r.is_empty() => r,
_ => return Ok(PreparationOutcome::Noop),
};
for rule in rules {
let score = KubePrometheusRuleScore {
sender: sender.clone(),
rule,
};
score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(format!("Failed to install rule: {}", e)))?;
}
Ok(PreparationOutcome::Success {
details: "All alert rules installed successfully".to_string(),
})
}
async fn add_scrape_targets(
&self,
sender: &KubePrometheus,
inventory: &Inventory,
scrape_targets: Option<Vec<Box<dyn ScrapeTarget<KubePrometheus>>>>,
) -> Result<PreparationOutcome, PreparationError> {
let scrape_targets = match scrape_targets {
Some(r) if !r.is_empty() => r,
_ => return Ok(PreparationOutcome::Noop),
};
for scrape_target in scrape_targets {
let score = KubePrometheusScrapeTargetScore {
scrape_target,
sender: sender.clone(),
};
score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(format!("Failed to add scrape target: {}", e)))?;
}
Ok(PreparationOutcome::Success {
details: "All scrape targets installed successfully".to_string(),
})
}
async fn ensure_monitoring_installed(
&self,
sender: &KubePrometheus,
inventory: &Inventory,
) -> Result<PreparationOutcome, PreparationError> {
let score = KubePrometheusEnsureReadyScore {
sender: sender.clone(),
};
score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(format!("KubePrometheus not ready {}", e)))?;
Ok(PreparationOutcome::Success {
details: "KubePrometheus Ready".to_string(),
})
}
}


@@ -1,5 +0,0 @@
pub mod grafana;
pub mod kube_prometheus;
pub mod openshift_monitoring;
pub mod prometheus;
pub mod redhat_cluster_observability;


@@ -1,142 +0,0 @@
use async_trait::async_trait;
use log::info;
use crate::score::Score;
use crate::{
inventory::Inventory,
modules::monitoring::okd::{
OpenshiftClusterAlertSender,
score_enable_cluster_monitoring::OpenshiftEnableClusterMonitoringScore,
score_openshift_alert_rule::OpenshiftAlertRuleScore,
score_openshift_receiver::OpenshiftReceiverScore,
score_openshift_scrape_target::OpenshiftScrapeTargetScore,
score_user_workload::OpenshiftUserWorkloadMonitoring,
score_verify_user_workload_monitoring::VerifyUserWorkload,
},
topology::{
K8sAnywhereTopology, PreparationError, PreparationOutcome,
monitoring::{AlertReceiver, AlertRule, Observability, ScrapeTarget},
},
};
#[async_trait]
impl Observability<OpenshiftClusterAlertSender> for K8sAnywhereTopology {
async fn install_alert_sender(
&self,
_sender: &OpenshiftClusterAlertSender,
inventory: &Inventory,
) -> Result<PreparationOutcome, PreparationError> {
info!("enabling cluster monitoring");
let cluster_monitoring_score = OpenshiftEnableClusterMonitoringScore {};
cluster_monitoring_score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError { msg: e.to_string() })?;
info!("enabling user workload monitoring");
let user_workload_score = OpenshiftUserWorkloadMonitoring {};
user_workload_score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError { msg: e.to_string() })?;
Ok(PreparationOutcome::Success {
details: "Successfully configured cluster monitoring".to_string(),
})
}
async fn install_receivers(
&self,
_sender: &OpenshiftClusterAlertSender,
inventory: &Inventory,
receivers: Option<Vec<Box<dyn AlertReceiver<OpenshiftClusterAlertSender>>>>,
) -> Result<PreparationOutcome, PreparationError> {
if let Some(receivers) = receivers {
for receiver in receivers {
info!("Installing receiver {}", receiver.name());
let receiver_score = OpenshiftReceiverScore { receiver };
receiver_score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError { msg: e.to_string() })?;
}
Ok(PreparationOutcome::Success {
details: "Successfully installed receivers for OpenshiftClusterMonitoring"
.to_string(),
})
} else {
Ok(PreparationOutcome::Noop)
}
}
async fn install_rules(
&self,
_sender: &OpenshiftClusterAlertSender,
inventory: &Inventory,
rules: Option<Vec<Box<dyn AlertRule<OpenshiftClusterAlertSender>>>>,
) -> Result<PreparationOutcome, PreparationError> {
if let Some(rules) = rules {
for rule in rules {
info!("Installing rule");
let rule_score = OpenshiftAlertRuleScore { rule };
rule_score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError { msg: e.to_string() })?;
}
Ok(PreparationOutcome::Success {
details: "Successfully installed rules for OpenshiftClusterMonitoring".to_string(),
})
} else {
Ok(PreparationOutcome::Noop)
}
}
async fn add_scrape_targets(
&self,
_sender: &OpenshiftClusterAlertSender,
inventory: &Inventory,
scrape_targets: Option<Vec<Box<dyn ScrapeTarget<OpenshiftClusterAlertSender>>>>,
) -> Result<PreparationOutcome, PreparationError> {
if let Some(scrape_targets) = scrape_targets {
for scrape_target in scrape_targets {
info!("Installing scrape target");
let scrape_target_score = OpenshiftScrapeTargetScore { scrape_target };
scrape_target_score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError { msg: e.to_string() })?;
}
Ok(PreparationOutcome::Success {
details: "Successfully added scrape targets for OpenshiftClusterMonitoring"
.to_string(),
})
} else {
Ok(PreparationOutcome::Noop)
}
}
async fn ensure_monitoring_installed(
&self,
_sender: &OpenshiftClusterAlertSender,
inventory: &Inventory,
) -> Result<PreparationOutcome, PreparationError> {
let verify_monitoring_score = VerifyUserWorkload {};
info!("Verifying user workload and cluster monitoring installed");
verify_monitoring_score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError { msg: e.to_string() })?;
Ok(PreparationOutcome::Success {
details: "OpenshiftClusterMonitoring ready".to_string(),
})
}
}


@@ -1,147 +0,0 @@
use async_trait::async_trait;
use crate::{
inventory::Inventory,
modules::monitoring::prometheus::{
Prometheus, score_prometheus_alert_receivers::PrometheusReceiverScore,
score_prometheus_ensure_ready::PrometheusEnsureReadyScore,
score_prometheus_install::PrometheusInstallScore,
score_prometheus_rule::PrometheusRuleScore,
score_prometheus_scrape_target::PrometheusScrapeTargetScore,
},
score::Score,
topology::{
K8sAnywhereTopology, PreparationError, PreparationOutcome,
monitoring::{AlertReceiver, AlertRule, Observability, ScrapeTarget},
},
};
#[async_trait]
impl Observability<Prometheus> for K8sAnywhereTopology {
async fn install_alert_sender(
&self,
sender: &Prometheus,
inventory: &Inventory,
) -> Result<PreparationOutcome, PreparationError> {
let score = PrometheusInstallScore {
sender: sender.clone(),
};
score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(format!("Prometheus not installed {}", e)))?;
Ok(PreparationOutcome::Success {
details: "Successfully installed Prometheus alert sender".to_string(),
})
}
async fn install_receivers(
&self,
sender: &Prometheus,
inventory: &Inventory,
receivers: Option<Vec<Box<dyn AlertReceiver<Prometheus>>>>,
) -> Result<PreparationOutcome, PreparationError> {
let receivers = match receivers {
Some(r) if !r.is_empty() => r,
_ => return Ok(PreparationOutcome::Noop),
};
for receiver in receivers {
let score = PrometheusReceiverScore {
receiver,
sender: sender.clone(),
};
score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(format!("Failed to install receiver: {}", e)))?;
}
Ok(PreparationOutcome::Success {
details: "All alert receivers installed successfully".to_string(),
})
}
async fn install_rules(
&self,
sender: &Prometheus,
inventory: &Inventory,
rules: Option<Vec<Box<dyn AlertRule<Prometheus>>>>,
) -> Result<PreparationOutcome, PreparationError> {
let rules = match rules {
Some(r) if !r.is_empty() => r,
_ => return Ok(PreparationOutcome::Noop),
};
for rule in rules {
let score = PrometheusRuleScore {
sender: sender.clone(),
rule,
};
score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(format!("Failed to install rule: {}", e)))?;
}
Ok(PreparationOutcome::Success {
details: "All alert rules installed successfully".to_string(),
})
}
async fn add_scrape_targets(
&self,
sender: &Prometheus,
inventory: &Inventory,
scrape_targets: Option<Vec<Box<dyn ScrapeTarget<Prometheus>>>>,
) -> Result<PreparationOutcome, PreparationError> {
let scrape_targets = match scrape_targets {
Some(r) if !r.is_empty() => r,
_ => return Ok(PreparationOutcome::Noop),
};
for scrape_target in scrape_targets {
let score = PrometheusScrapeTargetScore {
scrape_target,
sender: sender.clone(),
};
score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(format!("Failed to add scrape target: {}", e)))?;
}
Ok(PreparationOutcome::Success {
details: "All scrape targets installed successfully".to_string(),
})
}
async fn ensure_monitoring_installed(
&self,
sender: &Prometheus,
inventory: &Inventory,
) -> Result<PreparationOutcome, PreparationError> {
let score = PrometheusEnsureReadyScore {
sender: sender.clone(),
};
score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(format!("Prometheus not ready {}", e)))?;
Ok(PreparationOutcome::Success {
details: "Prometheus Ready".to_string(),
})
}
}


@@ -1,116 +0,0 @@
use crate::{
modules::monitoring::red_hat_cluster_observability::{
score_alert_receiver::RedHatClusterObservabilityReceiverScore,
score_coo_monitoring_stack::RedHatClusterObservabilityMonitoringStackScore,
},
score::Score,
};
use async_trait::async_trait;
use log::info;
use crate::{
inventory::Inventory,
modules::monitoring::red_hat_cluster_observability::{
RedHatClusterObservability,
score_redhat_cluster_observability_operator::RedHatClusterObservabilityOperatorScore,
},
topology::{
K8sAnywhereTopology, PreparationError, PreparationOutcome,
monitoring::{AlertReceiver, AlertRule, Observability, ScrapeTarget},
},
};
#[async_trait]
impl Observability<RedHatClusterObservability> for K8sAnywhereTopology {
async fn install_alert_sender(
&self,
sender: &RedHatClusterObservability,
inventory: &Inventory,
) -> Result<PreparationOutcome, PreparationError> {
info!("Verifying Red Hat Cluster Observability Operator");
let coo_score = RedHatClusterObservabilityOperatorScore::default();
coo_score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(e.to_string()))?;
info!(
"Installing Cluster Observability Operator Monitoring Stack in ns {}",
sender.namespace.clone()
);
let coo_monitoring_stack_score = RedHatClusterObservabilityMonitoringStackScore {
namespace: sender.namespace.clone(),
resource_selector: sender.resource_selector.clone(),
};
coo_monitoring_stack_score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(e.to_string()))?;
Ok(PreparationOutcome::Success {
details: "Successfully installed RedHatClusterObservability Operator".to_string(),
})
}
async fn install_receivers(
&self,
sender: &RedHatClusterObservability,
inventory: &Inventory,
receivers: Option<Vec<Box<dyn AlertReceiver<RedHatClusterObservability>>>>,
) -> Result<PreparationOutcome, PreparationError> {
let receivers = match receivers {
Some(r) if !r.is_empty() => r,
_ => return Ok(PreparationOutcome::Noop),
};
for receiver in receivers {
info!("Installing receiver {}", receiver.name());
let receiver_score = RedHatClusterObservabilityReceiverScore {
receiver,
sender: sender.clone(),
};
receiver_score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(e.to_string()))?;
}
Ok(PreparationOutcome::Success {
details: "Successfully installed receivers for RedHatClusterObservability".to_string(),
})
}
async fn install_rules(
&self,
_sender: &RedHatClusterObservability,
_inventory: &Inventory,
_rules: Option<Vec<Box<dyn AlertRule<RedHatClusterObservability>>>>,
) -> Result<PreparationOutcome, PreparationError> {
todo!()
}
async fn add_scrape_targets(
&self,
_sender: &RedHatClusterObservability,
_inventory: &Inventory,
_scrape_targets: Option<Vec<Box<dyn ScrapeTarget<RedHatClusterObservability>>>>,
) -> Result<PreparationOutcome, PreparationError> {
todo!()
}
async fn ensure_monitoring_installed(
&self,
_sender: &RedHatClusterObservability,
_inventory: &Inventory,
) -> Result<PreparationOutcome, PreparationError> {
todo!()
}
}


@@ -106,6 +106,7 @@ pub enum SSL {
#[derive(Debug, Clone, PartialEq, Serialize)]
pub enum HealthCheck {
HTTP(String, HttpMethod, HttpStatusCode, SSL),
/// HTTP(None, "/healthz/ready", HttpMethod::GET, HttpStatusCode::Success2xx, SSL::Disabled)
HTTP(Option<u16>, String, HttpMethod, HttpStatusCode, SSL),
TCP(Option<u16>),
}
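
The hunk above shows `HealthCheck::HTTP` both with and without a leading `Option<u16>` port. How a consumer might resolve the port actually probed can be sketched as below. The enums are simplified local stand-ins reduced to one variant each, and `effective_port` is a hypothetical helper; falling back to the service port when the override is `None` is an assumption, not behaviour confirmed by the diff.

```rust
// Simplified local stand-ins for the crate's types (assumed shapes).
#[derive(Debug, Clone, PartialEq)]
enum HttpMethod {
    GET,
}

#[derive(Debug, Clone, PartialEq)]
enum HttpStatusCode {
    Success2xx,
}

#[derive(Debug, Clone, PartialEq)]
enum SSL {
    Disabled,
}

#[derive(Debug, Clone, PartialEq)]
enum HealthCheck {
    /// Optional port override, path, method, expected status, SSL mode.
    HTTP(Option<u16>, String, HttpMethod, HttpStatusCode, SSL),
    TCP(Option<u16>),
}

/// Hypothetical helper: use the check's own port when set,
/// otherwise fall back to the port of the service being checked.
fn effective_port(check: &HealthCheck, service_port: u16) -> u16 {
    match check {
        HealthCheck::HTTP(port, ..) | HealthCheck::TCP(port) => port.unwrap_or(service_port),
    }
}
```

With this shape, the doc-comment example `HTTP(None, "/healthz/ready", ...)` would probe the service's own port, while `Some(9000)` would redirect the check to port 9000.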


@@ -2,7 +2,6 @@ pub mod decentralized;
mod failover;
mod ha_cluster;
pub mod ingress;
pub mod monitoring;
pub mod node_exporter;
pub mod opnsense;
pub use failover::*;
@@ -12,6 +11,7 @@ mod http;
pub mod installable;
mod k8s_anywhere;
mod localhost;
pub mod oberservability;
pub mod tenant;
use derive_new::new;
pub use k8s_anywhere::*;

Some files were not shown because too many files have changed in this diff