Compare commits


136 Commits

Author SHA1 Message Date
5abc1b217c wip fix load balancer idempotency
Some checks failed
Run Check Script / check (pull_request) Failing after 7s
2026-04-02 16:44:48 -04:00
a7d1abd0be wip fix load balancer idempotency
Some checks failed
Run Check Script / check (pull_request) Failing after 9s
2026-04-02 16:34:54 -04:00
92b46c5c08 fix: haproxy listens on 0.0.0.0 and opens the WAN in the firewall for OKD deployment; also disables the HTTP redirect rule for the OPNsense web GUI, which was stealing HAProxy traffic
Some checks failed
Run Check Script / check (pull_request) Failing after 10s
2026-04-01 22:31:54 -04:00
d937813fd4 ignore stress test config and db files
Some checks failed
Run Check Script / check (pull_request) Failing after 10s
2026-04-01 21:12:08 -04:00
8b5ca51fba chore: Improve opnsense constructor signature
Some checks failed
Run Check Script / check (pull_request) Failing after 11s
2026-04-01 21:10:31 -04:00
3afaa38ba0 feat: Network stress test utility that randomly flaps switch ports and reboots OPNsense firewalls while running iperf, reporting statistics and events in a simple, clean UI 2026-04-01 21:09:02 -04:00
1a0e754c7a chore: Note some problems, improve some variable naming around opnsense automation
Some checks failed
Run Check Script / check (pull_request) Failing after 18s
2026-03-31 17:33:04 -04:00
0dc9b80010 chore: fix unused import and add TODO/doc comments from review
Some checks failed
Run Check Script / check (pull_request) Failing after 11s
- Remove unused `warn` import in pair integration example
- Add TODO comment for shared credentials limitation (ROADMAP/11)
- Add doc comments on DhcpServer::get_ip/get_host noting they return
  primary's address, not the CARP VIP
2026-03-31 13:17:44 -04:00
6554ac5341 docs: fix pair integration subnet in diagram, add to examples index
- Fixed network topology diagram in pair README: 192.168.10.x -> 192.168.1.x
  to match the actual code (OPNsense boots on .1 of 192.168.1.0/24)
- Added explanation of NIC juggling to the diagram section
- Updated single-VM "What's next" to link to pair example (was "in progress")
- Added opnsense_pair_integration to examples/README.md table and category
2026-03-31 12:29:35 -04:00
811c56086c fix(kvm): fix domiflist MAC parsing and pair test subnet
- Fixed VmInterface parsing: virsh domiflist has 5 columns (Interface,
  Type, Source, Model, MAC), not 4. MAC is at index 4, not 3.
- Changed pair integration subnet to 192.168.1.0/24 to match OPNsense's
  hard-coded default boot IP of .1.

Tested: the full pair integration run (--full) passes end-to-end with CARP VIP
configured on both firewalls (primary advskew=0, backup advskew=100).
2026-03-31 12:26:34 -04:00
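The domiflist fix above can be sketched as a small parser. This is a minimal illustration, not the actual VmInterface code: the struct here carries only name and MAC, and the function name is hypothetical. The point is the column layout — virsh domiflist rows have 5 columns (Interface, Type, Source, Model, MAC), so the MAC sits at index 4, not 3.

```rust
// Sketch of 5-column `virsh domiflist` row parsing. The MAC is the
// fifth column (index 4); the earlier bug read index 3 (Model).
#[derive(Debug, PartialEq)]
struct VmInterface {
    name: String,
    mac: String,
}

fn parse_domiflist_row(line: &str) -> Option<VmInterface> {
    let cols: Vec<&str> = line.split_whitespace().collect();
    // Skip the header, separator, and malformed rows: a real MAC
    // contains colons, the "MAC" header cell does not.
    if cols.len() != 5 || !cols[4].contains(':') {
        return None;
    }
    Some(VmInterface {
        name: cols[0].to_string(),
        mac: cols[4].to_string(),
    })
}

fn main() {
    let row = " vnet0       network   br-lan    virtio   52:54:00:aa:bb:cc";
    let iface = parse_domiflist_row(row).unwrap();
    println!("{} {}", iface.name, iface.mac);
}
```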
34d02d7291 feat(opnsense): add firewall pair VM integration example
Boots two OPNsense VMs, bootstraps both with NIC juggling to handle
the .1 IP conflict, then applies FirewallPairTopology with CarpVipScore.

The bootstrap sequence:
1. Boot both VMs on shared LAN bridge
2. Disable backup's LAN NIC
3. Bootstrap primary on .1, change IP to .2
4. Swap NICs (disable primary, enable backup)
5. Bootstrap backup on .1, change IP to .3
6. Re-enable all NICs
7. Apply pair scores (CARP VIP, VLANs, firewall rules)
8. Verify via API on both firewalls

Supports --full flag for single-shot CI execution.
2026-03-31 12:07:40 -04:00
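The NIC-juggling steps above reduce to an ordered set of virsh link toggles around the two bootstrap calls. A hypothetical sketch of that plan — domain names (fw-primary/fw-backup) and MACs are illustrative, and the bootstrap API calls themselves are elided as comments:

```rust
// Build the `virsh domif-setlink` commands that implement the
// NIC-juggling sequence (steps 2, 4, and 6 above).
fn domif_setlink(domain: &str, mac: &str, state: &str) -> String {
    format!("virsh domif-setlink {domain} {mac} {state}")
}

fn bootstrap_plan(primary_mac: &str, backup_mac: &str) -> Vec<String> {
    vec![
        // Step 2: disable backup's LAN NIC so only primary answers on .1
        domif_setlink("fw-backup", backup_mac, "down"),
        // Step 3: bootstrap primary on .1, move it to .2 (API calls elided)
        // Step 4: swap NICs — disable primary, enable backup
        domif_setlink("fw-primary", primary_mac, "down"),
        domif_setlink("fw-backup", backup_mac, "up"),
        // Step 5: bootstrap backup on .1, move it to .3 (API calls elided)
        // Step 6: re-enable all NICs
        domif_setlink("fw-primary", primary_mac, "up"),
    ]
}

fn main() {
    for cmd in bootstrap_plan("52:54:00:00:00:01", "52:54:00:00:00:02") {
        println!("{cmd}");
    }
}
```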
73785e7336 feat(kvm): add NIC link control for VM interface management
Adds set_interface_link() and list_interfaces() to KvmExecutor,
enabling programmatic up/down control of VM network interfaces by
MAC address.

This is essential for bootstrapping multiple VMs that boot with the
same default IP (e.g., OPNsense on 192.168.1.1) — disable all LAN
NICs, then enable and bootstrap one at a time.

Uses virsh domif-setlink and domiflist under the hood. Tested against
a live KVM VM.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 12:02:09 -04:00
8abcb68865 docs: update OPNsense VM integration for fully automated bootstrap
Major rewrite of OPNsense documentation to reflect the new unattended
workflow — no manual browser interaction required.

- Rewrote examples/opnsense_vm_integration/README.md: highlights --full
  CI mode, documents OPNsenseBootstrap automated steps, lists system
  requirements by distro
- Rewrote docs/use-cases/opnsense-vm-integration.md: removed manual
  Step 3 (SSH/webgui), added Phase 2 bootstrap description, updated
  architecture diagram with OPNsenseBootstrap layer
- Added OPNsense VM Integration to docs/README.md (was missing)
- Added OPNsense VM Integration to docs/use-cases/README.md (was missing)
- Added opnsense_vm_integration to examples/README.md quick reference
  table and Infrastructure category (was missing, marked as recommended)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 10:16:42 -04:00
35aab0ecfb fix(opnsense): fix bootstrap webgui port change and add SSH diagnostics
Fixes:
- CSRF token parser now extracts <input> tags individually instead of
  parsing whole lines, fixing the bug where <form name="iform"> on the
  same line as the CSRF hidden input caused the wrong name to be extracted
- extract_selected_option() for <select> dropdowns (webguiproto,
  ssl-certref) which extract_input_value() couldn't handle
- After webgui port change, explicitly restart lighttpd via SSH
  (configctl webgui restart) as a safety net — the PHP configd call
  can fail if lighttpd dies before executing it

Adds:
- diagnose_via_ssh() reports webgui config, listening ports, lighttpd
  process status, and configctl status — invaluable for troubleshooting
- Diagnostic output is shown automatically when wait_for_ready() fails

Tested: full --boot + integration test passes end-to-end with zero
manual interaction on a fresh OPNsense 26.1 VM.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 08:43:44 -04:00
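The CSRF parser fix above — extracting each input tag individually rather than scanning whole lines — can be sketched as follows. Function names and the token name are hypothetical; the real parser lives in OPNsenseBootstrap. Scoping attribute lookups to a single tag prevents a same-line form name attribute from being mistaken for the hidden field's name:

```rust
// Per-tag CSRF extraction: walk <input ...> tags one at a time and
// read name/value only from within the matched tag, so a <form
// name="iform"> on the same line cannot shadow the CSRF field.
fn extract_csrf(html: &str) -> Option<(String, String)> {
    let mut rest = html;
    while let Some(start) = rest.find("<input") {
        let tag_body = &rest[start..];
        let end = tag_body.find('>')?;
        let tag = &tag_body[..end];
        if tag.contains("type=\"hidden\"") {
            let name = attr(tag, "name")?;
            let value = attr(tag, "value")?;
            return Some((name, value));
        }
        rest = &tag_body[end..];
    }
    None
}

// Pull a double-quoted attribute value out of a single tag.
fn attr(tag: &str, key: &str) -> Option<String> {
    let pat = format!("{key}=\"");
    let i = tag.find(&pat)? + pat.len();
    let j = tag[i..].find('"')?;
    Some(tag[i..i + j].to_string())
}

fn main() {
    // Real OPNsense layout: form and hidden input on one line.
    let html = r#"<form name="iform"><input type="hidden" name="X-CSRFToken" value="abc123" /></form>"#;
    println!("{:?}", extract_csrf(html));
}
```

A line-based scan would have found `name="iform"` first; the tag-scoped lookup never sees it.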
ddab4d27eb feat(opnsense): integrate OPNsenseBootstrap into VM integration example
Replaces the manual browser steps (wizard, SSH, webgui port) with
automated OPNsenseBootstrap calls. Adds --full flag for CI-friendly
single-shot boot + test.

Working: login, wizard abort, SSH enable with root+password auth.
In progress: webgui port change (lighttpd falls back to port 80 —
needs fix for <select> dropdown extraction and CSRF token refresh).

Also adds:
- diagnose_via_ssh() for troubleshooting webgui status
- restart_webgui_via_ssh() safety net after port changes
- CSRF parser fix for same-line form+input HTML (real OPNsense layout)
- cookie_store(true) for reliable session management

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 07:59:04 -04:00
79d8aa39fc feat(opnsense): add OPNsenseBootstrap for unattended first-boot setup
Automates OPNsense initial setup via HTTP session authentication,
eliminating manual browser interaction. The module:

- Logs in with username/password (handles CSRF token extraction)
- Aborts the initial setup wizard via /api/core/initial_setup/abort
- Enables SSH with root login and password auth
- Changes the web GUI port (fire-and-forget, handles server restart)
- Provides wait_for_ready() polling helper

Uses reqwest with cookie jar for session management. No browser or
external dependencies needed — pure Rust HTTP client approach.

Includes unit tests for CSRF token extraction and HTML parsing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 06:09:30 -04:00
5e8e63ade7 test(opnsense): add unit tests for FirewallPairTopology
Tests cover:
- ensure_ready outcome merging (both Success)
- CarpVipScore applies VIPs to both firewalls with correct advskew
- CarpVipScore custom backup_advskew is respected
- CarpVipScore defaults backup_advskew to 100 when unset
- VlanScore uniform delegation applies to both firewalls

Uses httptest mock HTTP servers to intercept OPNsense API calls
without requiring real firewall devices. Adds httptest dev-dependency
to harmony crate and a #[cfg(test)] from_config constructor on
OPNSenseFirewall for test-friendly instantiation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 05:27:24 -04:00
cb2a650d8b feat(opnsense): add FirewallPairTopology for HA firewall pair management
Introduces a higher-order topology that wraps two OPNSenseFirewall
instances (primary + backup) and orchestrates score application across
both. CARP VIPs get differentiated advskew values (primary=0,
backup=configurable) while all other scores apply identically to both
firewalls.

Includes CarpVipScore, DhcpServer delegation, pair Score impls for all
existing OPNsense scores, and opnsense_from_config() factory method.

Also adds ROADMAP entries for generic firewall trait (10), delegation
macro, integration tests, and named config instances (11).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 05:12:03 -04:00
466a8aafd1 feat(postgresql): add wait_for_ready option to PostgreSQLConfig
Some checks failed
Run Check Script / check (pull_request) Failing after 12s
Add wait_for_ready field (default: true) to PostgreSQLConfig. When
enabled, K8sPostgreSQLInterpret waits for the cluster's -rw service
to exist after applying the Cluster CR, ensuring callers like
get_endpoint() succeed immediately.

This eliminates the retry loop in the harmony_sso example's
deploy_zitadel() -- ZitadelScore now deploys in a single pass because
the PG service is guaranteed to exist before Zitadel's Helm chart
init job tries to connect.

The deploy_zitadel function shrinks from a 5-attempt retry loop to a
simple score.interpret() call.
2026-03-30 08:45:49 -04:00
fabec7ac11 refactor: extract CoreDNSRewriteScore from harmony_sso example
Some checks failed
Run Check Script / check (pull_request) Failing after 11m35s
Move CoreDNS rewrite logic into a reusable Score at
harmony/src/modules/k8s/coredns.rs. The Score patches CoreDNS on
K3sFamily clusters to add name rewrite rules (e.g., mapping
sso.harmony.local to the in-cluster service FQDN).

K3sFamily/Default only, no-op on OpenShift. Idempotent.

The harmony_sso example now uses CoreDNSRewriteScore.interpret()
instead of an inline function.
2026-03-30 07:56:44 -04:00
3e2b8423e8 chore: clean up clippy warnings in zitadel and openbao modules
- Remove unused serde default functions in ZitadelSetupScore
- Replace redundant closures with function references (InterpretError::new)
- Allow dead_code on AppSearchEntry.id (needed for deserialization)
- Fix empty line after doc comment in ZitadelScore
- Remove unneeded return statement in generate_secure_password
2026-03-30 07:46:47 -04:00
c687d4e6b3 docs: add Phase 9 (SSO + Config Hardening) to roadmap
New roadmap phase covering the hardening path for the SSO config
management stack: builder pattern for OpenbaoSecretStore, ZitadelScore
PG readiness fix, CoreDNSRewriteScore, integration tests, and future
capability traits.

Updates current state to reflect implemented Zitadel OIDC integration
and harmony_sso example.
2026-03-30 07:37:24 -04:00
cd48675027 chore: cargo fmt 2026-03-29 09:01:00 -04:00
8cb59cc029 fix: SSO end-to-end fixes for device flow
- OpenbaoSetupScore: verify vault init state before trusting cached
  keys (handles cluster recreation with stale local keys file)
- ZitadelSetupScore: trim PAT whitespace (K8s secret had trailing
  newline that corrupted the Authorization header)
- ZitadelOidcAuth: resolve SSO hostname to 127.0.0.1 via reqwest
  resolve() so device flow works without /etc/hosts entries
- Fix OIDC discovery URL to include port (Zitadel issuer is
  http://sso.harmony.local:8080, not http://sso.harmony.local)

The full SSO flow now works end-to-end: deploy, provision identity,
configure JWT auth, trigger device flow. User sees verification URL
and code in the terminal.
2026-03-29 08:54:28 -04:00
772fcad3d7 refactor(harmony-sso): full SSO flow as default deployment
The example now deploys the complete SSO stack and uses it:

Phase 1: Deploy OpenBao + basic setup (init, unseal, policies, users)
Phase 2: CoreDNS patch + Deploy Zitadel + ZitadelSetupScore (creates
  project + device-code app) + OpenBao JWT auth (with real client_id)
Phase 3: Store config via SSO-authenticated OpenBao (triggers device
  flow on first run, uses cached session on re-run)

Removed --demo and --sso-demo flags. The default run IS the demo.
Kept --skip-zitadel and --cleanup.

On re-run: all deployments are idempotent, cached OIDC session is
reused, config is loaded from OpenBao without login prompt.
2026-03-29 08:37:25 -04:00
80e512caf7 feat(harmony-secret): implement JWT exchange for Zitadel OIDC -> OpenBao
Fix the core SSO authentication flow: instead of storing the Zitadel
access_token as the OpenBao token (which OpenBao doesn't recognize),
exchange the id_token with OpenBao's JWT auth method via
POST /v1/auth/{mount}/login to get a real OpenBao client token.

Changes:
- ZitadelOidcAuth: add openbao_url, jwt_auth_mount, jwt_role fields
- New exchange_jwt_for_openbao_token() method using reqwest (vaultrs
  0.7.4 has no JWT auth module)
- process_token_response() now exchanges id_token when openbao_url is
  set, falls back to access_token for backward compat
- OpenbaoSecretStore::new() accepts optional jwt_role + jwt_auth_mount
- All callers updated (lib.rs, openbao_chain example, harmony_sso)

This implements ADR 020-1 Step 6 (OpenBao JWT exchange).
2026-03-29 08:35:43 -04:00
d0b7c03e12 feat(zitadel): add ZitadelSetupScore for identity provisioning
New Score that provisions identity resources in a deployed Zitadel
instance via the Management API v1:
- Create projects
- Create OIDC applications (device-code grant for CLI/headless)
- Machine user provisioning (stubbed for future iteration)

Authenticates using the admin PAT from the iam-admin-pat K8s secret
(provisioned automatically by the Zitadel Helm chart). No password
extraction or deprecated grant types needed.

All operations are idempotent: checks for existing resources before
creating. Results cached at ~/.local/share/harmony/zitadel/client-config.json.

This is the "day two" counterpart to ZitadelScore, enabling enterprise
automation of identity management (users, machines, applications, groups).
2026-03-29 08:31:49 -04:00
4a66880a84 fix(harmony-k8s): make API discovery cache invalidatable
Replace OnceCell<Discovery> with RwLock<Option<Arc<Discovery>>> so the
cache can be cleared after installing CRDs or operators that register
new API groups.

Add invalidate_discovery() method. Call it in ensure_cnpg_operator()
after confirming the Cluster CRD is registered, so the subsequent
apply() call sees the new CRD without needing a fresh client.

This eliminates the "Cannot resolve GVK" retry loop -- PostgreSQL
Cluster resources now apply on the first attempt after CNPG operator
installation.
2026-03-29 07:30:33 -04:00
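The cache swap described above can be sketched with the standard library alone. This is a simplified stand-in (a String plays the role of the kube-rs Discovery object, and method names are illustrative); the shape is the point: RwLock<Option<Arc<T>>> supports clearing, which OnceCell does not.

```rust
use std::sync::{Arc, RwLock};

// Invalidatable discovery-cache pattern: a shared, clearable cache.
struct DiscoveryCache {
    inner: RwLock<Option<Arc<String>>>,
}

impl DiscoveryCache {
    fn new() -> Self {
        Self { inner: RwLock::new(None) }
    }

    // Return the cached value, running `refresh` only on a miss.
    fn get_or_refresh(&self, refresh: impl FnOnce() -> String) -> Arc<String> {
        if let Some(d) = self.inner.read().unwrap().as_ref() {
            return Arc::clone(d);
        }
        let fresh = Arc::new(refresh());
        *self.inner.write().unwrap() = Some(Arc::clone(&fresh));
        fresh
    }

    // Unlike OnceCell, the cache can be cleared — e.g. after a CRD
    // install registers new API groups.
    fn invalidate(&self) {
        *self.inner.write().unwrap() = None;
    }
}

fn main() {
    let cache = DiscoveryCache::new();
    let v1 = cache.get_or_refresh(|| "discovery-v1".to_string());
    cache.invalidate(); // e.g. after ensure_cnpg_operator() confirms the CRD
    let v2 = cache.get_or_refresh(|| "discovery-v2".to_string());
    println!("{v1} {v2}");
}
```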
ec1bdbab73 feat(harmony-sso): add CoreDNS rewrite for in-cluster hostname resolution
Patch CoreDNS on K3sFamily to add rewrite rules that map external
hostnames (sso.harmony.local, bao.harmony.local) to cluster service
FQDNs. This allows OpenBao's JWT auth to fetch Zitadel's JWKS from
inside the cluster, where Zitadel validates Host headers against its
ExternalDomain.

Uses apply_dynamic with force_conflicts since the CoreDNS ConfigMap
is owned by the k3d deployer. Restarts CoreDNS pods after patching.
No-op on non-K3sFamily distributions (OpenShift, etc.).

Idempotent: skips patching if rewrite rules already present.
2026-03-29 07:22:54 -04:00
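The rewrite rules this Score injects into the Corefile would look roughly like the following. The service FQDNs here are assumptions for illustration — the actual namespaces and service names depend on the Helm releases:

```
rewrite name sso.harmony.local zitadel.zitadel.svc.cluster.local
rewrite name bao.harmony.local openbao.openbao.svc.cluster.local
```

With these in place, in-cluster clients (such as OpenBao fetching Zitadel's JWKS) resolve the external hostnames to the cluster services directly.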
09b704e9cf fix(postgresql): wait for CNPG CRD registration after operator install
The CNPG operator deployment being ready does not guarantee that the
Cluster CRD is registered in the API server's discovery cache. This
caused intermittent "Cannot resolve GVK: postgresql.cnpg.io/v1/Cluster"
errors when applying PostgreSQL Cluster resources immediately after
operator installation.

Add wait_for_crd() to harmony-k8s that polls has_crd() until the CRD
appears (2s interval, 60s timeout). Call it in ensure_cnpg_operator()
after the deployment readiness check.

This eliminates the need for retry loops in callers like harmony_sso.
2026-03-29 07:11:34 -04:00
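The wait_for_crd() loop described above is a fixed-interval poll with a deadline. A generic sketch under stated assumptions — the helper name, and the predicate standing in for has_crd(), are illustrative:

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

// Re-check a predicate at a fixed interval until it holds or a
// deadline passes; returns whether the condition was ever met.
fn wait_until(mut check: impl FnMut() -> bool, interval: Duration, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if check() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        sleep(interval);
    }
}

fn main() {
    // Simulate a CRD that shows up in discovery on the third poll.
    let mut polls = 0;
    let found = wait_until(
        || { polls += 1; polls >= 3 },
        Duration::from_millis(1),   // 2s interval in the real helper
        Duration::from_millis(100), // 60s timeout in the real helper
    );
    println!("found={found} polls={polls}");
}
```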
8e3e935459 refactor(harmony-sso): use OpenbaoSetupScore instead of imperative orchestration
Replace ~200 lines of manual init/unseal/configure/jwt-auth code with
a single OpenbaoSetupScore invocation. The deployment path is now:

1. OpenbaoScore (Helm deploy)
2. OpenbaoSetupScore (init, unseal, policies, users, JWT auth)
3. ZitadelScore (CNPG + Helm, with retry)

The example main.rs goes from ~800 lines to ~370 lines. The removed
imperative logic now lives in the reusable OpenbaoSetupScore which can
be tested against any topology.
2026-03-29 06:45:36 -04:00
c388d5234f feat(openbao): add OpenbaoSetupScore for post-deployment lifecycle
New Score that handles the operational complexity of making a deployed
OpenBao instance operational:
- Init (operator init) with local key storage (~/.local/share/harmony/openbao/)
- Unseal (3 of 5 keys)
- Enable KV v2 secrets engine
- Create configurable policies (HCL)
- Enable userpass auth and create users
- Optional JWT auth configuration for OIDC integration

All steps are idempotent. Requires T: Topology + K8sClient.

This encapsulates the tribal knowledge of OpenBao lifecycle management
into a compiled, type-checked Score that can be tested against any
topology (k3d, OpenShift, kubeadm, bare metal).
2026-03-28 23:51:57 -04:00
d9d5ea718f docs: add Score design principles and capability architecture rules
docs/guides/writing-a-score.md:
- Add Design Principles section: capabilities are industry concepts not
  tools, Scores encapsulate operational complexity, idempotency rules,
  no execution order dependencies

CLAUDE.md:
- Add Capability and Score Design Rules section with the swap test:
  if swapping the underlying tool breaks Scores, the capability
  boundary is wrong
2026-03-28 23:48:12 -04:00
5415452f15 refactor(harmony-sso): replace kubectl with typed K8s APIs, add Zitadel deployment
Replace all Command::new("kubectl") calls with harmony-k8s K8sClient
methods:
- wait_for_pod_ready() instead of kubectl get pod jsonpath
- exec_pod_capture_output() for OpenBao init/unseal/configure
- delete_resource<MutatingWebhookConfiguration>() for webhook cleanup
- port_forward() instead of kubectl port-forward subprocess

Thread K3d and K8sClient through all functions instead of
reconstructing context strings. Consolidate path helpers into
harmony_data_dir().

Add Zitadel deployment via ZitadelScore with retry logic for CNPG CRD
registration race and PostgreSQL cluster readiness timing.

Add CLI flags: --demo, --sso-demo, --skip-zitadel, --cleanup.
Add --demo mode: ConfigManager with EnvSource + StoreSource<OpenbaoSecretStore>.
Configure OpenBao with harmony-dev policy, userpass auth, and JWT auth.
2026-03-28 23:48:00 -04:00
b05a341a80 feat(harmony-k8s, k3d): add exec_pod, delete_resource, port_forward, and k3d getters
harmony-k8s:
- exec_pod() and exec_pod_capture_output(): exec commands in pods by
  name (not just label), with proper stdout/stderr capture
- delete_resource<K>(): generic typed delete using ScopeResolver,
  idempotent (404 = Ok)
- port_forward(): native port forwarding via kube-rs Portforwarder +
  tokio TcpListener, replacing kubectl subprocess. Returns
  PortForwardHandle that auto-aborts on drop.

k3d:
- base_dir(), cluster_name(), context_name() public getters

Also adds tokio "net" feature to workspace for TcpListener.
2026-03-28 23:47:42 -04:00
d0252bf1dc wip: harmony_sso example deploying zitadel and openbao seems to be working for config backend!
Some checks failed
Run Check Script / check (pull_request) Failing after 15s
2026-03-28 18:20:01 -04:00
f33d730645 fix(opnsense): improve idempotency in VIP, LAGG, and firewall modules
VIP: Fix subnet matching from starts_with() to exact equality. Previously
"192.168.1.10" would wrongly match a request for "192.168.1.100".

LAGG: Add config diff detection when updating existing LAGGs. Logs a
warning with previous config when protocol, description, or MTU differs
from desired state.

Firewall: Detect duplicate rules with same description and warn. When
multiple rules share a description, updates the first one and logs a
warning suggesting unique descriptions.

7 new tests proving:
- VIP exact subnet match (rejects prefix match, finds exact, mode check)
- Firewall create/update/duplicate/different-description scenarios
2026-03-28 13:48:29 -04:00
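The VIP prefix-match bug above is easy to reproduce. A minimal sketch (struct and function names are hypothetical, not the actual module code) showing why exact equality is required:

```rust
// Exact-match VIP lookup. The old prefix comparison
// (wanted.starts_with(&vip.subnet)) let an existing "192.168.1.10"
// VIP satisfy a request for "192.168.1.100".
struct Vip {
    subnet: String,
    mode: String,
}

fn find_vip<'a>(vips: &'a [Vip], wanted: &str) -> Option<&'a Vip> {
    // Before: vips.iter().find(|v| wanted.starts_with(&v.subnet))
    vips.iter().find(|v| v.subnet == wanted)
}

fn main() {
    let vips = vec![
        Vip { subnet: "192.168.1.10".into(), mode: "ipalias".into() },
        Vip { subnet: "192.168.1.100".into(), mode: "carp".into() },
    ];
    // With prefix matching this would wrongly return the .10 VIP.
    let hit = find_vip(&vips, "192.168.1.100").unwrap();
    println!("{} {}", hit.subnet, hit.mode);
}
```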
6040e2394e add CLAUDE.md
Some checks failed
Run Check Script / check (pull_request) Failing after 16s
2026-03-28 13:33:04 -04:00
a7f9b1037a refactor: push harmony_types enums all the way down to opnsense-api
Some checks failed
Run Check Script / check (pull_request) Failing after 19s
Move vendor-neutral IaC enums to harmony_types::firewall. Add From impls
in opnsense-api::wire converting harmony_types to generated OPNsense
types. Add typed methods in opnsense-config that accept harmony_types
enums and handle wire conversion internally.

Score layer no longer builds serde_json::json!() bodies — it passes
harmony_types enums directly to opnsense-config typed methods:
  ensure_filter_rule(&FirewallAction, &Direction, &IpProtocol, ...)
  ensure_snat_rule_from(&IpProtocol, &NetworkProtocol, ...)
  ensure_dnat_rule(&IpProtocol, &NetworkProtocol, ...)
  ensure_vip_from(&VipMode, ...)
  ensure_lagg(..., &LaggProtocol, ...)

Type flow: harmony_types → Score → opnsense-config → From<> → generated → wire
No strings cross layer boundaries for typed fields.
2026-03-26 11:07:49 -04:00
b98b2aa3f7 refactor: move IaC enums to harmony_types, translate in opnsense-api
Move vendor-neutral firewall and network types (FirewallAction, Direction,
IpProtocol, NetworkProtocol, VipMode, LaggProtocol) from harmony Score
modules to harmony_types::firewall as industry-standard IaC types.

Display impls use human-readable names (IPv4, CARP, LACP) — not wire
format. OPNsense-specific wire translations live in opnsense-api::wire
via the ToOPNsenseValue trait ("inet", "carp", "lacp").

Dependency chain: harmony_types → opnsense-api → opnsense-config → harmony.
Users import types from harmony_types, translations happen transparently
in the infrastructure layer.

Includes 6 new tests verifying all wire value translations.
2026-03-26 10:11:53 -04:00
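The Display-versus-wire split described above can be sketched like this. The trait and enum names mirror the commit description; the exact variants and impls in the codebase may differ:

```rust
use std::fmt;

// Human-readable names live in Display; OPNsense wire values live in
// a separate trait so no wire strings leak into user-facing output.
#[derive(Debug)]
enum IpProtocol {
    Inet,
    Inet6,
}

trait ToOPNsenseValue {
    fn to_opnsense_value(&self) -> &'static str;
}

impl fmt::Display for IpProtocol {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // For logs and UIs
        match self {
            IpProtocol::Inet => write!(f, "IPv4"),
            IpProtocol::Inet6 => write!(f, "IPv6"),
        }
    }
}

impl ToOPNsenseValue for IpProtocol {
    // Wire format expected by the OPNsense API
    fn to_opnsense_value(&self) -> &'static str {
        match self {
            IpProtocol::Inet => "inet",
            IpProtocol::Inet6 => "inet6",
        }
    }
}

fn main() {
    let p = IpProtocol::Inet;
    println!("{} -> {}", p, p.to_opnsense_value());
}
```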
1b86c895a5 refactor(opnsense): replace stringly-typed fields with enums across Scores
Some checks failed
Run Check Script / check (pull_request) Failing after 19s
Add shared enums for firewall, NAT, and LAGG Score definitions:
- FirewallAction (Pass, Block, Reject)
- Direction (In, Out)
- IpProtocol (Inet, Inet6) — shared across filter, SNAT, DNAT
- NetworkProtocol (Tcp, Udp, TcpUdp, Icmp, Any) — shared across all rule types
- LaggProtocol (Lacp, Failover, LoadBalance, RoundRobin, None)

Combined with the VipMode enum from the previous commit, all OPNsense
Score definitions now use proper types instead of raw strings. Typos in
mode/action/direction/protocol fields are now compile-time errors.
2026-03-26 00:06:40 -04:00
2a15a0d10b refactor(opnsense): use VipMode enum instead of string for VIP mode
Replace the stringly-typed mode field in VipDef with a VipMode enum
(IpAlias, Carp, ProxyArp). Prevents typos and makes the API discoverable
through IDE autocompletion. The as_api_str() method converts to the wire
format expected by OPNsense.
2026-03-25 23:58:01 -04:00
da90dc55ad chore: cargo fmt across workspace
Some checks failed
Run Check Script / check (pull_request) Failing after 19s
2026-03-25 23:20:57 -04:00
516626a0ce docs: add OPNsense VM integration tutorial and architecture challenges
New use-case tutorial walking newcomers through the full OPNsense VM
integration test: system setup, VM boot, SSH config, running all 11
Scores, and understanding the three-layer architecture.

Add architecture-challenges.md analyzing topology evolution during
deployment, runtime plan/validation phase, and TUI as primary interface.
2026-03-25 23:20:45 -04:00
6c664e9f34 docs(roadmap): add phases 7-8 for OPNsense and HA OKD production
Add Phase 7 (OPNsense & Bare-Metal Network Automation) tracking current
progress on OPNsense Scores, codegen, and Brocade integration. Details
the UpdateHostScore requirement and HostNetworkConfigurationScore rework
needed for LAGG LACP 802.3ad.

Add Phase 8 (HA OKD Production Deployment) describing the target
architecture with LAGG/CARP/multi-WAN/BINAT and validation checklist.

Update current state section to reflect opnsense-codegen branch progress.
2026-03-25 23:20:35 -04:00
082ea8a666 feat(harmony): add duration timing to Score::interpret
Every Score execution now logs its status and elapsed time after
completion. The timing is measured in Score::interpret (the central
execution path) so it applies to all Scores automatically.

Example output:
  [VlanScore] SUCCESS in 0.9s — Created 2 VLANs
  [DhcpScore] SUCCESS in 1.8s — Dhcp execution successful
  [LoadBalancerScore] FAILED after 45.3s — connection refused
2026-03-25 23:20:24 -04:00
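A central timing wrapper of this shape is straightforward. This is a hedged sketch — the function name and result type are hypothetical stand-ins for the actual Score::interpret path:

```rust
use std::time::Instant;

// Time any interpret-style closure and report status plus elapsed
// seconds in the log format shown above.
fn timed_interpret<T>(
    name: &str,
    run: impl FnOnce() -> Result<T, String>,
) -> Result<T, String> {
    let start = Instant::now();
    let result = run();
    let secs = start.elapsed().as_secs_f64();
    match &result {
        Ok(_) => println!("[{name}] SUCCESS in {secs:.1}s"),
        Err(e) => println!("[{name}] FAILED after {secs:.1}s — {e}"),
    }
    result
}

fn main() {
    let _ = timed_interpret("VlanScore", || Ok::<_, String>("Created 2 VLANs"));
    let _ = timed_interpret("LoadBalancerScore", || {
        Err::<(), _>("connection refused".to_string())
    });
}
```

Because the measurement sits on the single shared execution path, no individual Score needs its own timing code.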
d33125bba8 feat(okd): automate SCP uploads, implement wait_for_bootstrap_complete
Replace manual scp prompts in bootstrap_02 and ipxe with automated
StaticFilesHttpScore uploads. SCOS installer images and HTTP boot files
now upload via SFTP without operator intervention.

Implement wait_for_bootstrap_complete by shelling out to
openshift-install wait-for bootstrap-complete with stdout/stderr logging.
Previously this was a todo!() that would panic and crash mid-deployment.

Add [Stage 02/Bootstrap] prefixes to all bootstrap_02 log messages.
Improve bootstrap_okd_node outcome to include per-host details with
MAC addresses.
2026-03-25 23:20:16 -04:00
1f0a7ed5a5 feat(opnsense): implement Url::Url support in HTTP and TFTP infra
Replace todo!() in OPNSenseFirewall HTTP and TFTP serve_files with
download-then-upload logic. When a Url::Url is provided, download the
remote file to a temp directory via reqwest, then upload to OPNsense
via the existing SFTP path.

Enables StaticFilesHttpScore and TftpScore to serve files from remote
URLs (e.g. S3) in addition to local folders.
2026-03-25 23:20:07 -04:00
c24fa9315b feat(harmony_assets): S3 credentials, folder upload, 19 tests
Fix S3Store to actually wire access_key_id/secret_access_key from config
into the AWS SDK credential provider. Add force_path_style for custom
endpoints (Ceph, MinIO). Add store_folder() for recursive directory upload.

New CLI command: upload-folder with --public-read/private ACL, env var
fallback for credentials, content-type auto-detection, progress bar.

Fix single-file upload --public-read default (was always true, now false).

Add 19 tests: Asset path computation, LocalStore fetch/cache/404/checksum
with httptest mocks, S3 key extraction, URL generation for custom/AWS
endpoints.
2026-03-25 23:19:58 -04:00
7475e7b75e feat(opnsense): implement remove_static_mapping and list_static_mappings
Wire the existing dnsmasq remove_static_mapping through the OPNSenseFirewall
infra layer. Add list_static_mappings at both config and infra layers for
querying current DHCP host entries. Includes 6 new unit tests with httptest
mocks covering empty, single/multi-MAC, multiple hosts, and skip edge cases.

Foundation for the upcoming UpdateHostScore.
2026-03-25 23:19:47 -04:00
d75ebcbb74 feat(opnsense): VipScore, DnatScore, LaggScore tested with 4-NIC VM
Some checks failed
Run Check Script / check (pull_request) Failing after 16s
Add VIP (IP alias / CARP) and destination NAT (port forwarding) Scores.
Update VM to 4 NICs (LAN, WAN, LAGG member 1, LAGG member 2) so LAGG
can be tested with failover protocol on vtnet2+vtnet3.

All 11 Scores pass end-to-end against OPNsense VM:
- LoadBalancerScore, DhcpScore, TftpScore, NodeExporterScore
- VlanScore (2 VLANs on vtnet0)
- FirewallRuleScore (filter rule with gateway support)
- OutboundNatScore (SNAT), BinatScore (1:1 NAT)
- VipScore (IP alias on LAN)
- DnatScore (port forward 8443→192.168.1.50:443)
- LaggScore (failover LAGG on vtnet2+vtnet3)
2026-03-25 16:59:52 -04:00
cea008e9c9 feat(opnsense): FirewallRuleScore, OutboundNatScore, BinatScore
Add Scores for managing OPNsense new-generation firewall filter rules,
outbound NAT (SNAT), and 1:1 NAT (BINAT) via the REST API.

- opnsense-config: firewall.rs module with idempotent CRUD for filter
  rules, SNAT rules, and BINAT rules (match by description)
- harmony: FirewallRuleScore (with gateway support for multi-WAN),
  OutboundNatScore, BinatScore
- All 3 tested end-to-end against OPNsense VM, idempotent on re-run
- Integration test now exercises 8 Scores total
2026-03-25 16:18:25 -04:00
ac9320fca4 feat(opnsense-codegen): expand custom ArrayField subclasses into full structs
Fix codegen to handle FilterRuleField, SourceNatRuleField, and other
custom *Field types that extend ArrayField. When an XML element has
a custom type AND child elements with type attributes, recursively
parse children into struct fields instead of falling back to
Option<String> stubs.

Also fix hyphenated field names (state-policy → state_policy with
serde rename) and avoid enum name collisions by using the full struct
name as prefix for custom *Field enums.

Regenerated firewall_filter.rs: now has full FirewallFilterRulesRule
(60+ fields including action, direction, gateway, source/dest nets),
FirewallFilterSnatrulesRule, FirewallFilterNptRule,
FirewallFilterOnetooneRule.

New generated modules:
- vip.rs — Virtual IPs (CARP, IP aliases, ProxyARP)
- firewall_alias.rs — Firewall aliases (host, network, port, URL, GeoIP)
- firewall_dnat.rs — Destination NAT / port forwarding rules
2026-03-25 16:00:35 -04:00
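The hyphenated-field fix above amounts to a small name transform in the codegen. A sketch (the function name is hypothetical): produce a valid Rust identifier plus the original XML name to emit as a serde rename when they differ.

```rust
// Map an XML element name to a Rust field name, returning the
// original name when a #[serde(rename = "...")] attribute is needed.
fn rust_field_name(xml_name: &str) -> (String, Option<String>) {
    if xml_name.contains('-') {
        (xml_name.replace('-', "_"), Some(xml_name.to_string()))
    } else {
        (xml_name.to_string(), None)
    }
}

fn main() {
    let (field, rename) = rust_field_name("state-policy");
    // Codegen would emit: #[serde(rename = "state-policy")] state_policy: ...
    println!("{field} {rename:?}");
}
```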
2b4c9ac3fb feat(opnsense): VlanScore and LaggScore for network infrastructure
Add VLAN and LAGG management via the OPNsense REST API:

- opnsense-config: vlan.rs and lagg.rs modules with idempotent CRUD
- harmony: VlanScore and LaggScore with OPNSenseFirewall integration
- VlanScore tested end-to-end against OPNsense VM (2 VLANs on vtnet0)
- LaggScore implemented but not VM-testable (needs physical NICs)
- Handle OPNsense select widget fields in VLAN interface responses
- Use direct post_typed calls (addItem/setItem/delItem/reconfigure)
2026-03-25 14:39:30 -04:00
fe22c50122 feat(opnsense): end-to-end validation of all OPNsense Scores
Run LoadBalancerScore, DhcpScore, TftpScore, and NodeExporterScore
against a real OPNsense VM to prove the XML→API migration works.

- Add Router impl for OPNSenseFirewall (gateway + /24 CIDR)
- Fix TFTP/NodeExporter API controller paths (general, not settings)
- Fix TFTP/NodeExporter body wrapper key (general, not module name)
- Fix dnsmasq DHCP range API endpoint (Range, not DhcpRang)
- Fix dnsmasq deserialization for OPNsense select widgets and empty []
- Fix DhcpHostBindingInterpret error propagation (was todo!())
- Expand VM integration example with all 4 Scores + API verification
2026-03-25 14:04:44 -04:00
f8d1f858d0 feat(opnsense): configurable API port, move web GUI to 9443
Add Config::from_credentials_with_api_port() and
OPNSenseFirewall::with_api_port() so the API port is not hardcoded
to 443. This allows running HAProxy on standard ports without
conflicting with the OPNsense web UI.

The integration example now instructs users to change the web GUI
port to 9443 (System > Settings > Administration > TCP Port) as
part of the manual setup, alongside enabling SSH.

The --status command detects whether the API is on 443 or 9443
and advises accordingly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 12:17:34 -04:00
8a435d2769 docs(opnsense-vm-integration): update README with current status
Document the full workflow, network architecture, manual SSH step,
Docker compatibility, known issues, and future improvements.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 12:07:35 -04:00
095801ac4d fix(opnsense-vm-integration): handle firmware update before package install
When OPNsense is on a base version that needs updating before packages
can install, attempt a firmware update and retry. Use high ports
(16443/18443) for test HAProxy services to avoid conflicting with
the OPNsense web UI on port 443.

Known issue: firmware update on a fresh 26.1 nano image may need
a manual reboot cycle before packages install successfully.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 12:04:50 -04:00
777213288e fix(opnsense-config): use serde_json::Value for HAProxy config traversal
The hand-written HaproxyGetResponse structs used HashMap which fails
when OPNsense returns [] for empty collections. The generated types
in opnsense-api handle this via opn_map, but opnsense-config had
duplicated structs without that fix.

Replace all hand-written HAProxy response types with serde_json::Value
traversal. This avoids the duplication and handles the []/{} duality.

Also fix integration example:
- Use high ports (16443, 18443) to avoid conflicting with web UI on 443
- Skip package install if already installed
- Use harmony_cli::cli_logger::init() instead of env_logger (safe to
  call multiple times)
- Increase verification timeout to 60s

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 11:42:35 -04:00
3fd333caa3 fix(opnsense-vm-integration): detect and fix Docker+libvirt FORWARD conflict
Docker sets iptables FORWARD policy to DROP, which blocks libvirt's
NAT networking (libvirt defaults to nftables which doesn't interact
with Docker's iptables chain).

Fix: setup-libvirt.sh now detects Docker and offers to switch libvirt
to the iptables firewall backend, so both sets of rules coexist.
The --check command warns about this mismatch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 11:08:03 -04:00
c2d817180b refactor(opnsense-vm-integration): clean two-phase workflow
Restructure the example into two clear phases:

Phase 1 (--boot): creates KVM network + VM, waits for web UI,
prints instructions for enabling SSH via the OPNsense GUI.

Phase 2 (default run): checks SSH is reachable, creates API key,
installs HAProxy, runs LoadBalancerScore, verifies via API.

The config.xml injection sets vtnet0=LAN (192.168.1.1) and
vtnet1=WAN (DHCP). SSH must be enabled manually in the web UI
because OPNsense has no REST API for SSH management and the
config.xml injection doesn't reliably enable sshd.

Future: use a pre-customized OPNsense image on S3 for CI.

Also add show_ssh_config example to opnsense-api crate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 10:26:23 -04:00
31c3a52750 feat(opnsense): config.xml injection for nano image + dual NIC setup
Add opnsense::image module for customizing OPNsense nano disk images:
- find_config_offset(): scans raw image for config.xml location
- replace_config_xml(): overwrites config with null-padded replacement
- minimal_config_xml(): generates WAN+LAN config for virtio NICs
- Supports auto-scanning for unknown images

KVM improvements:
- disk_from_path(): attach existing disk images (not just new volumes)
- start_vm() now idempotent (skips if already running)
- cdrom uses SATA bus instead of IDE (q35 compatibility)

Integration example updates:
- LAN on 192.168.1.0/24 (matches OPNsense defaults, host reachable)
- WAN on libvirt default network (internet access)
- Config.xml injection replaces em0/em1 with vtnet0/vtnet1
- API key creation via PHP script (writes to file, avoids escaping)

Status: VM boots, web UI responds at 192.168.1.1, interfaces assigned.
Remaining: SSH enablement in config.xml, API key creation, WAN subnet.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 09:30:36 -04:00
2e3af21b61 chore(opnsense-vm-integration): add setup-libvirt.sh script
Interactive script that installs packages, adds user to libvirt group,
starts libvirtd, and creates the default storage pool. Asks before
each step (or run with --yes for non-interactive).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 23:52:10 -04:00
bc1f8e8a9d feat(opnsense-vm-integration): add --check, --setup, --download subcommands
Add prerequisite checking (libvirtd, group membership, storage pool,
bunzip2) with clear error messages and fix suggestions.

Add --setup to print the exact sudo commands needed for initial setup.
Add --download to pre-fetch and decompress the OPNsense nano image.

Full flow: download image → create network with DHCP → boot VM →
discover IP via libvirt lease → wait for API → create API key via
SSH → install HAProxy + Caddy → run LoadBalancerScore → verify.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 23:45:41 -04:00
7eef3115e9 feat(kvm): add VM IP discovery, DHCP networks, and OPNsense integration example
KVM module enhancements:
- Add vm_ip() and wait_for_ip() to KvmExecutor using
  Domain::interface_addresses() for DHCP IP discovery
- Add DHCP range and static host entries to NetworkConfig/NetworkConfigBuilder
- Generate DHCP XML in network definitions for libvirt's built-in DHCP
- Export DhcpHost type

OPNsense VM integration example (opnsense-vm-integration):
- Boots OPNsense nano VM via KVM
- Discovers IP via libvirt DHCP lease query
- Creates API key via SSH
- Installs HAProxy + Caddy via firmware API
- Runs LoadBalancerScore (2 services: K8s API + HTTPS)
- Verifies HAProxy configuration via API

22 KVM unit tests pass (3 new DHCP tests).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 23:37:13 -04:00
d48200b3d5 docs(kvm): document XML template decision and upstream tracking
Explain why we use string templates for libvirt XML generation and
what the path to typed structs looks like. The best candidate is
libvirt-rust-xml (gen branch) which generates Rust structs from
libvirt's RelaxNG schemas via relaxng-gen, but it doesn't compile
yet (virtxml-domain has 6 errors as of baca481).

Also fix dead code in format_cdrom (redundant device_type branch).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 23:04:53 -04:00
b18c8d534a feat(kvm): add 17 unit tests and VM examples for all infrastructure patterns
Add comprehensive XML generation tests covering: multi-disk VMs,
multi-NIC configurations, MAC addresses, boot order, memory conversion,
sequential disk naming, custom storage pools, NAT/route/isolated
networks, volume sizing, builder defaults, q35 machine type, and
serial console.

Add kvm-vm-examples binary with 5 scenarios:
- alpine: minimal 512MB VM, fast boot for testing
- ubuntu: standard server with 25GB disk
- worker: multi-disk (60G OS + 2x100G Ceph OSD) for storage nodes
- gateway: dual-NIC (WAN NAT + LAN isolated) for firewall/router
- ha-cluster: full 7-VM deployment (gateway + 3 CP + 3 workers)

Each scenario has clean and status subcommands.

19 KVM unit tests pass (17 new + 2 existing).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 21:16:08 -04:00
474e5a8dd2 test(opnsense-api): add 11 e2e tests against real OPNsense instance
Add integration tests that verify the full stack against a real OPNsense
VM. Tests are #[ignore]d by default — run with:

  OPNSENSE_TEST_URL=https://10.99.99.1/api \
  OPNSENSE_TEST_KEY=key OPNSENSE_TEST_SECRET=secret \
  cargo test -p opnsense-api --test e2e_test -- --ignored

Tests cover:
- Firmware: status, package list
- Dnsmasq: settings/get, CRUD host lifecycle, add_static_mapping via config
- HAProxy: settings/get, CRUD server, configure_service + idempotency
- VLAN, WireGuard, Firewall: settings/get

Each test cleans up after itself. Do NOT run against production.

Also make DhcpConfigDnsMasq::new and LoadBalancerConfig::new pub for
external test usage.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 20:39:54 -04:00
dd92e15f96 test(opnsense-config): restore unit tests with httptest mocks
Add 14 unit tests covering the critical business logic:

Dnsmasq (11 tests):
- add_static_mapping: create new, update by IP, update by hostname,
  hostname/domain splitting, duplicate MAC handling
- Conflict detection: IP/hostname in different entries, multiple matches
- remove_static_mapping: partial remove, full delete, case insensitivity

Load balancer (3 tests):
- configure_service creates all components (healthcheck→server→backend→frontend)
- Idempotent replacement on same bind address (cascade delete then re-create)
- Isolation between services on different bind addresses

Tests use httptest to mock the OPNsense API — no VM or real firewall needed.
All 100 tests pass across the workspace (0 failures).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 20:33:31 -04:00
c608975d30 feat(opnsense-config): replace XML backend with REST API
Replace opnsense-config-xml dependency with opnsense-api. All
configuration CRUD now goes through the OPNsense REST API instead
of SSH + XML editing of /conf/config.xml.

Key changes:
- Config struct holds OpnsenseClient + SSH shell (for file ops only)
- Module handlers (dnsmasq, haproxy, caddy, tftp, node_exporter) are
  now API-backed with async methods
- apply()/save() are no-ops — each module calls reconfigure after mutations
- install_package uses firmware API with polling
- LoadBalancer uses new domain types (LbFrontend, LbBackend, LbServer,
  LbHealthCheck) instead of XML types, with UUID chaining via API
- Dnsmasq conflict detection logic preserved, adapted for API HashMap
- RwLock<Config> replaced with Arc<Config> — Config is now stateless

Benefits over XML approach:
- Per-module soft reload instead of "reload all services"
- Server-side validation of all changes
- No more hash-based race condition detection
- No more fragile XML schema coupling

SSH retained for: file uploads, PXE config writing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 19:35:40 -04:00
6c9472212c docs(opnsense-api): add README with example usage
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 19:07:32 -04:00
bc4dcdf942 feat(opnsense): upgrade to 26.1.5, handle array select widgets
- Pin vendor/core submodule to 26.1.5 tag (matches running firewall)
- Regenerate dnsmasq from model v1.0.9 (migrated during firmware upgrade)
- Handle array-style select widgets in enum deserialization: OPNsense
  sometimes returns [{value, selected}, ...] instead of {key: {value, selected}}
- Add firmware_upgrade and reboot examples for managing OPNsense updates
- All 7 modules validated against live OPNsense 26.1.5:
  dnsmasq, haproxy, caddy, vlan, lagg, wireguard, firewall

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 19:03:04 -04:00
8a7cbf4836 fix(opnsense-codegen): preserve unknown enum values with Other(String)
Replace lossy enum deserialization (unknown variants → None) with
Other(String) catch-all variant. This ensures unknown wire values
survive round-trips: reading an object and POSTing it back will not
silently destroy field values that the codegen doesn't recognize.

This is critical for data integrity — in a read-modify-write cycle,
dropping an unknown enum value would overwrite it with empty on the
next POST.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 18:17:32 -04:00
4af5e7ac19 feat(opnsense): generate types for all 7 modules with codegen fixes
Generate typed API models for HAProxy, Caddy, Firewall, VLAN, LAGG,
WireGuard (client/server/general), and regenerate Dnsmasq. All core
modules validated against a live OPNsense 26.1.2 instance.

Codegen improvements:
- Add --module-name and --api-key CLI flags for controlling output
  filenames and API response envelope keys
- Fix enum variant names starting with digits (prefix with V)
- Use value="" XML attribute for wire values instead of element names
- Handle unknown *Field types as opn_string (select widget safe)
- Forgiving enum deserialization (warn instead of error on unknown)
- Handle empty arrays in opn_string deserializer

Add per-module examples (list_haproxy, list_caddy, list_vlan, etc.)
and utility examples (raw_get, check_package, install_and_wait).
Extract shared client setup into examples/common/mod.rs.

Fix post_typed sending empty JSON body ({}) instead of no body,
which was causing 400 errors on firmware endpoints.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 18:11:02 -04:00
0dc2f94b06 feat(opnsense-api): add CRUD methods and common response types
Add entity-level CRUD operations (get_item, add_item, set_item,
del_item, search_items) and service management (reconfigure,
service_status) to OpnsenseClient. These map directly to OPNsense's
MVC controller patterns.

Add response module with UuidResponse, StatusResponse, and
SearchResponse<T> covering the standard OPNsense API response shapes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 16:35:40 -04:00
eff75f4118 misc: Add test dnsmasq end to end codegen 2026-03-24 15:29:01 -04:00
f28edb3134 feat(opnsense-codegen): codegen now works for dnsmasq end to end from the model to the api 2026-03-24 15:28:00 -04:00
88e6990051 feat(opnsense-api): examples to list packages and dnsmasq settings now working 2026-03-24 14:07:47 -04:00
8e9f8ce405 wip: opnsense-api crate to replace opnsense-config-xml 2026-03-24 13:26:36 -04:00
d87aa3c7e9 fix opnsense submodule url 2026-03-24 10:51:38 -04:00
90ec2b524a wip(codegen): generates ir and rust code successfully but not really tested yet 2026-03-24 10:23:52 -04:00
5572f98d5f wip(opnsense-codegen): Can now create IR that looks good from example, successfully parses real models too 2026-03-24 09:32:21 -04:00
8024e0d5c3 wip: opnsense codegen 2026-03-24 07:13:53 -04:00
238e7da175 feat: opnsense codegen basic example scaffolded, now we can start implementing real models 2026-03-23 23:27:40 -04:00
bf84bffd57 wip: config + secret merge with e2e sso examples incoming 2026-03-23 23:26:42 -04:00
d4613e42d3 wip: openbao + zitadel e2e setup and test for harmony_config 2026-03-22 21:27:06 -04:00
6a57361356 chore: Update config roadmap
Some checks failed
Run Check Script / check (pull_request) Failing after 12s
2026-03-22 19:04:16 -04:00
d0d4f15122 feat(config): Example prompting
Some checks failed
Run Check Script / check (pull_request) Failing after 14s
2026-03-22 18:18:57 -04:00
93b83b8161 feat(config): Sqlite storage and example 2026-03-22 17:43:12 -04:00
6ca8663422 wip: Roadmap for config 2026-03-22 16:57:36 -04:00
f6ce0c6d4f chore: Harmony short term roadmap 2026-03-22 11:43:43 -04:00
8a1eca21f7 Merge branch 'feat/harmony_assets' into feature/kvm-module 2026-03-22 11:26:04 -04:00
9d2308eca6 Merge remote-tracking branch 'origin/master' into feature/kvm-module
All checks were successful
Run Check Script / check (pull_request) Successful in 1m48s
2026-03-22 10:02:10 -04:00
ccc26e07eb feat: harmony_asset crate to manage assets, local, s3, http urls, etc
Some checks failed
Run Check Script / check (pull_request) Failing after 17s
2026-03-21 11:10:51 -04:00
9a67bcc96f Merge pull request 'fix/cnpgInstallation' (#251) from fix/cnpgInstallation into master
Some checks failed
Run Check Script / check (push) Successful in 1m45s
Compile and package harmony_composer / package_harmony_composer (push) Failing after 2m15s
Reviewed-on: #251
2026-03-20 21:02:53 +00:00
a377fc1404 Merge branch 'master' into fix/cnpgInstallation
All checks were successful
Run Check Script / check (pull_request) Successful in 1m44s
2026-03-20 20:56:30 +00:00
c9977fee12 fix: CI file moved
All checks were successful
Run Check Script / check (pull_request) Successful in 2m5s
2026-03-20 16:48:38 -04:00
64bf585e07 fix: remove check.sh with broken path handling and fix formatting
Some checks failed
Run Check Script / check (pull_request) Failing after 12s
2026-03-20 16:41:30 -04:00
44e2c45435 fix: flaky tests due to bad environment variables handling in harmony_config crate 2026-03-20 16:40:08 -04:00
cdccbc8939 fix: formatting and minor stuff 2026-03-20 16:34:48 -04:00
9830971d05 feat: Create namespace in k8s client and wait-for-namespace-ready utility functions 2026-03-20 16:15:51 -04:00
e1183ef6de feat: K8s postgresql score now ensures cnpg is installed 2026-03-20 07:02:26 -04:00
444fea81b8 docs: Fix examples cli in docs
Some checks failed
Run Check Script / check (pull_request) Failing after 12s
2026-03-19 22:52:05 -04:00
907ae04195 chore: Add book.sh script and ci.sh, moved check.sh to build/ folder
Some checks failed
Run Check Script / check (pull_request) Failing after 9s
2026-03-19 22:43:32 -04:00
64582caa64 docs: Major overhaul of documentation
Some checks failed
Run Check Script / check (pull_request) Failing after 10s
2026-03-19 22:38:55 -04:00
f5736fcc37 wip: Config and secret management merge planning and high-level documentation
Some checks failed
Run Check Script / check (pull_request) Failing after 43s
2026-03-19 17:02:17 -04:00
7a1e84fb68 doc: Adr 020 on interactive harmony configuration for great UX 2026-03-18 10:40:19 -04:00
8499f4d1b7 Merge pull request 'fix: small details were preventing to re-save frontends,backends and healthchecks in opnsense UI' (#248) from fix/load-balancer-xml into master
Some checks failed
Run Check Script / check (push) Has been cancelled
Compile and package harmony_composer / package_harmony_composer (push) Has been cancelled
Reviewed-on: #248
2026-03-17 14:38:35 +00:00
231d9b878e debt: Ignore interactive tests with inquire prompts
All checks were successful
Run Check Script / check (pull_request) Successful in 1m21s
2026-03-15 11:37:31 -04:00
ee2dade0be Merge remote-tracking branch 'origin/master' into feat/brocade_assisted_setup
Some checks failed
Run Check Script / check (pull_request) Failing after 1m28s
2026-03-15 10:12:22 -04:00
aa07f4c8ad Merge pull request 'fix/dynamically_get_public_domain' (#234) from fix/dynamically_get_public_domain into master
Some checks failed
Compile and package harmony_composer / package_harmony_composer (push) Failing after 1m48s
Run Check Script / check (push) Failing after 11m1s
Reviewed-on: #234
Reviewed-by: johnride <jg@nationtech.io>
2026-03-15 14:07:25 +00:00
77bb138497 Merge remote-tracking branch 'origin/master' into fix/dynamically_get_public_domain
All checks were successful
Run Check Script / check (pull_request) Successful in 1m20s
2026-03-15 09:54:36 -04:00
a16879b1b6 Merge pull request 'fix: readded tokio retry to get ca cert for a nats cluster which was accidentally removed during a refactor' (#229) from fix/nats-ca-cert-retry into master
Some checks failed
Compile and package harmony_composer / package_harmony_composer (push) Failing after 2m1s
Run Check Script / check (push) Failing after 12m23s
Reviewed-on: #229
2026-03-15 12:36:05 +00:00
f57e6f5957 Merge pull request 'feat: add priorityClass to node_health daemonset' (#249) from feat/health_endpoint_priority_class into master
Some checks failed
Run Check Script / check (push) Successful in 1m28s
Compile and package harmony_composer / package_harmony_composer (push) Failing after 1m59s
Reviewed-on: #249
2026-03-14 18:53:30 +00:00
7605d05de3 fix: opnsense fixes for st-mcd (cb1)
All checks were successful
Run Check Script / check (pull_request) Successful in 1m29s
2026-03-13 13:13:37 -04:00
b244127843 feat: add priorityClass to node_health daemonset
All checks were successful
Run Check Script / check (pull_request) Successful in 1m27s
2026-03-13 11:18:18 -04:00
67c3265286 fix: small details were preventing to re-save frontends,backends and healthchecks in opnsense UI
All checks were successful
Run Check Script / check (pull_request) Successful in 2m12s
2026-03-13 10:31:17 -04:00
d10598d01e Merge pull request 'okd load balancer using 1936 port http healthcheck' (#240) from feat/okd_loadbalancer_betterhealthcheck into master
Some checks failed
Run Check Script / check (push) Successful in 1m26s
Compile and package harmony_composer / package_harmony_composer (push) Failing after 1m56s
Reviewed-on: #240
2026-03-10 17:45:51 +00:00
61ba7257d0 fix: remove broken test
All checks were successful
Run Check Script / check (pull_request) Successful in 1m22s
2026-03-10 13:40:24 -04:00
8798110bf3 feat: linux vm example with cdrom boot and iso download features
All checks were successful
Run Check Script / check (pull_request) Successful in 1m32s
2026-03-08 21:48:04 -04:00
1508d431c0 refactor: kvm module now efficiently encapsulates libvirt complexity behind builder patterns, no more xml 2026-03-08 12:08:19 -04:00
caf6f0c67b Add KVM module for managing virtual machines
- KVM module with connection configuration (local/SSH)
- VM lifecycle management (create/start/stop/destroy/delete)
- Network management (create/delete isolated virtual networks)
- Volume management (create/delete storage volumes)
- Example: OKD HA cluster deployment with OPNsense firewall
- All VMs configured for PXE boot with isolated network

The KVM module uses virsh command-line tools for management and is fully integrated with Harmony's architecture. It provides a clean Rust API for defining VMs, networks, and volumes. The example demonstrates deploying a complete OKD high-availability cluster (3 control planes, 3 workers) plus OPNsense firewall on an isolated network.
2026-03-08 08:06:10 -04:00
b0e9594d92 Merge branch 'master' into feat/okd_loadbalancer_betterhealthcheck
Some checks failed
Run Check Script / check (pull_request) Failing after 47s
2026-03-07 23:06:50 +00:00
bfb86f63ce fix: xml field for vlan
All checks were successful
Run Check Script / check (pull_request) Successful in 1m31s
2026-03-07 11:29:44 -05:00
d920de34cf fix: configure health_check: None for public_services
All checks were successful
Run Check Script / check (pull_request) Successful in 1m35s
2026-03-05 14:55:00 -05:00
4276b9137b fix: put the hc on private_services, not public_services
All checks were successful
Run Check Script / check (pull_request) Successful in 1m32s
2026-03-05 14:35:33 -05:00
6ab88ab8d9 Merge branch 'master' into feat/okd_loadbalancer_betterhealthcheck 2026-03-04 10:46:57 -05:00
53d0704a35 wip: okd load balancer using 1936 port http healthcheck
Some checks failed
Run Check Script / check (pull_request) Failing after 31s
2026-03-02 20:47:41 -05:00
de49e9ebcc feat: Brocade switch setup now asks questions for missing links instead of failing
Some checks failed
Run Check Script / check (pull_request) Failing after 53s
2026-02-19 10:31:47 -05:00
d8ab9d52a4 fix: broken test
All checks were successful
Run Check Script / check (pull_request) Successful in 1m0s
2026-02-17 15:34:42 -05:00
2cb7aeefc0 fix: deploys replicated postgresql with site 2 as standby
Some checks failed
Run Check Script / check (pull_request) Failing after 1m8s
2026-02-17 15:02:00 -05:00
16016febcf wip: adding impl details for deploying connected replica cluster 2026-02-16 16:22:30 -05:00
e709de531d fix: added route building to failover topology 2026-02-13 16:08:05 -05:00
6ab0f3a6ab wip 2026-02-13 15:48:24 -05:00
724ab0b888 wip: removed hardcoding and added fn to trait tlsrouter 2026-02-13 15:18:23 -05:00
8b6ce8d069 fix: readded tokio retry to get ca cert for a nats cluster which was accidentally removed during a refactor
All checks were successful
Run Check Script / check (pull_request) Successful in 1m9s
2026-02-06 09:09:01 -05:00
451 changed files with 66352 additions and 9815 deletions


@@ -15,4 +15,4 @@ jobs:
       uses: actions/checkout@v4
     - name: Run check script
-      run: bash check.sh
+      run: bash build/check.sh

.gitignore

@@ -29,3 +29,6 @@ Cargo.lock
# Useful to create ignore folders for temp files and notes
ignore
# Generated book
book

.gitmodules

@@ -1,3 +1,15 @@
 [submodule "examples/try_rust_webapp/tryrust.org"]
 	path = examples/try_rust_webapp/tryrust.org
 	url = https://github.com/rust-dd/tryrust.org.git
+[submodule "/home/jeangab/work/nationtech/harmony2/opnsense-codegen/vendor/core"]
+	path = /home/jeangab/work/nationtech/harmony2/opnsense-codegen/vendor/core
+	url = https://github.com/opnsense/core.git
+[submodule "/home/jeangab/work/nationtech/harmony2/opnsense-codegen/vendor/plugins"]
+	path = /home/jeangab/work/nationtech/harmony2/opnsense-codegen/vendor/plugins
+	url = https://github.com/opnsense/plugins.git
+[submodule "opnsense-codegen/vendor/core"]
+	path = opnsense-codegen/vendor/core
+	url = https://github.com/opnsense/core.git
+[submodule "opnsense-codegen/vendor/plugins"]
+	path = opnsense-codegen/vendor/plugins
+	url = https://github.com/opnsense/plugins.git


@@ -1,548 +0,0 @@
# CI and Testing Strategy for Harmony
## Executive Summary
Harmony aims to become a CNCF project, requiring a robust CI pipeline that demonstrates real-world reliability. The goal is to run **all examples** in CI, from simple k3d deployments to full HA OKD clusters on bare metal. This document provides context for designing and implementing this testing infrastructure.
---
## Project Context
### What is Harmony?
Harmony is an infrastructure automation framework that is **code-first and code-only**. Operators write Rust programs to declare and drive infrastructure, rather than YAML files or DSL configs. Key differentiators:
1. **Compile-time safety**: The type system prevents "config-is-valid-but-platform-is-wrong" errors
2. **Topology abstraction**: Write once, deploy to any environment (local k3d, OKD, bare metal, cloud)
3. **Capability-based design**: Scores declare what they need; topologies provide what they have
### Core Abstractions
| Concept | Description |
|---------|-------------|
| **Score** | Declarative description of desired state (the "what") |
| **Topology** | Logical representation of infrastructure (the "where") |
| **Capability** | A feature a topology offers (the "how") |
| **Interpret** | Execution logic connecting Score to Topology |
### Compile-Time Verification
```rust
// MyScore can target any topology providing K8sclient + HelmCommand
// (K8sAnywhereTopology does, so this pairing compiles)
impl<T: Topology + K8sclient + HelmCommand> Score<T> for MyScore { ... }
// K8sResourceScore requires K8sclient, so pairing it with
// LinuxHostTopology FAILS to compile (intentionally broken example for testing)
impl<T: Topology + K8sclient> Score<T> for K8sResourceScore { ... }
// error: LinuxHostTopology does not implement K8sclient
```
---
## Current Examples Inventory
### Summary Statistics
| Category | Count | CI Complexity |
|----------|-------|---------------|
| k3d-compatible | 22 | Low - single k3d cluster |
| OKD-specific | 4 | Medium - requires OKD cluster |
| Bare metal | 5 | High - requires physical infra or nested virtualization |
| Multi-cluster | 3 | High - requires multiple K8s clusters |
| No infra needed | 4 | Trivial - local only |
### Detailed Example Classification
#### Tier 1: k3d-Compatible (22 examples)
Can run on a local k3d cluster with minimal setup:
| Example | Topology | Capabilities | Special Notes |
|---------|----------|--------------|---------------|
| zitadel | K8sAnywhereTopology | K8sClient, HelmCommand | SSO/Identity |
| node_health | K8sAnywhereTopology | K8sClient | Health checks |
| public_postgres | K8sAnywhereTopology | K8sClient, HelmCommand, TlsRouter | Needs ingress |
| openbao | K8sAnywhereTopology | K8sClient, HelmCommand | Vault alternative |
| rust | K8sAnywhereTopology | K8sClient, HelmCommand, TlsRouter | Webapp deployment |
| cert_manager | K8sAnywhereTopology | K8sClient, CertificateManagement | TLS certificates |
| try_rust_webapp | K8sAnywhereTopology | K8sClient, HelmCommand, TlsRouter | Full webapp |
| monitoring | K8sAnywhereTopology | K8sClient, HelmCommand, Observability | Prometheus |
| application_monitoring_with_tenant | K8sAnywhereTopology | K8sClient, HelmCommand, TenantManager, Observability | Multi-tenant |
| monitoring_with_tenant | K8sAnywhereTopology | K8sClient, HelmCommand, TenantManager, Observability | Multi-tenant |
| postgresql | K8sAnywhereTopology | K8sClient, HelmCommand | CloudNativePG |
| ntfy | K8sAnywhereTopology | K8sClient, HelmCommand | Notifications |
| tenant | K8sAnywhereTopology | K8sClient, TenantManager | Namespace isolation |
| lamp | K8sAnywhereTopology | K8sClient, HelmCommand, TlsRouter | LAMP stack |
| k8s_drain_node | K8sAnywhereTopology | K8sClient | Node operations |
| k8s_write_file_on_node | K8sAnywhereTopology | K8sClient | Node operations |
| remove_rook_osd | K8sAnywhereTopology | K8sClient | Ceph operations |
| validate_ceph_cluster_health | K8sAnywhereTopology | K8sClient | Ceph health |
| kube-rs | Direct kube | K8sClient | Raw kube-rs demo |
| brocade_snmp_server | K8sAnywhereTopology | K8sClient | SNMP collector |
| harmony_inventory_builder | LocalhostTopology | None | Network scanning |
| cli | LocalhostTopology | None | CLI demo |
#### Tier 2: OKD/OpenShift-Specific (4 examples)
Require OKD/OpenShift features not available in vanilla K8s:
| Example | Topology | OKD-Specific Feature |
|---------|----------|---------------------|
| okd_cluster_alerts | K8sAnywhereTopology | OpenShift Monitoring CRDs |
| operatorhub_catalog | K8sAnywhereTopology | OpenShift OperatorHub |
| rhob_application_monitoring | K8sAnywhereTopology | RHOB (Red Hat Observability) |
| nats-supercluster | K8sAnywhereTopology | OKD Routes (OpenShift Ingress) |
#### Tier 3: Bare Metal Infrastructure (5 examples)
Require physical hardware or full virtualization:
| Example | Topology | Physical Requirements |
|---------|----------|----------------------|
| okd_installation | HAClusterTopology | OPNSense, Brocade switch, PXE boot, 3+ nodes |
| okd_pxe | HAClusterTopology | OPNSense, Brocade switch, PXE infrastructure |
| sttest | HAClusterTopology | Full HA cluster with all network services |
| opnsense | OPNSenseFirewall | OPNSense firewall access |
| opnsense_node_exporter | Custom | OPNSense firewall |
#### Tier 4: Multi-Cluster (3 examples)
Require multiple K8s clusters:
| Example | Topology | Clusters Required |
|---------|----------|-------------------|
| nats | K8sAnywhereTopology × 2 | 2 clusters with NATS gateways |
| nats-module | DecentralizedTopology | 3 clusters for supercluster |
| multisite_postgres | FailoverTopology | 2 clusters for replication |
---
## Testing Categories
### 1. Compile-Time Tests
These tests verify that the type system correctly rejects invalid configurations:
```rust
// Should NOT compile - K8sResourceScore on LinuxHostTopology.
// Note: #[compile_fail] is illustrative; Rust has no such built-in
// attribute, so in practice this becomes a trybuild compile-fail case.
#[test]
#[compile_fail]
fn test_k8s_score_on_linux_host() {
    let score = K8sResourceScore::new();
    let topology = LinuxHostTopology::new();
    // This line should fail to compile
    harmony_cli::run(Inventory::empty(), topology, vec![Box::new(score)], None);
}

// Should compile - K8sResourceScore on K8sAnywhereTopology
#[test]
fn test_k8s_score_on_k8s_topology() {
    let score = K8sResourceScore::new();
    let topology = K8sAnywhereTopology::from_env();
    // This should compile
    harmony_cli::run(Inventory::empty(), topology, vec![Box::new(score)], None);
}
```
**Implementation Options:**
- `trybuild` crate for compile-time failure tests
- Separate `tests/compile_fail/` directory with expected error messages
### 2. Unit Tests
Pure Rust logic without external dependencies:
- Score serialization/deserialization
- Inventory parsing
- Type conversions
- CRD generation
**Requirements:**
- No external services
- Sub-second execution
- Run on every PR
### 3. Integration Tests (k3d)
Deploy to a local k3d cluster:
**Setup:**
```bash
# Install k3d
curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
# Create cluster
k3d cluster create harmony-test \
  --agents 3 \
  --k3s-arg "--disable=traefik@server:0"
# Wait for ready
kubectl wait --for=condition=Ready nodes --all --timeout=120s
```
**Test Matrix:**
| Example | k3d | Test Type |
|---------|-----|-----------|
| zitadel | ✅ | Deploy + health check |
| cert_manager | ✅ | Deploy + certificate issuance |
| monitoring | ✅ | Deploy + metric collection |
| postgresql | ✅ | Deploy + database connectivity |
| tenant | ✅ | Namespace creation + isolation |
### 4. Integration Tests (OKD)
Deploy to OKD/OpenShift cluster:
**Options:**
1. **Nested virtualization**: Run OKD in VMs (slow, expensive)
2. **CRC (CodeReady Containers)**: Single-node OKD (resource intensive)
3. **Managed OpenShift**: AWS/Azure/GCP (costly)
4. **Existing cluster**: Connect to pre-provisioned cluster (fastest)
**Test Matrix:**
| Example | OKD Required | Test Type |
|---------|--------------|-----------|
| okd_cluster_alerts | ✅ | Alert rule deployment |
| rhob_application_monitoring | ✅ | RHOB stack deployment |
| operatorhub_catalog | ✅ | Operator installation |
### 5. End-to-End Tests (Full Infrastructure)
Complete infrastructure deployment including bare metal:
**Options:**
1. **Libvirt + KVM**: Virtual machines on CI runner
2. **Nested KVM**: KVM inside KVM (for cloud CI)
3. **Dedicated hardware**: Physical test lab
4. **Mock/Hybrid**: Mock physical components, real K8s
---
## CI Environment Options
### Option A: GitHub Actions (Current Standard)
**Pros:**
- Native GitHub integration
- Large runner ecosystem
- Free for open source
**Cons:**
- Limited nested virtualization support
- 6-hour job timeout
- Resource constraints on free runners
**Matrix:**
```yaml
strategy:
  matrix:
    os: [ubuntu-latest]
    rust: [stable, beta]
    k8s: [k3d, kind]
    tier: [unit, k3d-integration]

```
### Option B: Self-Hosted Runners
**Pros:**
- Full control over environment
- Can run nested virtualization
- No time limits
- Persistent state between runs
**Cons:**
- Maintenance overhead
- Cost of infrastructure
- Security considerations
**Setup:**
- Bare metal servers with KVM support
- Pre-installed k3d, kind, CRC
- OPNSense VM for network tests
### Option C: Hybrid (GitHub + Self-Hosted)
**Pros:**
- Fast unit tests on GitHub runners
- Heavy tests on self-hosted infrastructure
- Cost-effective
**Cons:**
- Two CI systems to maintain
- Complexity in test distribution
### Option D: Cloud CI (CircleCI, GitLab CI, etc.)
**Pros:**
- Often better resource options
- Docker-in-Docker support
- Better nested virtualization
**Cons:**
- Cost
- Less GitHub-native
---
## Performance Requirements
### Target Execution Times
| Test Category | Target Time | Current (est.) |
|---------------|-------------|----------------|
| Compile-time tests | < 30s | Unknown |
| Unit tests | < 60s | Unknown |
| k3d integration (per example) | < 120s | 60-300s |
| Full k3d matrix | < 15 min | 30-60 min |
| OKD integration | < 30 min | 1-2 hours |
| Full E2E | < 2 hours | 4-8 hours |
### Sub-Second Performance Strategies
1. **Parallel execution**: Run independent tests concurrently
2. **Incremental testing**: Only run affected tests on changes
3. **Cached clusters**: Pre-warm k3d clusters
4. **Layered testing**: Fail fast on cheaper tests
5. **Mock external services**: Fake Discord webhooks, etc.
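Strategy 1 can be sketched with the standard library alone. The test runner would use a tokio semaphore for the same purpose, but the bounded-worker shape is identical; the job names and `run_bounded` helper here are illustrative, not part of the codebase.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Run `jobs` with at most `limit` running concurrently: a std-only sketch
// of the semaphore pattern used for parallel example execution.
fn run_bounded(jobs: Vec<String>, limit: usize) -> Vec<String> {
    let queue = Arc::new(Mutex::new(jobs.into_iter()));
    let results = Arc::new(Mutex::new(Vec::new()));
    let mut workers = Vec::new();
    for _ in 0..limit {
        let queue = Arc::clone(&queue);
        let results = Arc::clone(&results);
        workers.push(thread::spawn(move || loop {
            // Take the next example off the shared queue, or stop.
            let job = match queue.lock().unwrap().next() {
                Some(j) => j,
                None => break,
            };
            // Placeholder for actually deploying the example in its namespace.
            results.lock().unwrap().push(format!("{job}: ok"));
        }));
    }
    for w in workers {
        w.join().unwrap();
    }
    Arc::try_unwrap(results).unwrap().into_inner().unwrap()
}

fn main() {
    let jobs = vec!["zitadel".into(), "monitoring".into(), "postgresql".into()];
    let done = run_bounded(jobs, 2);
    assert_eq!(done.len(), 3);
    println!("{} examples finished", done.len());
}
```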
---
## Test Data and Secrets Management
### Secrets Required
| Secret | Use | Storage |
|--------|-----|---------|
| Discord webhook URL | Alert receiver tests | GitHub Secrets |
| OPNSense credentials | Network tests | Self-hosted only |
| Cloud provider creds | Multi-cloud tests | Vault / GitHub Secrets |
| TLS certificates | Ingress tests | Generated on-the-fly |
### Test Data
| Data | Source | Strategy |
|------|--------|----------|
| Container images | Public registries | Cache locally |
| Helm charts | Public repos | Vendor in repo |
| K8s manifests | Generated | Dynamic |
---
## Proposed Test Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│ harmony_e2e_tests Package │
│ (cargo run -p harmony_e2e_tests) │
├─────────────────────────────────────────────────────────────────┤
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
│ │ Compile │ │ Unit │ │ Compile-Fail Tests │ │
│ │ Tests │ │ Tests │ │ (trybuild) │ │
│ │ < 30s │ │ < 60s │ │ < 30s │ │
│ └─────────────┘ └─────────────┘ └─────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ k3d Integration Tests │ │
│ │ Self-provisions k3d cluster, runs 22 examples │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ zitadel │ │ cert-mgr│ │ monitor │ │ postgres│ ... │ │
│ │ │ 60s │ │ 90s │ │ 120s │ │ 90s │ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │
│ │ Parallel Execution │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ OKD Integration Tests │ │
│ │ Connects to existing OKD cluster or provisions via KVM │ │
│ │ ┌─────────────────┐ ┌─────────────────────────────┐ │ │
│ │ │ okd_cluster_ │ │ rhob_application_ │ │ │
│ │ │ alerts (5 min) │ │ monitoring (10 min) │ │ │
│ │ └─────────────────┘ └─────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ KVM-based E2E Tests │ │
│ │ Uses Harmony's KVM module to provision test VMs │ │
│ │ ┌─────────────────┐ ┌─────────────────────────────┐ │ │
│ │ │ okd_installation│ │ Full HA cluster deployment │ │ │
│ │ │ (30-60 min) │ │ (60-120 min) │ │ │
│ │ └─────────────────┘ └─────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Any CI system (GitHub Actions, GitLab CI, Jenkins, cron) just runs:
cargo run -p harmony_e2e_tests
```
```
┌─ GitHub Actions ─────────────────────────────────────────────
│  Compile tests (< 30s) · Unit tests (< 60s)
│  Compile-fail tests (trybuild, < 30s)
│  k3d integration tests, run in parallel:
│    zitadel (60s) · cert-mgr (90s) · monitor (120s) · postgres (90s) · ...
└──────────────────────────────────────────────────────────────
┌─ Self-Hosted Runners ────────────────────────────────────────
│  OKD integration tests:
│    okd_cluster_alerts (5 min) · rhob_application_monitoring (10 min)
│  KVM-based E2E tests: Harmony KVM module provisions test VMs
│    - OKD HA Cluster (3 control plane, 2 workers)
│    - OPNSense VM (router/firewall)
│    - Brocade simulator VM
└──────────────────────────────────────────────────────────────
```
---
## Questions for Researchers
### Critical Questions
1. **Self-contained test runner**: How to design `harmony_e2e_tests` package that runs all tests with a single `cargo run` command?
2. **Nested Virtualization**: What are the prerequisites for running KVM inside a test environment?
3. **Cost Optimization**: How to minimize cloud costs while running comprehensive E2E tests?
4. **Test Isolation**: How to ensure test isolation when running parallel k3d tests?
5. **State Management**: Should we persist k3d clusters between test runs, or create fresh each time?
6. **Mocking Strategy**: Which external services (Discord, OPNSense, etc.) should be mocked vs. real?
7. **Compile-Fail Tests**: Best practices for testing Rust compile-time errors?
8. **Multi-Cluster Tests**: How to efficiently provision and connect multiple K8s clusters in tests?
9. **Secrets Management**: How to handle secrets for test environments without external CI dependencies?
10. **Test Flakiness**: Strategies for reducing flakiness in infrastructure tests?
11. **Reporting**: How to present test results for complex multi-environment test matrices?
12. **Prerequisite Detection**: How to detect and validate prerequisites (Docker, k3d, KVM) before running tests?
### Research Areas
1. **CI/CD Tools**: Evaluate GitHub Actions, GitLab CI, CircleCI, Tekton, Prow for Harmony's needs
2. **K8s Test Tools**: Evaluate kind, k3d, minikube, microk8s for local testing
3. **Mock Frameworks**: Evaluate mock-server, wiremock, hoverfly for external service mocking
4. **Test Frameworks**: Evaluate built-in Rust test, nextest, cargo-tarpaulin for performance
---
## Success Criteria
### Week 1 (Agentic Velocity)
- [ ] Compile-time verification tests working
- [ ] Unit tests for monitoring module
- [ ] First 5 k3d examples running in CI
- [ ] Mock framework for Discord webhooks
### Week 2
- [ ] All 22 k3d-compatible examples in CI
- [ ] OKD self-hosted runner operational
- [ ] KVM module reviewed and ready for CI
### Week 3-4
- [ ] Full E2E tests with KVM infrastructure
- [ ] Multi-cluster tests automated
- [ ] All examples tested in CI
### Month 2
- [ ] Sub-15-minute total CI time
- [ ] Weekly E2E tests on bare metal
- [ ] Documentation complete
- [ ] Ready for CNCF submission
---
## Prerequisites
### Hardware Requirements
| Component | Minimum | Recommended |
|-----------|---------|------------|
| CPU | 4 cores | 8+ cores (for parallel tests) |
| RAM | 8 GB | 32 GB (for KVM E2E) |
| Disk | 50 GB SSD | 500 GB NVMe |
| Docker | Required | Latest |
| k3d | Required | v5.6.0 |
| Kubectl | Required | v1.28.0 |
| libvirt | Required | 9.0.0 (for KVM tests) |
### Software Requirements
| Tool | Version |
|------|---------|
| Rust | 1.75+ |
| Docker | 24.0+ |
| k3d | v5.6.0+ |
| kubectl | v1.28+ |
| libvirt | 9.0.0 |
### Installation (One-time)
```bash
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Install Docker
curl -fsSL https://get.docker.com | sh
# Install k3d
curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
# Install kubectl
curl -LO "https://dl.k8s.io/release/v1.28.0/bin/linux/amd64/kubectl"
sudo install -m 0755 kubectl /usr/local/bin/kubectl
```
---
## Reference Materials
### Existing Code
- Examples: `examples/*/src/main.rs`
- Topologies: `harmony/src/domain/topology/`
- Capabilities: `harmony/src/domain/topology/` (trait definitions)
- Scores: `harmony/src/modules/*/`
### Documentation
- [Coding Guide](docs/coding-guide.md)
- [Core Concepts](docs/concepts.md)
- [Monitoring Architecture](docs/monitoring.md)
- [ADR-020: Monitoring](adr/020-monitoring-alerting-architecture.md)
### Related Projects
- Crossplane (similar abstraction model)
- Pulumi (infrastructure as code)
- Terraform (state management patterns)
- Flux/ArgoCD (GitOps testing patterns)


@@ -1,201 +0,0 @@
# Pragmatic CI and Testing Roadmap for Harmony
**Status**: Active implementation (March 2026)
**Core Principle**: Self-contained test runner — no dependency on centralized CI servers
All tests are executable via one command:
```bash
cargo run -p harmony_e2e_tests
```
The `harmony_e2e_tests` package:
- Provisions its own infrastructure when needed (k3d, KVM VMs)
- Runs all test tiers in sequence or selectively
- Reports results in text, JSON or JUnit XML
- Works identically on developer laptops, any Linux server, GitHub Actions, GitLab CI, Jenkins, cron jobs, etc.
- Is the single source of truth for what "passing CI" means
## Why This Approach
1. **Portability** — same command & behavior everywhere
2. **Harmony tests Harmony** — the framework validates itself
3. **No vendor lock-in** — GitHub Actions / GitLab CI are just triggers
4. **Perfect reproducibility** — developers reproduce any CI failure locally in seconds
5. **Offline capable** — after initial setup, most tiers run without internet
## Architecture: `harmony_e2e_tests` Package
```
harmony_e2e_tests/
├── Cargo.toml
├── src/
│ ├── main.rs # CLI entry point
│ ├── lib.rs # Test runner core logic
│ ├── tiers/
│ │ ├── mod.rs
│ │ ├── compile_fail.rs # trybuild-based compile-time checks
│ │ ├── unit.rs # cargo test --lib --workspace
│ │ ├── k3d.rs # k3d cluster + parallel example runs
│ │ ├── okd.rs # connect to existing OKD cluster
│ │ └── kvm.rs # full E2E via Harmony's own KVM module
│ ├── mocks/
│ │ ├── mod.rs
│ │ ├── discord.rs # mock Discord webhook receiver
│ │ └── opnsense.rs # mock OPNSense firewall API
│ └── infrastructure/
│ ├── mod.rs
│ ├── k3d.rs # k3d cluster lifecycle
│ └── kvm.rs # helper wrappers around KVM score
└── tests/
├── ui/ # trybuild compile-fail cases (*.rs + *.stderr)
└── fixtures/ # static test data / golden files
```
## CLI Interface (clap-based)
```bash
# Run everything (default)
cargo run -p harmony_e2e_tests
# Specific tier
cargo run -p harmony_e2e_tests -- --tier k3d
cargo run -p harmony_e2e_tests -- --tier compile
# Filter to one example
cargo run -p harmony_e2e_tests -- --tier k3d --example monitoring
# Parallelism control (k3d tier)
cargo run -p harmony_e2e_tests -- --parallel 8
# Reporting
cargo run -p harmony_e2e_tests -- --report junit.xml
cargo run -p harmony_e2e_tests -- --format json
# Debug helpers
cargo run -p harmony_e2e_tests -- --verbose --dry-run
```
## Test Tiers Ordered by Speed & Cost
| Tier | Duration target | Runner type | What it tests | Isolation strategy |
|------------------|------------------|----------------------|----------------------------------------------------|-----------------------------|
| Compile-fail | < 20 s | Any (GitHub free) | Invalid configs don't compile | Per-file trybuild |
| Unit | < 60 s | Any | Pure Rust logic | cargo test |
| k3d              | 8-15 min         | GitHub / self-hosted | 22+ k3d-compatible examples                        | Fresh k3d cluster + ns-per-example |
| OKD              | 10-30 min        | Self-hosted / CRC    | OKD-specific features (Routes, Monitoring CRDs…)   | Existing cluster via KUBECONFIG |
| KVM Full E2E     | 60-180 min       | Self-hosted bare-metal | Full HA OKD install + bare-metal scenarios       | Harmony KVM score provisions VMs |
### Tier Details & Implementation Notes
1. **Compile-fail**
Uses **`trybuild`** crate (standard in Rust ecosystem).
Place intentional compile errors in `tests/ui/*.rs` with matching `*.stderr` expectation files.
One test function replaces the old custom loop:
```rust
#[test]
fn ui() {
let t = trybuild::TestCases::new();
t.compile_fail("tests/ui/*.rs");
}
```
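Each ui case then pairs a source file with a recorded diagnostic (file names below are illustrative); running with `TRYBUILD=overwrite cargo test` regenerates the `.stderr` files when error messages intentionally change:

```
tests/ui/
├── k8s_score_on_linux_host.rs      # code that must fail to compile
└── k8s_score_on_linux_host.stderr  # expected rustc diagnostic, verbatim
```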
2. **Unit**
Simple wrapper: `cargo test --lib --workspace -- --nocapture`
Consider `cargo-nextest` later for a 2-3× speedup if test count grows.
3. **k3d**
- Provisions isolated cluster once at start (`k3d cluster create --agents 3 --no-lb --k3s-arg "--disable=traefik@server:0"`)
- Discovers examples via `test-tier = "k3d"` under `[package.metadata.harmony]` in each example's `Cargo.toml`
- Runs in parallel with tokio semaphore (default 5-8 slots)
- Each example gets its own namespace
- Uses `defer` / `scopeguard` for guaranteed cleanup
- Mocks Discord webhook and OPNSense API
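The guaranteed-cleanup bullet boils down to a `Drop` guard, which is what `scopeguard` generalizes. A self-checking std-only sketch follows; the namespace name and the thread-local recording mechanism are illustrative, and the real guard would call the Kubernetes API instead:

```rust
use std::cell::RefCell;

// Records deletions so the example is self-checking; the real tier would
// delete the namespace via the Kubernetes API here.
thread_local! {
    static DELETED: RefCell<Vec<String>> = RefCell::new(Vec::new());
}

struct NamespaceGuard {
    name: String,
}

impl Drop for NamespaceGuard {
    fn drop(&mut self) {
        // Runs on success, early return, and panic unwinding alike.
        DELETED.with(|d| d.borrow_mut().push(self.name.clone()));
    }
}

fn run_example(fail: bool) -> Result<(), String> {
    let _ns = NamespaceGuard { name: "e2e-zitadel".into() };
    if fail {
        return Err("deploy failed".into()); // guard still cleans up
    }
    Ok(())
}

fn main() {
    let _ = run_example(true);
    let _ = run_example(false);
    DELETED.with(|d| assert_eq!(d.borrow().len(), 2));
    println!("cleanup ran for every example");
}
```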
4. **OKD**
Connects to pre-provisioned cluster via `KUBECONFIG`.
Validates it is actually OpenShift/OKD before proceeding.
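One cheap validation is probing for an API group that only OpenShift/OKD serves, such as `route.openshift.io`. Passing the `kubectl api-versions` output in as a string keeps the check testable without a cluster; this is a sketch, not the package's actual code:

```rust
// Sketch: the OKD tier would run `kubectl api-versions` and feed its output
// to this predicate before touching the cluster.
fn is_openshift(api_versions: &str) -> bool {
    // `route.openshift.io` ships with every OpenShift/OKD install and is
    // absent from vanilla Kubernetes distributions.
    api_versions
        .lines()
        .any(|line| line.starts_with("route.openshift.io/"))
}

fn main() {
    assert!(is_openshift("apps/v1\nroute.openshift.io/v1"));
    assert!(!is_openshift("apps/v1\nnetworking.k8s.io/v1"));
    println!("validation logic ok");
}
```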
5. **KVM**
Uses **Harmony's own KVM module** to provision test VMs (control-plane + workers + OPNsense).
→ True “dogfooding” — if the E2E fails, the KVM score itself is likely broken.
## CI Integration Patterns
### Fast PR validation (GitHub Actions)
```yaml
name: Fast Tests
on: [push, pull_request]
jobs:
  fast:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      - name: Install Docker & k3d
        uses: nolar/setup-k3d-k3s@v1
      - run: cargo run -p harmony_e2e_tests -- --tier compile,unit,k3d --report junit.xml
      - uses: actions/upload-artifact@v4
        with: { name: test-results, path: junit.xml }
```
### Nightly / Merge heavy tests (self-hosted runner)
```yaml
name: Full E2E
on:
  schedule: [{ cron: "0 3 * * *" }]
  push: { branches: [main] }
jobs:
  full:
    runs-on: [self-hosted, linux, x64, kvm-capable]
    steps:
      - uses: actions/checkout@v4
      - run: cargo run -p harmony_e2e_tests -- --tier okd,kvm --verbose --report junit.xml
```
## Prerequisites Auto-Check & Install
```rust
// in harmony_e2e_tests/src/infrastructure/prerequisites.rs
async fn ensure_k3d() -> Result<()> { … } // curl | bash if missing
async fn ensure_docker() -> Result<()> { … }
fn check_kvm_support() -> Result<()> { … } // /dev/kvm + libvirt
```
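As a sketch of the KVM probe: parameterizing the device path keeps the logic unit-testable, and the real check would additionally verify libvirt connectivity. The function body below is illustrative, not the package's actual implementation.

```rust
use std::path::Path;

// Sketch of the KVM prerequisite probe: the minimal signal that hardware
// virtualization is available is the presence of the /dev/kvm device node.
fn check_kvm_support(dev: &Path) -> Result<(), String> {
    if dev.exists() {
        Ok(())
    } else {
        Err(format!(
            "{} not found: enable VT-x/AMD-V (and nested virt, if inside a VM)",
            dev.display()
        ))
    }
}

fn main() {
    // `/dev/null` exists everywhere, so this branch demonstrates success...
    assert!(check_kvm_support(Path::new("/dev/null")).is_ok());
    // ...and an impossible path demonstrates the error message.
    assert!(check_kvm_support(Path::new("/nonexistent/kvm")).is_err());
    println!("probe logic ok");
}
```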
## Success Criteria
### Step 1
- [ ] `harmony_e2e_tests` package created & basic CLI working
- [ ] trybuild compile-fail suite passing
- [ ] First 8-10 k3d examples running reliably in CI
- [ ] Mock server for Discord webhook completed
### Step 2
- [ ] All 22 k3d-compatible examples green
- [ ] OKD tier running on dedicated self-hosted runner
- [ ] JUnit reporting + GitHub check integration
- [ ] Namespace isolation + automatic retry on transient k8s errors
### Step 3
- [ ] KVM full E2E green on bare-metal runner (nightly)
- [ ] Multi-cluster examples (nats, multisite-postgres) automated
- [ ] Total fast CI time < 12 minutes on GitHub runners
- [ ] Documentation: “How to add a new tested example”
## Quick Start for New Contributors
```bash
# One-time setup
rustup update stable
cargo install cargo-nextest   # optional but recommended (trybuild is a dev-dependency, not a binary)
# Run locally (most common)
cargo run -p harmony_e2e_tests -- --tier k3d --verbose
# Just compile checks + unit
cargo test -p harmony_e2e_tests
```

146
CLAUDE.md Normal file

@@ -0,0 +1,146 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Build & Test Commands
```bash
# Full CI check (check + fmt + clippy + test)
./build/check.sh
# Individual commands
cargo check --all-targets --all-features --keep-going
cargo fmt --check # Check formatting
cargo clippy # Lint
cargo test # Run all tests
# Run a single test
cargo test -p <crate_name> <test_name>
# Run a specific example
cargo run -p <example_crate_name>
# Build the mdbook documentation
mdbook build
```
## What Harmony Is
Harmony is the orchestration framework powering NationTech's vision of **decentralized micro datacenters** — small computing clusters deployed in homes, offices, and community spaces instead of hyperscaler facilities. The goal: make computing cleaner, more resilient, locally beneficial, and resistant to centralized points of failure (including geopolitical threats).
Harmony exists because existing IaC tools (Terraform, Ansible, Helm) are trapped in a **YAML mud pit**: static configuration files validated only at runtime, fragmented across tools, with errors surfacing at 3 AM instead of at compile time. Harmony replaces this entire class of tools with a single Rust codebase where **the compiler catches infrastructure misconfigurations before anything is deployed**.
This is not a wrapper around existing tools. It is a paradigm shift: infrastructure-as-real-code with compile-time safety guarantees that no YAML/HCL/DSL-based tool can provide.
## The Score-Topology-Interpret Pattern
This is the core design pattern. Understand it before touching the codebase.
**Score** — declarative desired state. A Rust struct generic over `T: Topology` that describes *what* you want (e.g., "a PostgreSQL cluster", "DNS records for these hosts"). Scores are serializable, cloneable, idempotent.
**Topology** — infrastructure capabilities. Represents *where* things run and *what the environment can do*. Exposes capabilities as traits (`DnsServer`, `K8sclient`, `HelmCommand`, `LoadBalancer`, `Firewall`, etc.). Examples: `K8sAnywhereTopology` (local K3D or any K8s cluster), `HAClusterTopology` (bare-metal HA with redundant firewalls/switches).
**Interpret** — execution glue. Translates a Score into concrete operations against a Topology's capabilities. Returns an `Outcome` (SUCCESS, NOOP, FAILURE, RUNNING, QUEUED, BLOCKED).
**The key insight — compile-time safety through trait bounds:**
```rust
impl<T: Topology + DnsServer + DhcpServer> Score<T> for DnsScore { ... }
```
The compiler rejects any attempt to use `DnsScore` with a Topology that doesn't implement `DnsServer` and `DhcpServer`. Invalid infrastructure configurations become compilation errors, not runtime surprises.
**Higher-order topologies** compose transparently:
- `FailoverTopology<T>` — primary/replica orchestration
- `DecentralizedTopology<T>` — multi-site coordination
If `T: PostgreSQL`, then `FailoverTopology<T>: PostgreSQL` automatically via blanket impls. Zero boilerplate.
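A minimal, self-contained sketch of that blanket impl, with illustrative trait and field names rather than Harmony's exact API:

```rust
// Capability trait: the industry concept, not a tool.
trait PostgreSQL {
    fn connection_url(&self) -> String;
}

// A concrete topology that provides the capability.
struct K8sTopology;
impl PostgreSQL for K8sTopology {
    fn connection_url(&self) -> String {
        "postgres://primary.cluster.local:5432".into()
    }
}

// Higher-order topology wrapping a primary and a replica.
struct FailoverTopology<T> {
    primary: T,
    _replica: T,
}

// The blanket impl: wrapping any PostgreSQL-capable topology yields a
// PostgreSQL-capable topology, with zero per-type boilerplate.
impl<T: PostgreSQL> PostgreSQL for FailoverTopology<T> {
    fn connection_url(&self) -> String {
        self.primary.connection_url()
    }
}

fn main() {
    let topo = FailoverTopology { primary: K8sTopology, _replica: K8sTopology };
    // Scores bounded on `T: PostgreSQL` accept the wrapped topology unchanged.
    assert!(topo.connection_url().starts_with("postgres://"));
    println!("{}", topo.connection_url());
}
```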
## Architecture (Hexagonal)
```
harmony/src/
├── domain/ # Core domain — the heart of the framework
│ ├── score.rs # Score trait (desired state)
│ ├── topology/ # Topology trait + implementations
│ ├── interpret/ # Interpret trait + InterpretName enum (25+ variants)
│ ├── inventory/ # Physical infrastructure metadata (hosts, switches, mgmt interfaces)
│ ├── executors/ # Executor trait definitions
│ └── maestro/ # Orchestration engine (registers scores, manages topology state, executes)
├── infra/ # Infrastructure adapters (driven ports)
│ ├── opnsense/ # OPNsense firewall adapter
│ ├── brocade.rs # Brocade switch adapter
│ ├── kube.rs # Kubernetes executor
│ └── sqlx.rs # Database executor
└── modules/ # Concrete deployment modules (23+)
├── k8s/ # Kubernetes (namespaces, deployments, ingress)
├── postgresql/ # CloudNativePG clusters + multi-site failover
├── okd/ # OpenShift bare-metal from scratch
├── helm/ # Helm chart inflation → vanilla K8s YAML
├── opnsense/ # OPNsense (DHCP, DNS, etc.)
├── monitoring/ # Prometheus, Alertmanager, Grafana
├── kvm/ # KVM virtual machine management
├── network/ # Network services (iPXE, TFTP, bonds)
└── ...
```
Domain types to know: `Inventory` (read-only physical infra context), `Maestro<T>` (orchestrator — calls `topology.ensure_ready()` then executes scores), `Outcome` / `InterpretError` (execution results).
## Key Crates
| Crate | Purpose |
|---|---|
| `harmony` | Core framework: domain, infra adapters, deployment modules |
| `harmony_cli` | CLI + optional TUI (`--features tui`) |
| `harmony_config` | Unified config+secret management (env → SQLite → OpenBao → interactive prompt) |
| `harmony_secret` / `harmony_secret_derive` | Secret backends (LocalFile, OpenBao, Infisical) |
| `harmony_execution` | Execution engine |
| `harmony_agent` / `harmony_inventory_agent` | Persistent agent framework (NATS JetStream mesh), hardware discovery |
| `harmony_assets` | Asset management (URLs, local cache, S3) |
| `harmony_composer` | Infrastructure composition tool |
| `harmony-k8s` | Kubernetes utilities |
| `k3d` | Local K3D cluster management |
| `brocade` | Brocade network switch integration |
## OPNsense Crates
The `opnsense-codegen` and `opnsense-api` crates exist because OPNsense's automation ecosystem is poor — no typed API client exists. These are support crates, not the core of Harmony.
- `opnsense-codegen`: XML model files → IR → Rust structs with serde helpers for OPNsense wire format quirks (`opn_bool` for "0"/"1" strings, `opn_u16`/`opn_u32` for string-encoded numbers). Vendor sources are git submodules under `opnsense-codegen/vendor/`.
- `opnsense-api`: Hand-written `OpnsenseClient` + generated model types in `src/generated/`.
## Key Design Decisions (ADRs in docs/adr/)
- **ADR-001**: Rust chosen for type system, refactoring safety, and performance
- **ADR-002**: Hexagonal architecture — domain isolated from adapters
- **ADR-003**: Infrastructure abstractions at domain level, not provider level (no vendor lock-in)
- **ADR-005**: Custom Rust DSL over YAML/Score-spec — real language, Cargo deps, composable
- **ADR-007**: K3D as default runtime (K8s-certified, lightweight, cross-platform)
- **ADR-009**: Helm charts inflated to vanilla K8s YAML, then deployed via existing code paths
- **ADR-015**: Higher-order topologies via blanket trait impls (zero-cost composition)
- **ADR-016**: Agent-based architecture with NATS JetStream for real-time failover and distributed consensus
- **ADR-020**: Unified config+secret management — Rust struct is the schema, resolution chain: env → store → prompt
## Capability and Score Design Rules
**Capabilities are industry concepts, not tools.** A capability trait represents a standard infrastructure need (e.g., `DnsServer`, `LoadBalancer`, `Router`, `CertificateManagement`) that can be fulfilled by different products. OPNsense provides `DnsServer` today; CoreDNS or Route53 could provide it tomorrow. Scores must not break when the backend changes.
**Exception:** When the developer fundamentally needs to know the implementation. `PostgreSQL` is a capability (not `Database`) because the developer writes PostgreSQL-specific SQL and replication configs. Swapping to MariaDB would break the application, not just the infrastructure.
**Test:** If you could swap the underlying tool without rewriting any Score that uses the capability, the boundary is correct.
**Don't name capabilities after tools.** `SecretVault` not `OpenbaoStore`. `IdentityProvider` not `ZitadelAuth`. Think: what is the core developer need that leads to using this tool?
**Scores encapsulate operational complexity.** Move procedural knowledge (init sequences, retry logic, distribution-specific config) into Scores. A high-level example should be ~15 lines, not ~400 lines of imperative orchestration.
**Scores must be idempotent.** Running twice = same result as once. Use create-or-update, handle "already exists" gracefully.
**Scores must not depend on execution order.** Declare capability requirements via trait bounds, don't assume another Score ran first. If Score B needs what Score A provides, Score B should declare that capability as a trait bound.
See `docs/guides/writing-a-score.md` for the full guide.
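The idempotency rule above amounts to create-or-update. A sketch against an in-memory map standing in for the cluster API (the `ensure_namespace` helper is illustrative):

```rust
use std::collections::HashMap;

// Stand-in for the cluster state a Score converges toward.
type Cluster = HashMap<String, String>;

// Create-or-update: running twice leaves the same state as running once.
fn ensure_namespace(cluster: &mut Cluster, name: &str, labels: &str) {
    cluster
        .entry(name.to_string())
        .and_modify(|l| *l = labels.to_string()) // already exists: update
        .or_insert_with(|| labels.to_string()); // missing: create
}

fn main() {
    let mut cluster = Cluster::new();
    ensure_namespace(&mut cluster, "tenant-a", "team=alpha");
    let first = cluster.clone();
    ensure_namespace(&mut cluster, "tenant-a", "team=alpha"); // re-run is a no-op
    assert_eq!(cluster, first);
    println!("idempotent: {} namespace(s)", cluster.len());
}
```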
## Conventions
- **Rust edition 2024**, resolver v2
- **Conventional commits**: `feat:`, `fix:`, `chore:`, `docs:`, `refactor:`
- **Small PRs**: max ~200 lines (excluding generated code), single-purpose
- **License**: GNU AGPL v3
- **Quality bar**: This framework demands high-quality engineering. The type system is a feature, not a burden. Leverage it. Prefer compile-time guarantees over runtime checks. Abstractions should be domain-level, not provider-specific.

1351
Cargo.lock generated

File diff suppressed because it is too large


@@ -1,8 +1,8 @@
[workspace]
resolver = "2"
members = [
"private_repos/*",
"examples/*",
"private_repos/*",
"harmony",
"harmony_types",
"harmony_macros",
@@ -16,13 +16,17 @@ members = [
"harmony_inventory_agent",
"harmony_secret_derive",
"harmony_secret",
"adr/agent_discovery/mdns",
"network_stress_test",
"examples/kvm_okd_ha_cluster",
"examples/example_linux_vm",
"harmony_config_derive",
"harmony_config",
"brocade",
"harmony_agent",
"harmony_agent/deploy",
"harmony_node_readiness",
"harmony-k8s",
"harmony_e2e_tests",
"harmony_assets", "opnsense-codegen", "opnsense-api",
]
[workspace.package]
@@ -37,8 +41,10 @@ derive-new = "0.7"
async-trait = "0.1"
tokio = { version = "1.40", features = [
"io-std",
"io-util",
"fs",
"macros",
"net",
"rt-multi-thread",
] }
tokio-retry = "0.3.0"
@@ -73,6 +79,7 @@ base64 = "0.22.1"
tar = "0.4.44"
lazy_static = "1.5.0"
directories = "6.0.0"
futures-util = "0.3"
thiserror = "2.0.14"
serde = { version = "1.0.209", features = ["derive", "rc"] }
serde_json = "1.0.127"
@@ -86,3 +93,6 @@ reqwest = { version = "0.12", features = [
"json",
], default-features = false }
assertor = "0.0.4"
tokio-test = "0.4"
anyhow = "1.0"
clap = { version = "4", features = ["derive"] }

272
README.md

@@ -1,101 +1,121 @@
# Harmony
Open-source infrastructure orchestration that treats your platform like first-class code.
**Infrastructure orchestration that treats your platform like first-class code.**
In other words, Harmony is a **next-generation platform engineering framework**.
Harmony is an open-source framework that brings the rigor of software engineering to infrastructure management. Write Rust code to define what you want, and Harmony handles the rest — from local development to production clusters.
_By [NationTech](https://nationtech.io)_
[![Build](https://git.nationtech.io/NationTech/harmony/actions/workflows/check.yml/badge.svg)](https://git.nationtech.io/nationtech/harmony)
[![Build](https://git.nationtech.io/NationTech/harmony/actions/workflows/check.yml/badge.svg)](https://git.nationtech.io/NationTech/harmony)
[![License](https://img.shields.io/badge/license-AGPLv3-blue?style=flat-square)](LICENSE)
### Unify
---
- **Project Scaffolding**
- **Infrastructure Provisioning**
- **Application Deployment**
- **Day-2 operations**
## The Problem Harmony Solves
All in **one strongly-typed Rust codebase**.
Modern infrastructure is messy. Your Kubernetes cluster needs monitoring. Your bare-metal servers need provisioning. Your applications need deployments. Each comes with its own tooling, its own configuration format, and its own failure modes.
### Deploy anywhere
**What if you could describe your entire platform in one consistent language?**
From a **developer laptop** to a **global production cluster**, a single **source of truth** drives the **full software lifecycle.**
That's Harmony. It unifies project scaffolding, infrastructure provisioning, application deployment, and day-2 operations into a single strongly-typed Rust codebase.
## The Harmony Philosophy
---
Infrastructure is essential, but it shouldn't be your core business. Harmony is built on three guiding principles that make modern platforms reliable, repeatable, and easy to reason about.
## Three Principles That Make the Difference
| Principle | What it means for you |
| -------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **Infrastructure as Resilient Code** | Replace sprawling YAML and bash scripts with type-safe Rust. Test, refactor, and version your platform just like application code. |
| **Prove It Works Before You Deploy** | Harmony uses the compiler to verify that your application's needs match the target environment's capabilities at **compile-time**, eliminating an entire class of runtime outages. |
| **One Unified Model** | Software and infrastructure are a single system. Harmony models them together, enabling deep automation—from bare-metal servers to Kubernetes workloads—with zero context switching. |
| Principle | What It Means |
|-----------|---------------|
| **Infrastructure as Resilient Code** | Stop fighting with YAML and bash. Write type-safe Rust that you can test, version, and refactor like any other code. |
| **Prove It Works Before You Deploy** | Harmony verifies at _compile time_ that your application can actually run on your target infrastructure. No more "the config looks right but it doesn't work" surprises. |
| **One Unified Model** | Software and infrastructure are one system. Deploy from laptop to production cluster without switching contexts or tools. |
These principles surface as simple, ergonomic Rust APIs that let teams focus on their product while trusting the platform underneath.
---
## Where to Start
## How It Works: The Core Concepts
We have a comprehensive set of documentation right here in the repository.
Harmony is built around three concepts that work together:
| I want to... | Start Here |
| ----------------- | ------------------------------------------------------------------ |
| Get Started | [Getting Started Guide](./docs/guides/getting-started.md) |
| See an Example | [Use Case: Deploy a Rust Web App](./docs/use-cases/rust-webapp.md) |
| Explore | [Documentation Hub](./docs/README.md) |
| See Core Concepts | [Core Concepts Explained](./docs/concepts.md) |
### Score — "What You Want"
## Quick Look: Deploy a Rust Webapp
A `Score` is a declarative description of desired state. Think of it as a "recipe" that says _what_ you want without specifying _how_ to get there.
The snippet below spins up a complete **production-grade Rust + Leptos Webapp** with monitoring. Swap it for your own scores to deploy anything from microservices to machine-learning pipelines.
```rust
// "I want a PostgreSQL cluster running with default settings"
let postgres = PostgreSQLScore {
config: PostgreSQLConfig {
cluster_name: "harmony-postgres-example".to_string(),
namespace: "harmony-postgres-example".to_string(),
..Default::default()
},
};
```
### Topology — "Where It Goes"
A `Topology` represents your infrastructure environment and its capabilities. It answers the question: "What can this environment actually do?"
```rust
// Deploy to a local K3D cluster, or any Kubernetes cluster via environment variables
K8sAnywhereTopology::from_env()
```
### Interpret — "How It Happens"
An `Interpret` is the execution logic that connects your `Score` to your `Topology`. It translates "what you want" into "what the infrastructure does."
**The Compile-Time Check:** Before your code ever runs, Harmony verifies that your `Score` is compatible with your `Topology`. If your application needs a feature your infrastructure doesn't provide, you get a compile error — not a runtime failure.
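The mechanism behind this check can be sketched with plain Rust trait bounds. This is an illustrative, self-contained example (the trait and type names are hypothetical, not Harmony's real API): a deploy function is only callable for topologies that implement the required capability, so an incompatible pairing is rejected by the compiler.

```rust
// Hypothetical capability trait: what a Score requires from its environment.
trait PostgreSQLCapability {
    fn provision(&self, cluster_name: &str) -> String;
}

// A topology that provides the capability.
struct K3dTopology;

impl PostgreSQLCapability for K3dTopology {
    fn provision(&self, cluster_name: &str) -> String {
        format!("provisioned {cluster_name} on k3d")
    }
}

// Only compiles for topologies that implement PostgreSQLCapability.
fn deploy_postgres<T: PostgreSQLCapability>(topology: &T, name: &str) -> String {
    topology.provision(name)
}

fn main() {
    let result = deploy_postgres(&K3dTopology, "harmony-postgres-example");
    println!("{result}");
    // Passing a topology without the trait would be a compile error, not a runtime failure.
}
```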
---
## What You Can Deploy
Harmony ships with ready-made Scores for:
**Data Services**
- PostgreSQL clusters (via CloudNativePG operator)
- Multi-site PostgreSQL with failover
**Kubernetes**
- Namespaces, Deployments, Ingress
- Helm charts
- cert-manager for TLS
- Monitoring (Prometheus, alerting, ntfy)
**Bare Metal / Infrastructure**
- OKD clusters from scratch
- OPNsense firewalls
- Network services (DNS, DHCP, TFTP)
- Brocade switch configuration
**And more:** application deployment, tenant management, load balancing, among others.
---
## Quick Start: Deploy a PostgreSQL Cluster
This example provisions a local Kubernetes cluster (K3D) and deploys a PostgreSQL cluster on it — no external infrastructure required.
```rust
use harmony::{
    inventory::Inventory,
    modules::postgresql::{PostgreSQLScore, capability::PostgreSQLConfig},
    topology::K8sAnywhereTopology,
};
#[tokio::main]
async fn main() {
    // Define the PostgreSQL cluster you want
    let postgres = PostgreSQLScore {
        config: PostgreSQLConfig {
            cluster_name: "harmony-postgres-example".to_string(),
            namespace: "harmony-postgres-example".to_string(),
            ..Default::default()
        },
    };

    harmony_cli::run(
        Inventory::autoload(),
        K8sAnywhereTopology::from_env(), // <== Provisions a local k3d cluster by default, or connects to any Kubernetes cluster
        vec![Box::new(postgres)],
        None,
    )
    .await
    .unwrap();
}
```
### What this actually does
When you compile and run this program:
1. **Compiles** the Harmony Score into an executable
2. **Connects** to `K8sAnywhereTopology` — which auto-provisions a local K3D cluster if none exists
3. **Installs** the CloudNativePG operator into the cluster (one-time setup)
4. **Creates** a PostgreSQL cluster with 1 instance and 1 GiB of storage
5. **Exposes** the PostgreSQL instance as a Kubernetes Service
### Prerequisites
- [Rust](https://rust-lang.org/tools/install) (edition 2024)
- [Docker](https://docs.docker.com/get-docker/) (for the local K3D cluster)
- [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/) (optional, for inspecting the cluster)
### Run it
```bash
# Clone the repository
git clone https://git.nationtech.io/nationtech/harmony
cd harmony
# Build the project
cargo build --release
# Run the example
cargo run -p example-postgresql
```
Harmony will print its progress as it sets up the cluster and deploys PostgreSQL. When complete, you can inspect the deployment:
```bash
kubectl get pods -n harmony-postgres-example
kubectl get secret -n harmony-postgres-example harmony-postgres-example-db-user -o jsonpath='{.data.password}' | base64 -d
```
To connect to the database, forward the port:
```bash
kubectl port-forward -n harmony-postgres-example svc/harmony-postgres-example-rw 5432:5432
psql -h localhost -p 5432 -U postgres
```
To clean up, delete the K3D cluster:
```bash
k3d cluster delete harmony-postgres-example
```
---
## Environment Variables
`K8sAnywhereTopology::from_env()` reads the following environment variables to determine where and how to connect:
| Variable | Default | Description |
|----------|---------|-------------|
| `KUBECONFIG` | `~/.kube/config` | Path to your kubeconfig file |
| `HARMONY_AUTOINSTALL` | `true` | Auto-provision a local K3D cluster if none found |
| `HARMONY_USE_LOCAL_K3D` | `true` | Always prefer local K3D over remote clusters |
| `HARMONY_PROFILE` | `dev` | Deployment profile: `dev`, `staging`, or `prod` |
| `HARMONY_K8S_CONTEXT` | _none_ | Use a specific kubeconfig context |
| `HARMONY_PUBLIC_DOMAIN` | _none_ | Public domain for ingress endpoints |
To connect to an existing Kubernetes cluster instead of provisioning K3D:
```bash
# Point to your kubeconfig
export KUBECONFIG=/path/to/your/kubeconfig
export HARMONY_USE_LOCAL_K3D=false
export HARMONY_AUTOINSTALL=false
# Then run
cargo run -p example-postgresql
```
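The boolean variables above follow a common fall-back pattern: unset or unparsable values use the built-in default. A minimal std-only sketch of that pattern (illustrative, not Harmony's actual `from_env()` implementation):

```rust
// Parse a boolean env var, falling back to `default` when unset or malformed.
fn env_flag(name: &str, default: bool) -> bool {
    std::env::var(name)
        .ok()
        .and_then(|v| v.parse::<bool>().ok())
        .unwrap_or(default)
}

fn main() {
    // Both default to true, matching the table above.
    let use_local_k3d = env_flag("HARMONY_USE_LOCAL_K3D", true);
    let autoinstall = env_flag("HARMONY_AUTOINSTALL", true);
    println!("use_local_k3d={use_local_k3d} autoinstall={autoinstall}");
}
```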
---
## Documentation
All documentation is in the `/docs` directory.
| I want to... | Start here |
|--------------|------------|
| Understand the core concepts | [Core Concepts](./docs/concepts.md) |
| Deploy my first application | [Getting Started Guide](./docs/guides/getting-started.md) |
| Explore available components | [Scores Catalog](./docs/catalogs/scores.md) · [Topologies Catalog](./docs/catalogs/topologies.md) |
| See a complete bare-metal deployment | [OKD on Bare Metal](./docs/use-cases/okd-on-bare-metal.md) |
| Build my own Score or Topology | [Developer Guide](./docs/guides/developer-guide.md) |
For everything else, start at the [Documentation Hub](./docs/README.md).
---
## Why Rust?

We chose Rust for the same reason you might: **reliability through type safety**.

Infrastructure code runs in production. It needs to be correct. Rust's ownership model and type system let us build a framework where:

- Invalid configurations fail at compile time, not at 3 AM
- Refactoring infrastructure is as safe as refactoring application code
- The compiler verifies that your platform can actually fulfill your requirements

See [ADR-001 · Why Rust](./adr/001-rust.md) for our full rationale.

---

## Contribute

Discussions and roadmap live in [Issues](https://git.nationtech.io/nationtech/harmony/-/issues). PRs, ideas, and feedback are welcome!
---
## Architecture Decisions
Harmony's design is documented through Architecture Decision Records (ADRs):
- [ADR-001 · Why Rust](./adr/001-rust.md)
- [ADR-003 · Infrastructure Abstractions](./adr/003-infrastructure-abstractions.md)
- [ADR-006 · Secret Management](./adr/006-secret-management.md)
- [ADR-011 · Multi-Tenant Cluster](./adr/011-multi-tenant-cluster.md)
---
## License
Harmony is released under the **GNU AGPL v3**.
> We choose a strong copyleft license to ensure the project—and every improvement to it—remains open and benefits the entire community. Fork it, enhance it, even out-innovate us; just keep it open.
See [LICENSE](LICENSE) for the full text.
---
_Made with ❤️ & 🦀 by NationTech and the Harmony community_

# Harmony Roadmap
Nine phases to take Harmony from working prototype to production-ready open-source project.
| # | Phase | Status | Depends On | Detail |
|---|-------|--------|------------|--------|
| 1 | [Harden `harmony_config`](ROADMAP/01-config-crate.md) | Not started | — | Test every source, add SQLite backend, wire Zitadel + OpenBao, validate zero-setup UX |
| 2 | [Migrate to `harmony_config`](ROADMAP/02-refactor-harmony-config.md) | Not started | 1 | Replace all 19 `SecretManager` call sites, deprecate direct `harmony_secret` usage |
| 3 | [Complete `harmony_assets`](ROADMAP/03-assets-crate.md) | Not started | 1, 2 | Test, refactor k3d and OKD to use it, implement `Url::Url`, remove LFS |
| 4 | [Publish to GitHub](ROADMAP/04-publish-github.md) | Not started | 3 | Clean history, set up GitHub as community hub, CI on self-hosted runners |
| 5 | [E2E tests: PostgreSQL & RustFS](ROADMAP/05-e2e-tests-simple.md) | Not started | 1 | k3d-based test harness, two passing E2E tests, CI job |
| 6 | [E2E tests: OKD HA on KVM](ROADMAP/06-e2e-tests-kvm.md) | Not started | 5 | KVM test infrastructure, full OKD installation test, nightly CI |
| 7 | [OPNsense & Bare-Metal Network Automation](ROADMAP/07-opnsense-bare-metal.md) | **In progress** | — | Full OPNsense API coverage, Brocade switch integration, HA cluster network provisioning |
| 8 | [HA OKD Production Deployment](ROADMAP/08-ha-okd-production.md) | Not started | 7 | LAGG/CARP/multi-WAN/BINAT cluster with UpdateHostScore, end-to-end bare-metal automation |
| 9 | [SSO + Config Hardening](ROADMAP/09-sso-config-hardening.md) | **In progress** | 1 | Builder pattern for OpenbaoSecretStore, ZitadelScore PG fix, CoreDNSRewriteScore, integration tests |
## Current State (as of branch `feat/opnsense-codegen`)
- `harmony_config` crate exists with `EnvSource`, `LocalFileSource`, `PromptSource`, `StoreSource`. 12 unit tests. **Zero consumers** in workspace — everything still uses `harmony_secret::SecretManager` directly (19 call sites).
- `harmony_assets` crate exists with `Asset`, `LocalCache`, `LocalStore`, `S3Store`. **No tests. Zero consumers.** The `k3d` crate has its own `DownloadableAsset` with identical functionality and full test coverage.
- `harmony_secret` has `LocalFileSecretStore`, `OpenbaoSecretStore` (token/userpass/OIDC device flow + JWT exchange), `InfisicalSecretStore`. Zitadel OIDC integration **implemented** with session caching.
- **SSO example** (`examples/harmony_sso/`): deploys Zitadel + OpenBao on k3d, provisions identity resources, authenticates via device flow, stores config in OpenBao. `OpenbaoSetupScore` and `ZitadelSetupScore` encapsulate day-two operations.
- KVM module exists on this branch with `KvmExecutor`, VM lifecycle, ISO download, two examples (`example_linux_vm`, `kvm_okd_ha_cluster`).
- RustFS module exists on `feat/rustfs` branch (2 commits ahead of master).
- 39 example crates, **zero E2E tests**. Unit tests pass across workspace (~240 tests).
- CI runs `cargo check`, `fmt`, `clippy`, `test` on Gitea. No E2E job.
### OPNsense & Bare-Metal (as of branch `feat/opnsense-codegen`)
- **9 OPNsense Scores** implemented: VlanScore, LaggScore, VipScore, DnatScore, FirewallRuleScore, OutboundNatScore, BinatScore, NodeExporterScore, OPNsenseShellCommandScore. All tested against a 4-NIC VM.
- **opnsense-codegen** pipeline operational: XML → IR → typed Rust structs with serde helpers. 11 generated API modules (26.5K lines).
- **opnsense-config** has 13 modules: DHCP (dnsmasq), DNS, firewall, LAGG, VIP, VLAN, load balancer (HAProxy), Caddy, TFTP, node exporter, and legacy DHCP.
- **Brocade switch integration** on `feat/brocade-client-add-vlans`: full VLAN CRUD, interface speed config, port-channel management, new `BrocadeSwitchConfigurationScore`. Breaking API changes (InterfaceConfig replaces tuples).
- **Missing for production**: `UpdateHostScore` (update MAC in DHCP for PXE boot + host network setup for LAGG LACP 802.3ad), `HostNetworkConfigurationScore` needs rework for LAGG/LACP (currently only creates bonds, doesn't configure LAGG on OPNsense side), brocade branch needs merge and API adaptation in `harmony/src/infra/brocade.rs`.
## Guiding Principles
- **Zero-setup first**: A new user clones, runs `cargo run`, gets prompted for config, values persist to local SQLite. No env vars, no external services required.
- **Progressive disclosure**: Local SQLite → OpenBao → Zitadel SSO. Each layer is opt-in.
- **Test what ships**: Every example that works should have an E2E test proving it works.
- **Community over infrastructure**: GitHub for engagement, self-hosted runners for CI.

# Phase 1: Harden `harmony_config`, Validate UX, Zero-Setup Starting Point
## Goal
Make `harmony_config` production-ready with a seamless first-run experience: clone, run, get prompted, values persist locally. Then progressively add team-scale backends (OpenBao, Zitadel SSO) without changing any calling code.
## Current State
`harmony_config` now has:
- `Config` trait + `#[derive(Config)]` macro
- `ConfigManager` with ordered source chain
- Five `ConfigSource` implementations:
- `EnvSource` — reads `HARMONY_CONFIG_{KEY}` env vars
- `LocalFileSource` — reads/writes `{key}.json` files from a directory
- `SqliteSource` — **NEW** — reads/writes to a SQLite database
- `PromptSource` — returns `None` / no-op on set (placeholder for TUI integration)
- `StoreSource<S: SecretStore>` — wraps any `harmony_secret::SecretStore` backend
- 26 unit tests (mock source, env, local file, sqlite, prompt, integration, store graceful fallback)
- Global `CONFIG_MANAGER` static with `init()`, `get()`, `get_or_prompt()`, `set()`
- Two examples: `basic` and `prompting` in `harmony_config/examples/`
- **Zero workspace consumers** — nothing calls `harmony_config` yet
## Tasks
### 1.1 Add `SqliteSource` as the default zero-setup backend ✅
**Status**: Implemented
**Implementation Details**:
- Database location: `~/.local/share/harmony/config/config.db` (directory is auto-created)
- Schema: `config(key TEXT PRIMARY KEY, value TEXT NOT NULL, updated_at TEXT NOT NULL DEFAULT (datetime('now')))`
- Uses `sqlx` with SQLite runtime
- `SqliteSource::open(path)` - opens/creates database at given path
- `SqliteSource::default()` - uses default Harmony data directory
**Files**:
- `harmony_config/src/source/sqlite.rs` - new file
- `harmony_config/Cargo.toml` - added `sqlx = { workspace = true, features = ["runtime-tokio", "sqlite"] }`
- `Cargo.toml` - added `anyhow = "1.0"` to workspace dependencies
**Tests** (all passing):
- `test_sqlite_set_and_get` — round-trip a `TestConfig` struct
- `test_sqlite_get_returns_none_when_missing` — key not in DB
- `test_sqlite_overwrites_on_set` — set twice, get returns latest
- `test_sqlite_concurrent_access` — two tasks writing different keys simultaneously
### 1.1.1 Add Config example to show exact DX and confirm functionality ✅
**Status**: Implemented
**Examples created**:
1. `harmony_config/examples/basic.rs` - demonstrates:
- Zero-setup SQLite backend (auto-creates directory)
- Using the `#[derive(Config)]` macro
- Environment variable override (`HARMONY_CONFIG_TestConfig` overrides SQLite)
- Direct set/get operations
- Persistence verification
2. `harmony_config/examples/prompting.rs` - demonstrates:
- Config with no defaults (requires user input via `inquire`)
- `get()` flow: env > sqlite > prompt fallback
- `get_or_prompt()` for interactive configuration
- Full resolution chain
- Persistence of prompted values
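The resolution chain demonstrated by these examples boils down to "first source with a value wins". A minimal, self-contained sketch of that chain (illustrative names and a static stand-in for `SqliteSource`, not `harmony_config`'s real API):

```rust
// A source either has a value for a key or it doesn't.
trait ConfigSource {
    fn get(&self, key: &str) -> Option<String>;
}

// Highest priority: HARMONY_CONFIG_{KEY} environment variables.
struct EnvSource;
impl ConfigSource for EnvSource {
    fn get(&self, key: &str) -> Option<String> {
        std::env::var(format!("HARMONY_CONFIG_{key}")).ok()
    }
}

// Stand-in for SqliteSource: a fixed (key, value) pair.
struct StaticSource(&'static str, &'static str);
impl ConfigSource for StaticSource {
    fn get(&self, key: &str) -> Option<String> {
        if key == self.0 { Some(self.1.to_string()) } else { None }
    }
}

// Walk the ordered chain; the first Some wins.
fn resolve(sources: &[Box<dyn ConfigSource>], key: &str) -> Option<String> {
    sources.iter().find_map(|s| s.get(key))
}

fn main() {
    let chain: Vec<Box<dyn ConfigSource>> = vec![
        Box::new(EnvSource),
        Box::new(StaticSource("TestConfig", "from-sqlite")),
    ];
    // With no env var set, resolution falls through to the second source.
    println!("{:?}", resolve(&chain, "TestConfig"));
}
```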
### 1.2 Make `PromptSource` functional ✅
**Status**: Implemented with design improvement
**Key Finding - Bug Fixed During Implementation**:
The original design had a critical bug in `get_or_prompt()`:
```rust
// OLD (BUGGY) - breaks on first source where set() returns Ok(())
for source in &self.sources {
if source.set(T::KEY, &value).await.is_ok() {
break;
}
}
```
Since `EnvSource.set()` returns `Ok(())` (successfully sets env var), the loop would break immediately and never write to `SqliteSource`. Prompted values were never persisted!
**Solution - Added `should_persist()` method to ConfigSource trait**:
```rust
#[async_trait]
pub trait ConfigSource: Send + Sync {
async fn get(&self, key: &str) -> Result<Option<serde_json::Value>, ConfigError>;
async fn set(&self, key: &str, value: &serde_json::Value) -> Result<(), ConfigError>;
fn should_persist(&self) -> bool {
true
}
}
```
- `EnvSource::should_persist()` returns `false` - shouldn't persist prompted values to env vars
- `PromptSource::should_persist()` returns `false` - doesn't persist anyway
- `get_or_prompt()` now skips sources where `should_persist()` is `false`
**Updated `get_or_prompt()`**:
```rust
for source in &self.sources {
if !source.should_persist() {
continue;
}
if source.set(T::KEY, &value).await.is_ok() {
break;
}
}
```
**Tests**:
- `test_prompt_source_always_returns_none`
- `test_prompt_source_set_is_noop`
- `test_prompt_source_does_not_persist`
- `test_full_chain_with_prompt_source_falls_through_to_prompt`
### 1.3 Integration test: full resolution chain ✅
**Status**: Implemented
**Tests**:
- `test_full_resolution_chain_sqlite_fallback` — env not set, sqlite has value, get() returns sqlite
- `test_full_resolution_chain_env_overrides_sqlite` — env set, sqlite has value, get() returns env
- `test_branch_switching_scenario_deserialization_error` — old struct shape in sqlite returns Deserialization error
### 1.4 Validate Zitadel + OpenBao integration path ⏳
**Status**: Planning phase - detailed execution plan below
**Background**: ADR 020-1 documents the target architecture for Zitadel OIDC + OpenBao integration. This task validates the full chain by deploying Zitadel and OpenBao on a local k3d cluster and demonstrating an end-to-end example.
**Architecture Overview**:
```
┌─────────────────────────────────────────────────────────────────────┐
│ Harmony CLI / App │
│ │
│ ConfigManager: │
│ 1. EnvSource ← HARMONY_CONFIG_* env vars (highest priority) │
│ 2. SqliteSource ← ~/.local/share/harmony/config/config.db │
│ 3. StoreSource ← OpenBao (team-scale, via Zitadel OIDC) │
│ │
│ When StoreSource fails (OpenBao unreachable): │
│ → returns Ok(None), chain falls through to SqliteSource │
└─────────────────────────────────────────────────────────────────────┘
┌──────────────────┐ ┌──────────────────┐
│ Zitadel │ │ OpenBao │
│ (IdP + OIDC) │ │ (Secret Store) │
│ │ │ │
│ Device Auth │────JWT──▶│ JWT Auth │
│ Flow (RFC 8628)│ │ Method │
└──────────────────┘ └──────────────────┘
```
**Prerequisites**:
- Docker running (for k3d)
- Rust toolchain (edition 2024)
- Network access to download Helm charts
- `kubectl` (installed automatically with k3d, or pre-installed)
**Step-by-Step Execution Plan**:
#### Step 1: Create k3d cluster for local development
When you run `cargo run -p example-zitadel` (or any example using `K8sAnywhereTopology::from_env()`), Harmony automatically provisions a k3d cluster if one does not exist. By default:
- `use_local_k3d = true` (env: `HARMONY_USE_LOCAL_K3D`, default `true`)
- `autoinstall = true` (env: `HARMONY_AUTOINSTALL`, default `true`)
- Cluster name: **`harmony`** (hardcoded in `K3DInstallationScore::default()`)
- k3d binary is downloaded to `~/.local/share/harmony/k3d/`
- Kubeconfig is merged into `~/.kube/config`, context set to `k3d-harmony`
No manual `k3d cluster create` is needed. If you want to create the cluster manually first:
```bash
# Install k3d (requires sudo or install to user path)
curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
# Create the cluster with the same name Harmony expects
k3d cluster create harmony
kubectl cluster-info --context k3d-harmony
```
**Validation**: `kubectl get nodes --context k3d-harmony` shows 1 server node (k3d default)
**Note**: The existing examples use hardcoded external hostnames (e.g., `sso.sto1.nationtech.io`) for ingress. On a local k3d cluster, these hostnames are not routable. For local development you must either:
- Use `kubectl port-forward` to access services directly
- Configure `/etc/hosts` entries pointing to `127.0.0.1`
- Use a k3d loadbalancer with `--port` mappings
#### Step 2: Deploy Zitadel
Zitadel requires the topology to implement `Topology + K8sclient + HelmCommand + PostgreSQL`. The `K8sAnywhereTopology` satisfies all four.
```bash
cargo run -p example-zitadel
```
**What happens internally** (see `harmony/src/modules/zitadel/mod.rs`):
1. Creates `zitadel` namespace via `K8sResourceScore`
2. Deploys a CNPG PostgreSQL cluster:
- Name: `zitadel-pg`
- Instances: **2** (not 1)
- Storage: 10Gi
- Namespace: `zitadel`
3. Resolves the internal DB endpoint (`host:port`) from the CNPG cluster
4. Generates a 32-byte alphanumeric masterkey, stores it as Kubernetes Secret `zitadel-masterkey` (idempotent: skips if it already exists)
5. Generates a 16-char admin password (guaranteed 1+ uppercase, lowercase, digit, symbol)
6. Deploys Zitadel Helm chart (`zitadel/zitadel` from `https://charts.zitadel.com`):
- `chart_version: None` -- **uses latest chart version** (not pinned)
- No `--wait` flag -- returns before pods are ready
- Ingress annotations are **OpenShift-oriented** (`route.openshift.io/termination: edge`, `cert-manager.io/cluster-issuer: letsencrypt-prod`). On k3d these annotations are silently ignored.
- Ingress includes TLS config with `secretName: "{host}-tls"`, which requires cert-manager. Without cert-manager, TLS termination does not happen at the ingress level.
**Key Helm values set by ZitadelScore**:
- `zitadel.configmapConfig.ExternalDomain`: the `host` field (e.g., `sso.sto1.nationtech.io`)
- `zitadel.configmapConfig.ExternalSecure: true`
- `zitadel.configmapConfig.TLS.Enabled: false` (TLS at ingress, not in Zitadel)
- Admin user: `UserName: "admin"`, Email: **`admin@zitadel.example.com`** (hardcoded, not derived from host)
- Database credentials: injected via `env[].valueFrom.secretKeyRef` from secret `zitadel-pg-superuser` (both user and admin use the same superuser -- there is a TODO to fix this)
**Expected output**:
```
===== ZITADEL DEPLOYMENT COMPLETE =====
Login URL: https://sso.sto1.nationtech.io
Username: admin@zitadel.sso.sto1.nationtech.io
Password: <generated 16-char password>
```
**Note on the success message**: The printed username `admin@zitadel.{host}` does not match the actual configured email `admin@zitadel.example.com`. The actual login username in Zitadel is `admin` (the `UserName` field). This discrepancy exists in the current code.
**Validation on k3d**:
```bash
# Wait for pods to be ready (Helm returns before readiness)
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=zitadel -n zitadel --timeout=300s
# Port-forward to access Zitadel (ingress won't work without proper DNS/TLS on k3d)
kubectl port-forward svc/zitadel -n zitadel 8080:8080
# Access at http://localhost:8080 (note: ExternalSecure=true may cause redirect issues)
```
**Known issues for k3d deployment**:
- `ExternalSecure: true` tells Zitadel to expect HTTPS, but k3d port-forward is HTTP. This may cause redirect loops. Workaround: modify the example to set `ExternalSecure: false` for local dev.
- The CNPG operator must be installed on the cluster. `K8sAnywhereTopology` handles this via the `PostgreSQL` trait implementation, which deploys the operator first.
#### Step 3: Deploy OpenBao
OpenBao requires only `Topology + K8sclient + HelmCommand` (no PostgreSQL dependency).
```bash
cargo run -p example-openbao
```
**What happens internally** (see `harmony/src/modules/openbao/mod.rs`):
1. `OpenbaoScore` directly delegates to `HelmChartScore.create_interpret()` -- there is no custom `execute()` logic, no namespace creation step, no secret generation
2. Deploys OpenBao Helm chart (`openbao/openbao` from `https://openbao.github.io/openbao-helm`):
- `chart_version: None` -- **uses latest chart version** (not pinned)
- `create_namespace: true` -- the `openbao` namespace is created by Helm
- `install_only: false` -- uses `helm upgrade --install`
**Exact Helm values set by OpenbaoScore**:
```yaml
global:
openshift: true # <-- PROBLEM: hardcoded, see below
server:
standalone:
enabled: true
config: |
ui = true
listener "tcp" {
tls_disable = true
address = "[::]:8200"
cluster_address = "[::]:8201"
}
storage "file" {
path = "/openbao/data"
}
service:
enabled: true
ingress:
enabled: true
hosts:
- host: <host field> # e.g., openbao.sebastien.sto1.nationtech.io
dataStorage:
enabled: true
size: 10Gi
storageClass: null # uses cluster default
accessMode: ReadWriteOnce
auditStorage:
enabled: true
size: 10Gi
storageClass: null
accessMode: ReadWriteOnce
ui:
enabled: true
```
**Critical issue: `global.openshift: true` is hardcoded.** The OpenBao Helm chart default is `global.openshift: false`. When set to `true`, the chart adjusts security contexts and may create OpenShift Routes instead of standard Kubernetes Ingress resources. **On k3d (vanilla k8s), this will produce resources that may not work correctly.** Before deploying on k3d, this must be overridden.
**Fix required for k3d**: Either:
1. Modify `OpenbaoScore` to accept an `openshift: bool` field (preferred long-term fix)
2. Or for this example, create a custom example that passes `values_overrides` with `global.openshift=false`
**Post-deployment initialization** (manual -- the TODO in `mod.rs` acknowledges this is not automated):
OpenBao starts in a sealed state. You must initialize and unseal it manually. See https://openbao.org/docs/platform/k8s/helm/run/
```bash
# Initialize OpenBao (generates unseal keys + root token)
kubectl exec -n openbao openbao-0 -- bao operator init
# Save the output! It contains 5 unseal keys and the root token.
# Example output:
# Unseal Key 1: abc123...
# Unseal Key 2: def456...
# ...
# Initial Root Token: hvs.xxxxx
# Unseal (requires 3 of 5 keys by default)
kubectl exec -n openbao openbao-0 -- bao operator unseal <key1>
kubectl exec -n openbao openbao-0 -- bao operator unseal <key2>
kubectl exec -n openbao openbao-0 -- bao operator unseal <key3>
```
**Validation**:
```bash
kubectl exec -n openbao openbao-0 -- bao status
# Should show "Sealed: false"
```
**Note**: The ingress has **no TLS configuration** (unlike Zitadel's ingress). Access is HTTP-only unless you configure TLS separately.
#### Step 4: Configure OpenBao for Harmony
Two paths are available depending on the authentication method:
##### Path A: Userpass auth (simpler, for local dev)
The current `OpenbaoSecretStore` supports **token** and **userpass** authentication. It does NOT yet implement the JWT/OIDC device flow described in ADR 020-1.
```bash
# Port-forward to access OpenBao API
kubectl port-forward svc/openbao -n openbao 8200:8200 &
export BAO_ADDR="http://127.0.0.1:8200"
export BAO_TOKEN="<root token from init>"
# Enable KV v2 secrets engine (default mount "secret")
bao secrets enable -path=secret kv-v2
# Enable userpass auth method
bao auth enable userpass
# Create policy granting read/write on harmony/* paths
cat <<'EOF' | bao policy write harmony-dev -
path "secret/data/harmony/*" {
capabilities = ["create", "read", "update", "delete", "list"]
}
path "secret/metadata/harmony/*" {
capabilities = ["list", "read", "delete"]
}
EOF
# Create the user with the policy attached
bao write auth/userpass/users/harmony \
password="harmony-dev-password" \
policies="harmony-dev"
```
**Bug in `OpenbaoSecretStore::authenticate_userpass()`**: The `kv_mount` parameter (default `"secret"`) is passed to `vaultrs::auth::userpass::login()` as the auth mount path. This means it calls `POST /v1/auth/secret/login/{username}` instead of the correct `POST /v1/auth/userpass/login/{username}`. **The auth mount and KV mount are conflated into one parameter.**
**Workaround**: Set `OPENBAO_KV_MOUNT=userpass` so the auth call hits the correct mount path. But then KV operations would use mount `userpass` instead of `secret`, which is wrong.
**Proper fix needed**: Split `kv_mount` into two separate parameters: one for the KV v2 engine mount (`secret`) and one for the auth mount (`userpass`). This is a bug in `harmony_secret/src/store/openbao.rs:234`.
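One possible shape for that split, sketched with illustrative names (this is a hypothetical refactor, not the current `harmony_secret` or `vaultrs` API): keep the two mount paths as separate fields so the auth call and the KV operations can never be conflated again.

```rust
// Hypothetical config after the fix: one mount per concern.
struct OpenbaoMounts {
    kv_mount: String,   // KV v2 engine mount, e.g. "secret"
    auth_mount: String, // userpass auth method mount, e.g. "userpass"
}

impl OpenbaoMounts {
    // Login must hit the auth mount: /v1/auth/{auth_mount}/login/{username}
    fn login_path(&self, username: &str) -> String {
        format!("/v1/auth/{}/login/{}", self.auth_mount, username)
    }

    // KV reads/writes hit the KV v2 mount: /v1/{kv_mount}/data/...
    fn kv_data_path(&self, key: &str) -> String {
        format!("/v1/{}/data/harmony/{}", self.kv_mount, key)
    }
}

fn main() {
    let mounts = OpenbaoMounts {
        kv_mount: "secret".to_string(),
        auth_mount: "userpass".to_string(),
    };
    println!("{}", mounts.login_path("harmony"));
    println!("{}", mounts.kv_data_path("AppConfig"));
}
```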
**For this example**: Use **token auth** instead of userpass to sidestep the bug:
```bash
# Set env vars for the example
export OPENBAO_URL="http://127.0.0.1:8200"
export OPENBAO_TOKEN="<root token from init>"
export OPENBAO_KV_MOUNT="secret"
```
##### Path B: JWT auth with Zitadel (target architecture, per ADR 020-1)
This is the production path described in the ADR. It requires the device flow code that is **not yet implemented** in `OpenbaoSecretStore`. The current code only supports token and userpass.
When implemented, the flow will be:
1. Enable JWT auth method in OpenBao
2. Configure it to trust Zitadel's OIDC discovery URL
3. Create a role that maps Zitadel JWT claims to OpenBao policies
```bash
# Enable JWT auth
bao auth enable jwt
# Configure JWT auth to trust Zitadel
bao write auth/jwt/config \
oidc_discovery_url="https://<zitadel-host>" \
bound_issuer="https://<zitadel-host>"
# Create role for Harmony developers
bao write auth/jwt/role/harmony-developer \
role_type="jwt" \
bound_audiences="<harmony_client_id>" \
user_claim="email" \
groups_claim="urn:zitadel:iam:org:project:roles" \
policies="harmony-dev" \
ttl="4h" \
max_ttl="24h" \
token_type="service"
```
**Zitadel application setup** (in Zitadel console):
1. Create project: `Harmony`
2. Add application: `Harmony CLI` (Native app type)
3. Enable Device Authorization grant type
4. Set scopes: `openid email profile offline_access`
5. Note the `client_id`
This path is deferred until the device flow is implemented in `OpenbaoSecretStore`.
#### Step 5: Write end-to-end example
The example uses `StoreSource<OpenbaoSecretStore>` with token auth to avoid the userpass mount bug.
**Environment variables required** (from `harmony_secret/src/config.rs`):
| Variable | Required | Default | Notes |
|---|---|---|---|
| `OPENBAO_URL` | Yes | None | Falls back to `VAULT_ADDR` |
| `OPENBAO_TOKEN` | For token auth | None | Root or user token |
| `OPENBAO_USERNAME` | For userpass | None | Requires `OPENBAO_PASSWORD` too |
| `OPENBAO_PASSWORD` | For userpass | None | |
| `OPENBAO_KV_MOUNT` | No | `"secret"` | KV v2 engine mount path. **Also used as userpass auth mount -- this is a bug.** |
| `OPENBAO_SKIP_TLS` | No | `false` | Set `"true"` to disable TLS verification |
**Note**: `OpenbaoSecretStore::new()` is `async` and **requires a running OpenBao** at construction time (it validates the token if using cached auth). If OpenBao is unreachable during construction, the call will fail. The graceful fallback only applies to `StoreSource::get()` calls after construction -- the `ConfigManager` must be built with a live store, or the store must be wrapped in a lazy initialization pattern.
```rust
// harmony_config/examples/openbao_chain.rs
use harmony_config::{ConfigManager, EnvSource, SqliteSource, StoreSource};
use harmony_secret::OpenbaoSecretStore;
use serde::{Deserialize, Serialize};
use std::sync::Arc;

#[derive(Debug, Clone, Serialize, Deserialize, schemars::JsonSchema, PartialEq)]
struct AppConfig {
    host: String,
    port: u16,
}

impl harmony_config::Config for AppConfig {
    const KEY: &'static str = "AppConfig";
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    env_logger::init();

    // Build the source chain
    let env_source: Arc<dyn harmony_config::ConfigSource> = Arc::new(EnvSource);
    let sqlite = Arc::new(
        SqliteSource::default()
            .await
            .expect("Failed to open SQLite"),
    );

    // OpenBao store -- requires OPENBAO_URL and OPENBAO_TOKEN env vars
    // Falls back gracefully if OpenBao is unreachable at query time
    let openbao_url = std::env::var("OPENBAO_URL")
        .or_else(|_| std::env::var("VAULT_ADDR"))
        .ok();

    let sources: Vec<Arc<dyn harmony_config::ConfigSource>> = if let Some(url) = openbao_url {
        let kv_mount = std::env::var("OPENBAO_KV_MOUNT")
            .unwrap_or_else(|_| "secret".to_string());
        let skip_tls = std::env::var("OPENBAO_SKIP_TLS")
            .map(|v| v == "true")
            .unwrap_or(false);
        match OpenbaoSecretStore::new(
            url,
            kv_mount,
            skip_tls,
            std::env::var("OPENBAO_TOKEN").ok(),
            std::env::var("OPENBAO_USERNAME").ok(),
            std::env::var("OPENBAO_PASSWORD").ok(),
        )
        .await
        {
            Ok(store) => {
                let store_source = Arc::new(StoreSource::new("harmony".to_string(), store));
                vec![env_source, Arc::clone(&sqlite) as _, store_source]
            }
            Err(e) => {
                eprintln!("Warning: OpenBao unavailable ({e}), using local sources only");
                vec![env_source, sqlite]
            }
        }
    } else {
        println!("No OPENBAO_URL set, using local sources only");
        vec![env_source, sqlite]
    };

    let manager = ConfigManager::new(sources);

    // Scenario 1: get() with nothing stored -- returns NotFound
    let result = manager.get::<AppConfig>().await;
    println!("Get (empty): {:?}", result);

    // Scenario 2: set() then get()
    let config = AppConfig {
        host: "production.example.com".to_string(),
        port: 443,
    };
    manager.set(&config).await?;
    println!("Set: {:?}", config);

    let retrieved = manager.get::<AppConfig>().await?;
    println!("Get (after set): {:?}", retrieved);
    assert_eq!(config, retrieved);

    println!("End-to-end chain validated!");
    Ok(())
}
```
**Key behaviors demonstrated**:
1. **Graceful construction fallback**: If `OPENBAO_URL` is not set or OpenBao is unreachable at startup, the chain is built without it
2. **Graceful query fallback**: `StoreSource::get()` returns `Ok(None)` on any error, so the chain continues to SQLite
3. **Environment override**: `HARMONY_CONFIG_AppConfig='{"host":"env-host","port":9090}'` bypasses all backends
#### Step 6: Validate graceful fallback
Already validated via unit tests (26 tests pass):
- `test_store_source_error_falls_through_to_sqlite` -- `StoreSource` with `AlwaysErrorStore` returns connection error, chain falls through to `SqliteSource`
- `test_store_source_not_found_falls_through_to_sqlite` -- `StoreSource` returns `NotFound`, chain falls through to `SqliteSource`
**Code path (FIXED in `harmony_config/src/source/store.rs`)**:
```rust
// StoreSource::get() -- returns Ok(None) on ANY error, allowing chain to continue
match self.store.get_raw(&self.namespace, key).await {
    Ok(bytes) => { /* deserialize and return */ Ok(Some(value)) }
    Err(SecretStoreError::NotFound { .. }) => Ok(None),
    Err(_) => Ok(None), // Connection errors, timeouts, etc.
}
```
#### Step 7: Known issues and blockers
| Issue | Location | Severity | Status |
|---|---|---|---|
| `global.openshift: true` hardcoded | `harmony/src/modules/openbao/mod.rs:32` | **Blocker for k3d** | ✅ Fixed: Added `openshift: bool` field to `OpenbaoScore` (defaults to `false`) |
| `kv_mount` used as auth mount path | `harmony_secret/src/store/openbao.rs:234` | **Bug** | ✅ Fixed: Added separate `auth_mount` parameter; added `OPENBAO_AUTH_MOUNT` env var |
| Admin email hardcoded `admin@zitadel.example.com` | `harmony/src/modules/zitadel/mod.rs:314` | Minor | Cosmetic mismatch with success message |
| `ExternalSecure: true` hardcoded | `harmony/src/modules/zitadel/mod.rs:306` | **Issue for k3d** | ✅ Fixed: Zitadel now detects Kubernetes distribution and uses appropriate settings (OpenShift = TLS + cert-manager annotations, k3d = plain nginx ingress without TLS) |
| No Helm chart version pinning | Both modules | Risk | Non-deterministic deploys |
| No `--wait` on Helm install | `harmony/src/modules/helm/chart.rs` | UX | Must manually wait for readiness |
| `get_version()`/`get_status()` are `todo!()` | Both modules | Panic risk | Do not call these methods |
| JWT/OIDC device flow not implemented | `harmony_secret/src/store/openbao.rs` | **Gap** | ✅ Implemented: `ZitadelOidcAuth` in `harmony_secret/src/store/zitadel.rs` |
| `HARMONY_SECRET_NAMESPACE` panics if not set | `harmony_secret/src/config.rs:5` | Runtime panic | Only affects `SecretManager`, not `StoreSource` directly |
**Remaining work**:
- [x] `StoreSource<OpenbaoSecretStore>` integration validated (compiles)
- [x] StoreSource returns `Ok(None)` on connection error (not `Err`)
- [x] Graceful fallback tests pass when OpenBao is unreachable (2 new tests)
- [x] Fix `global.openshift: true` in `OpenbaoScore` for k3d compatibility
- [x] Fix `kv_mount` / auth mount conflation bug in `OpenbaoSecretStore`
- [x] Create and test `harmony_config/examples/openbao_chain.rs` against real k3d deployment
- [x] Implement JWT/OIDC device flow in `OpenbaoSecretStore` (ADR 020-1) — `ZitadelOidcAuth` implemented and wired into `OpenbaoSecretStore::new()` auth chain
- [x] Fix Zitadel distribution detection — Zitadel now uses `k8s_client.get_k8s_distribution()` to detect OpenShift vs k3d and applies appropriate Helm values (TLS + cert-manager for OpenShift, plain nginx for k3d)
### 1.5 UX validation checklist ⏳
**Status**: Partially complete - manual verification needed
- [ ] `cargo run --example postgresql` with no env vars → prompts for nothing
- [ ] An example that uses `SecretManager` today (e.g., `brocade_snmp_server`) → when migrated to `harmony_config`, first run prompts, second run reads from SQLite
- [ ] Setting `HARMONY_CONFIG_BrocadeSwitchAuth='{"host":"...","user":"...","password":"..."}'` → skips prompt, uses env value
- [ ] Deleting `~/.local/share/harmony/config/` directory → re-prompts on next run
## Deliverables
- [x] `SqliteSource` implementation with tests
- [x] Functional `PromptSource` with `should_persist()` design
- [x] Fix `get_or_prompt` to persist to first writable source (via `should_persist()`), not all sources
- [x] Integration tests for full resolution chain
- [x] Branch-switching deserialization failure test
- [x] `StoreSource<OpenbaoSecretStore>` integration validated (compiles, graceful fallback)
- [x] ADR for Zitadel OIDC target architecture
- [ ] Update docs to reflect final implementation and behavior
## Key Implementation Notes
1. **SQLite path**: `~/.local/share/harmony/config/config.db` (not `~/.local/share/harmony/config.db`)
2. **Auto-create directory**: `SqliteSource::open()` creates parent directories if they don't exist
3. **Default path**: `SqliteSource::default()` uses `directories::ProjectDirs` to find the correct data directory
4. **Env var precedence**: Environment variables always take precedence over SQLite in the resolution chain
5. **Testing**: All tests use `tempfile::NamedTempFile` for temporary database paths, ensuring test isolation
6. **Graceful fallback**: `StoreSource::get()` returns `Ok(None)` on any error (connection refused, timeout, etc.), allowing the chain to fall through to the next source. This ensures OpenBao unavailability doesn't break the config chain.
7. **StoreSource errors don't block chain**: When OpenBao is unreachable, `StoreSource::get()` returns `Ok(None)` and the `ConfigManager` continues to the next source (typically `SqliteSource`). This is validated by `test_store_source_error_falls_through_to_sqlite` and `test_store_source_not_found_falls_through_to_sqlite`.

# Phase 2: Migrate Workspace to `harmony_config`
## Goal
Replace every direct `harmony_secret::SecretManager` call with `harmony_config` equivalents. After this phase, modules and examples depend only on `harmony_config`. `harmony_secret` becomes an internal implementation detail behind `StoreSource`.
## Current State
19 call sites use `SecretManager::get_or_prompt::<T>()` across:
| Location | Secret Types | Call Sites |
|----------|-------------|------------|
| `harmony/src/modules/brocade/brocade_snmp.rs` | `BrocadeSnmpAuth`, `BrocadeSwitchAuth` | 2 |
| `harmony/src/modules/nats/score_nats_k8s.rs` | `NatsAdmin` | 1 |
| `harmony/src/modules/okd/bootstrap_02_bootstrap.rs` | `RedhatSecret`, `SshKeyPair` | 2 |
| `harmony/src/modules/application/features/monitoring.rs` | `NtfyAuth` | 1 |
| `brocade/examples/main.rs` | `BrocadeSwitchAuth` | 1 |
| `examples/okd_installation/src/main.rs` + `topology.rs` | `SshKeyPair`, `BrocadeSwitchAuth`, `OPNSenseFirewallConfig` | 3 |
| `examples/okd_pxe/src/main.rs` + `topology.rs` | `SshKeyPair`, `BrocadeSwitchAuth`, `OPNSenseFirewallCredentials` | 3 |
| `examples/opnsense/src/main.rs` | `OPNSenseFirewallCredentials` | 1 |
| `examples/sttest/src/main.rs` + `topology.rs` | `SshKeyPair`, `OPNSenseFirewallConfig` | 2 |
| `examples/opnsense_node_exporter/` | (has dep but unclear usage) | ~1 |
| `examples/okd_cluster_alerts/` | (has dep but unclear usage) | ~1 |
| `examples/brocade_snmp_server/` | (has dep but unclear usage) | ~1 |
## Tasks
### 2.1 Bootstrap `harmony_config` in CLI and TUI entry points
Add `harmony_config::init()` as the first thing that happens in `harmony_cli::run()` and `harmony_tui::run()`.
```rust
// harmony_cli/src/lib.rs — inside run()
pub async fn run<T: Topology + Send + Sync + 'static>(
    inventory: Inventory,
    topology: T,
    scores: Vec<Box<dyn Score<T>>>,
    args_struct: Option<Args>,
) -> Result<(), Box<dyn std::error::Error>> {
    // Initialize config system with default source chain
    let sqlite = Arc::new(SqliteSource::default().await?);
    let env = Arc::new(EnvSource);
    harmony_config::init(vec![env, sqlite]).await;
    // ... rest of run()
}
```
This replaces the implicit `SecretManager` lazy initialization that currently happens on first `get_or_prompt` call.
### 2.2 Migrate each secret type from `Secret` to `Config`
For each secret struct, change:
```rust
// Before
use harmony_secret::Secret;
#[derive(Debug, Clone, Serialize, Deserialize, JsonSchema, InteractiveParse, Secret)]
struct BrocadeSwitchAuth { ... }
// After
use harmony_config::Config;
#[derive(Debug, Clone, Serialize, Deserialize, JsonSchema, InteractiveParse, Config)]
struct BrocadeSwitchAuth { ... }
```
At each call site, change:
```rust
// Before
let config = SecretManager::get_or_prompt::<BrocadeSwitchAuth>().await.unwrap();
// After
let config = harmony_config::get_or_prompt::<BrocadeSwitchAuth>().await.unwrap();
```
### 2.3 Migration order (low risk to high risk)
1. **`brocade/examples/main.rs`** — 1 call site, isolated example, easy to test manually
2. **`examples/opnsense/src/main.rs`** — 1 call site, isolated
3. **`harmony/src/modules/brocade/brocade_snmp.rs`** — 2 call sites, core module but straightforward
4. **`harmony/src/modules/nats/score_nats_k8s.rs`** — 1 call site
5. **`harmony/src/modules/application/features/monitoring.rs`** — 1 call site
6. **`examples/sttest/`** — 2 call sites, has both main.rs and topology.rs patterns
7. **`examples/okd_installation/`** — 3 call sites, complex topology setup
8. **`examples/okd_pxe/`** — 3 call sites, similar to okd_installation
9. **`harmony/src/modules/okd/bootstrap_02_bootstrap.rs`** — 2 call sites, critical OKD bootstrap path
### 2.4 Remove `harmony_secret` from direct dependencies
After all call sites are migrated:
1. Remove `harmony_secret` from `Cargo.toml` of: `harmony`, `brocade`, and all examples that had it
2. `harmony_config` keeps `harmony_secret` as a dependency (for `StoreSource`)
3. The `Secret` trait and `SecretManager` remain in `harmony_secret` but are not used directly anymore
### 2.5 Backward compatibility for existing local secrets
Users who already have secrets stored via `LocalFileSecretStore` (JSON files in `~/.local/share/harmony/secrets/`) need a migration path:
- On first run after upgrade, if SQLite has no entry for a key but the old JSON file exists, read from JSON and write to SQLite
- Or: add `LocalFileSource` as a fallback source at the end of the chain (read-only) for one release cycle
- Log a deprecation warning when reading from old JSON files
## Deliverables
- [ ] `harmony_config::init()` called in `harmony_cli::run()` and `harmony_tui::run()`
- [ ] All 19 call sites migrated from `SecretManager` to `harmony_config`
- [ ] `harmony_secret` removed from direct dependencies of `harmony`, `brocade`, and all examples
- [ ] Backward compatibility for existing local JSON secrets
- [ ] All existing unit tests still pass
- [ ] Manual verification: one migrated example works end-to-end (prompt → persist → read)

ROADMAP/03-assets-crate.md
# Phase 3: Complete `harmony_assets`, Refactor Consumers
## Goal
Make `harmony_assets` the single way to manage downloadable binaries and images across Harmony. Eliminate `k3d::DownloadableAsset` duplication, implement `Url::Url` in OPNsense infra, remove LFS-tracked files from git.
## Current State
- `harmony_assets` exists with `Asset`, `LocalCache`, `LocalStore`, `S3Store` (behind feature flag). CLI with `upload`, `download`, `checksum`, `verify` commands. **No tests. Zero consumers.**
- `k3d/src/downloadable_asset.rs` has the same functionality with full test coverage (httptest mock server, checksum verification, cache hit, 404 handling, checksum failure).
- `Url::Url` variant in `harmony_types/src/net.rs` exists but is `todo!()` in OPNsense TFTP and HTTP infra layers.
- OKD modules hardcode `./data/...` paths (`bootstrap_02_bootstrap.rs:84-88`, `ipxe.rs:73`).
- `data/` directory contains ~3GB of LFS-tracked files (OKD binaries, PXE images, SCOS images).
## Tasks
### 3.1 Port k3d tests to `harmony_assets`
The k3d crate has 5 well-written tests in `downloadable_asset.rs`. Port them to test `harmony_assets::LocalStore`:
```rust
// harmony_assets/tests/local_store.rs (or in src/ as unit tests)
#[tokio::test]
async fn test_fetch_downloads_and_verifies_checksum() {
    // Start httptest server serving a known file
    // Create Asset with URL pointing to mock server
    // Fetch via LocalStore
    // Assert file exists at expected cache path
    // Assert checksum matches
}

#[tokio::test]
async fn test_fetch_returns_cached_file_when_present() {
    // Pre-populate cache with correct file
    // Fetch — assert no HTTP request made (mock server not hit)
}

#[tokio::test]
async fn test_fetch_fails_on_404() { ... }

#[tokio::test]
async fn test_fetch_fails_on_checksum_mismatch() { ... }

#[tokio::test]
async fn test_fetch_with_progress_callback() {
    // Assert progress callback is called with (bytes_received, total_size)
}
```
Add `httptest` to `[dev-dependencies]` of `harmony_assets`.
### 3.2 Refactor `k3d` to use `harmony_assets`
Replace `k3d/src/downloadable_asset.rs` with calls to `harmony_assets`:
```rust
// k3d/src/lib.rs — in download_latest_release()
use harmony_assets::{Asset, LocalCache, LocalStore, ChecksumAlgo};

let asset = Asset::new(
    binary_url,
    checksum,
    ChecksumAlgo::SHA256,
    K3D_BIN_FILE_NAME.to_string(),
);
let cache = LocalCache::new(self.base_dir.clone());
let store = LocalStore::new();
let path = store.fetch(&asset, &cache, None).await
    .map_err(|e| format!("Failed to download k3d: {}", e))?;
```
Delete `k3d/src/downloadable_asset.rs`. Update k3d's `Cargo.toml` to depend on `harmony_assets`.
### 3.3 Define asset metadata as config structs
Following `plan.md` Phase 2, create typed config for OKD assets using `harmony_config`:
```rust
// harmony/src/modules/okd/config.rs
#[derive(Config, Serialize, Deserialize, JsonSchema, InteractiveParse)]
struct OkdInstallerConfig {
    pub openshift_install_url: String,
    pub openshift_install_sha256: String,
    pub scos_kernel_url: String,
    pub scos_kernel_sha256: String,
    pub scos_initramfs_url: String,
    pub scos_initramfs_sha256: String,
    pub scos_rootfs_url: String,
    pub scos_rootfs_sha256: String,
}
```
First run prompts for URLs/checksums (or uses compiled-in defaults). Values persist to SQLite. Can be overridden via env vars or OpenBao.
### 3.4 Implement `Url::Url` in OPNsense infra layer
In `harmony/src/infra/opnsense/http.rs` and `tftp.rs`, implement the `Url::Url(url)` match arm:
```rust
// Instead of SCP-ing files to OPNsense:
// SSH into OPNsense, run: fetch -o /usr/local/http/{path} {url}
// (FreeBSD-native HTTP client, no extra deps on OPNsense)
```
This eliminates the manual `scp` workaround and the `inquire::Confirm` prompts in `ipxe.rs:126` and `bootstrap_02_bootstrap.rs:230`.
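The command the infra layer would run over SSH can be sketched as below. The host, asset URL, and web-root path are illustrative placeholders, not values from the codebase:

```shell
#!/bin/sh
# Compose the FreeBSD-native fetch command for a Url::Url asset.
# URL and destination are hypothetical examples.
url="https://assets.example.com/pxe/scos-live-kernel-x86_64"
dest="/usr/local/http/scos-live-kernel-x86_64"
cmd="fetch -o ${dest} ${url}"
# The infra layer would then run: ssh root@<opnsense-host> "$cmd"
echo "$cmd"
```

Because `fetch` ships with FreeBSD, OPNsense pulls the file itself and no bytes transit the Harmony workstation.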
### 3.5 Refactor OKD modules to use assets + config
In `bootstrap_02_bootstrap.rs`:
- `openshift-install`: Resolve `OkdInstallerConfig` from `harmony_config`, download via `harmony_assets`, invoke from cache.
- SCOS images: Pass `Url::Url(scos_kernel_url)` etc. to `StaticFilesHttpScore`. OPNsense fetches from S3 directly.
- Remove `oc` and `kubectl` from `data/okd/bin/` (never used by code).
In `ipxe.rs`:
- Replace the folder-to-serve SCP workaround with individual `Url::Url` entries.
- Remove the `inquire::Confirm` SCP prompts.
### 3.6 Upload assets to S3
- Upload all current `data/` binaries to Ceph S3 bucket with path scheme: `harmony-assets/okd/v{version}/openshift-install`, `harmony-assets/pxe/centos-stream-9/install.img`, etc.
- Set public-read ACL or configure presigned URL generation.
- Record S3 URLs and SHA256 checksums as defaults in the config structs.
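The checksum-then-upload step above can be sketched as follows. The bucket path follows the scheme in this section; the `aws s3 cp` invocation is one option among S3 clients, and the file contents here are a stand-in for the real artifact:

```shell
#!/bin/sh
# Record the SHA256 alongside the artifact, then upload both.
file="openshift-install"
printf 'fake-binary-contents' > "$file"   # stand-in for the real binary
sha256sum "$file" | awk '{print $1}' > "${file}.sha256"
# Hypothetical upload command following the harmony-assets path scheme:
echo "would run: aws s3 cp $file s3://harmony-assets/okd/v{version}/$file --acl public-read"
cat "${file}.sha256"
```

The recorded hash becomes the compiled-in default in `OkdInstallerConfig`, so a corrupted or tampered download fails verification at fetch time.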
### 3.7 Remove LFS, clean git
- Remove all LFS-tracked files from the repo.
- Update `.gitattributes` to remove LFS filters.
- Keep `data/` in `.gitignore` (it becomes a local cache directory).
- Optionally use `git filter-repo` or BFG to strip LFS objects from history (required before Phase 4 GitHub publish).
## Deliverables
- [ ] `harmony_assets` has tests ported from k3d pattern (5+ tests with httptest)
- [ ] `k3d::DownloadableAsset` replaced by `harmony_assets` usage
- [ ] `OkdInstallerConfig` struct using `harmony_config`
- [ ] `Url::Url` implemented in OPNsense HTTP and TFTP infra
- [ ] OKD bootstrap refactored to use lazy-download pattern
- [ ] Assets uploaded to S3 with documented URLs/checksums
- [ ] LFS removed, git history cleaned
- [ ] Repo size small enough for GitHub (~code + templates only)

# Phase 4: Publish to GitHub
## Goal
Make Harmony publicly available on GitHub as the primary community hub for issues, pull requests, and discussions. CI runs on self-hosted runners.
## Prerequisites
- Phase 3 complete: LFS removed, git history cleaned, repo is small
- README polished with quick-start, architecture overview, examples
- All existing tests pass
## Tasks
### 4.1 Clean git history
```bash
# Option A: git filter-repo (preferred)
git filter-repo --strip-blobs-bigger-than 10M
# Option B: BFG Repo Cleaner
bfg --strip-blobs-bigger-than 10M
git reflog expire --expire=now --all
git gc --prune=now --aggressive
```
Verify final repo size is reasonable (target: <50MB including all code, docs, templates).
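Size verification can be scripted; shown here against a throwaway repo so the commands are self-contained (run the same two `git` commands from the cleaned repo root):

```shell
#!/bin/sh
# Create a minimal demo repo, then inspect its object store.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git -c user.email=ci@example.com -c user.name=ci commit -q --allow-empty -m "init"
git count-objects -vH                   # "size-pack" is the on-disk pack size
git rev-list --objects --all | wc -l    # total reachable object count
```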
### 4.2 Create GitHub repository
- Create `NationTech/harmony` (or chosen org/name) on GitHub
- Push cleaned repo as initial commit
- Set default branch to `main` (rename from `master` if desired)
### 4.3 Set up CI on self-hosted runners
GitHub is the community hub, but CI runs on your own infrastructure. Options:
**Option A: GitHub Actions with self-hosted runners**
- Register your Gitea runner machines as GitHub Actions self-hosted runners
- Port `.gitea/workflows/check.yml` to `.github/workflows/check.yml`
- Same Docker image (`hub.nationtech.io/harmony/harmony_composer:latest`), same commands
- Pro: native GitHub PR checks, no external service needed
- Con: runners need outbound access to GitHub API
**Option B: External CI (Woodpecker, Drone, Jenkins)**
- Use any CI that supports webhooks from GitHub
- Report status back to GitHub via commit status API / checks API
- Pro: fully self-hosted, no GitHub dependency for builds
- Con: extra integration work
**Option C: Keep Gitea CI, mirror from GitHub**
- GitHub repo has a webhook that triggers Gitea CI on push
- Gitea reports back to GitHub via commit status API
- Pro: no migration of CI config
- Con: fragile webhook chain
**Recommendation**: Option A. GitHub Actions self-hosted runners are straightforward and give the best contributor UX (native PR checks). The workflow files are nearly identical to Gitea workflows.
```yaml
# .github/workflows/check.yml
name: Check
on: [push, pull_request]
jobs:
  check:
    runs-on: self-hosted
    container:
      image: hub.nationtech.io/harmony/harmony_composer:latest
    steps:
      - uses: actions/checkout@v4
      - run: bash build/check.sh
```
### 4.4 Polish documentation
- **README.md**: Quick-start (clone → run → get prompted → see result), architecture diagram (Score → Interpret → Topology), link to docs and examples
- **CONTRIBUTING.md**: Already exists. Review for GitHub-specific guidance (fork workflow, PR template)
- **docs/**: Already comprehensive. Verify links work on GitHub rendering
- **Examples**: Ensure each example has a one-line description in its `Cargo.toml` and a comment block in `main.rs`
### 4.5 License and legal
- Verify workspace `license` field in root `Cargo.toml` is set correctly
- Add `LICENSE` file at repo root if not present
- Scan for any proprietary dependencies or hardcoded internal URLs
### 4.6 GitHub repository configuration
- Branch protection on `main`: require PR review, require CI to pass
- Issue templates: bug report, feature request
- PR template: checklist (tests pass, docs updated, etc.)
- Topics/tags: `rust`, `infrastructure-as-code`, `kubernetes`, `orchestration`, `bare-metal`
- Repository description: "Infrastructure orchestration framework. Declare what you want (Score), describe your infrastructure (Topology), let Harmony figure out how."
### 4.7 Gitea as internal mirror
- Set up Gitea to mirror from GitHub (pull mirror)
- Internal CI can continue running on Gitea for private/experimental branches
- Public contributions flow through GitHub
## Deliverables
- [ ] Git history cleaned, repo size <50MB
- [ ] Public GitHub repository created
- [ ] CI running on self-hosted runners with GitHub Actions
- [ ] Branch protection enabled
- [ ] README polished with quick-start guide
- [ ] Issue and PR templates created
- [ ] LICENSE file present
- [ ] Gitea configured as mirror

# Phase 5: E2E Tests for PostgreSQL & RustFS
## Goal
Establish an automated E2E test pipeline that proves working examples actually work. Start with the two simplest k8s-based examples: PostgreSQL and RustFS.
## Prerequisites
- Phase 1 complete (config crate works, bootstrap is clean)
- `feat/rustfs` branch merged
## Architecture
### Test harness: `tests/e2e/`
A dedicated workspace member crate at `tests/e2e/` that contains:
1. **Shared k3d utilities** — create/destroy clusters, wait for readiness
2. **Per-example test modules** — each example gets a `#[tokio::test]` function
3. **Assertion helpers** — wait for pods, check CRDs exist, verify services
```
tests/
  e2e/
    Cargo.toml
    src/
      lib.rs          # Shared test utilities
      k3d.rs          # k3d cluster lifecycle
      k8s_assert.rs   # K8s assertion helpers
    tests/
      postgresql.rs   # PostgreSQL E2E test
      rustfs.rs       # RustFS E2E test
```
### k3d cluster lifecycle
```rust
// tests/e2e/src/k3d.rs
use k3d_rs::K3d;

pub struct TestCluster {
    pub name: String,
    pub k3d: K3d,
    pub client: kube::Client,
    reuse: bool,
}

impl TestCluster {
    /// Creates a k3d cluster for testing.
    /// If HARMONY_E2E_REUSE_CLUSTER=1, reuses existing cluster.
    pub async fn ensure(name: &str) -> Result<Self, String> {
        let reuse = std::env::var("HARMONY_E2E_REUSE_CLUSTER")
            .map(|v| v == "1")
            .unwrap_or(false);
        let base_dir = PathBuf::from("/tmp/harmony-e2e");
        let k3d = K3d::new(base_dir, Some(name.to_string()));
        let client = k3d.ensure_installed().await?;
        Ok(Self { name: name.to_string(), k3d, client, reuse })
    }

    /// Returns the kubeconfig path for this cluster.
    pub fn kubeconfig_path(&self) -> String { ... }
}

impl Drop for TestCluster {
    fn drop(&mut self) {
        if !self.reuse {
            // Best-effort cleanup
            let _ = self.k3d.run_k3d_command(["cluster", "delete", &self.name]);
        }
    }
}
```
### K8s assertion helpers
```rust
// tests/e2e/src/k8s_assert.rs

/// Wait until a pod matching the label selector is Running in the namespace.
/// Times out after `timeout` duration.
pub async fn wait_for_pod_running(
    client: &kube::Client,
    namespace: &str,
    label_selector: &str,
    timeout: Duration,
) -> Result<(), String>

/// Assert a CRD instance exists.
pub async fn assert_resource_exists<K: kube::Resource>(
    client: &kube::Client,
    name: &str,
    namespace: Option<&str>,
) -> Result<(), String>

/// Install a Helm chart. Returns when all pods in the release are running.
pub async fn helm_install(
    release_name: &str,
    chart: &str,
    namespace: &str,
    repo_url: Option<&str>,
    timeout: Duration,
) -> Result<(), String>
```
## Tasks
### 5.1 Create the `tests/e2e/` crate
Add to workspace `Cargo.toml`:
```toml
[workspace]
members = [
    # ... existing members
    "tests/e2e",
]
```
`tests/e2e/Cargo.toml`:
```toml
[package]
name = "harmony-e2e-tests"
edition = "2024"
publish = false
[dependencies]
harmony = { path = "../../harmony" }
harmony_cli = { path = "../../harmony_cli" }
harmony_types = { path = "../../harmony_types" }
k3d_rs = { path = "../../k3d", package = "k3d_rs" }
kube = { workspace = true }
k8s-openapi = { workspace = true }
tokio = { workspace = true }
log = { workspace = true }
env_logger = { workspace = true }
[dev-dependencies]
pretty_assertions = { workspace = true }
```
### 5.2 PostgreSQL E2E test
```rust
// tests/e2e/tests/postgresql.rs
use harmony::modules::postgresql::{PostgreSQLScore, capability::PostgreSQLConfig};
use harmony::topology::K8sAnywhereTopology;
use harmony::inventory::Inventory;
use harmony::maestro::Maestro;

#[tokio::test]
async fn test_postgresql_deploys_on_k3d() {
    let cluster = TestCluster::ensure("harmony-e2e-pg").await.unwrap();

    // Install CNPG operator via Helm
    // (K8sAnywhereTopology::ensure_ready() now handles this since
    // commit e1183ef "K8s postgresql score now ensures cnpg is installed")
    // But we may need the Helm chart for non-OKD:
    helm_install(
        "cnpg",
        "cloudnative-pg",
        "cnpg-system",
        Some("https://cloudnative-pg.github.io/charts"),
        Duration::from_secs(120),
    ).await.unwrap();

    // Configure topology pointing to test cluster
    let config = K8sAnywhereConfig {
        kubeconfig: Some(cluster.kubeconfig_path()),
        use_local_k3d: false,
        autoinstall: false,
        use_system_kubeconfig: false,
        harmony_profile: "dev".to_string(),
        k8s_context: None,
    };
    let topology = K8sAnywhereTopology::with_config(config);

    // Create and run the score
    let score = PostgreSQLScore {
        config: PostgreSQLConfig {
            cluster_name: "e2e-test-pg".to_string(),
            namespace: "e2e-pg-test".to_string(),
            ..Default::default()
        },
    };
    let mut maestro = Maestro::initialize(Inventory::autoload(), topology).await.unwrap();
    maestro.register_all(vec![Box::new(score)]);
    let scores = maestro.scores().read().unwrap().first().unwrap().clone_box();
    let result = maestro.interpret(scores).await;
    assert!(result.is_ok(), "PostgreSQL score failed: {:?}", result.err());

    // Assert: CNPG Cluster resource exists
    // (the Cluster CRD is applied — pod readiness may take longer)
    let client = cluster.client.clone();
    // ... assert Cluster CRD exists in e2e-pg-test namespace
}
```
### 5.3 RustFS E2E test
Similar structure. Details depend on what the RustFS score deploys (likely a Helm chart or k8s resources for MinIO/RustFS).
```rust
#[tokio::test]
async fn test_rustfs_deploys_on_k3d() {
    let cluster = TestCluster::ensure("harmony-e2e-rustfs").await.unwrap();
    // ... similar pattern: configure topology, create score, interpret, assert
}
```
### 5.4 CI job for E2E tests
New workflow file (Gitea or GitHub Actions):
```yaml
# .gitea/workflows/e2e.yml (or .github/workflows/e2e.yml)
name: E2E Tests
on:
  push:
    branches: [master, main]
  # Don't run on every PR — too slow. Run on label or manual trigger.
  workflow_dispatch:
jobs:
  e2e:
    runs-on: self-hosted  # Must have Docker available for k3d
    timeout-minutes: 15
    steps:
      - uses: actions/checkout@v4
      - name: Install k3d
        run: curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
      - name: Run E2E tests
        run: cargo test -p harmony-e2e-tests -- --test-threads=1
        env:
          RUST_LOG: info
```
Note `--test-threads=1`: E2E tests create k3d clusters and should not run in parallel (port conflicts, resource contention).
## Deliverables
- [ ] `tests/e2e/` crate added to workspace
- [ ] Shared test utilities: `TestCluster`, `wait_for_pod_running`, `helm_install`
- [ ] PostgreSQL E2E test passing
- [ ] RustFS E2E test passing (after `feat/rustfs` merge)
- [ ] CI job running E2E tests on push to main
- [ ] `HARMONY_E2E_REUSE_CLUSTER=1` for fast local iteration

ROADMAP/06-e2e-tests-kvm.md
# Phase 6: E2E Tests for OKD HA Cluster on KVM
## Goal
Prove the full OKD bare-metal installation flow works end-to-end using KVM virtual machines. This is the ultimate validation of Harmony's core value proposition: declare an OKD cluster, point it at infrastructure, watch it materialize.
## Prerequisites
- Phase 5 complete (test harness exists, k3d tests passing)
- `feature/kvm-module` merged to main
- A CI runner with libvirt/KVM access and nested virtualization support
## Architecture
The KVM branch already has a `kvm_okd_ha_cluster` example that creates:
```
               Host bridge (WAN)
                      |
            +--------------------+
            |      OPNsense      | 192.168.100.1
            |   gateway + PXE    |
            +---------+----------+
                      |
          harmonylan (192.168.100.0/24)
   +---------+---------+---------+---------+
   |         |         |         |         |
+--+-----+ +-+-----+ +-+-----+ +-+-----+ +-+-----+
|  cp0   | |  cp1  | |  cp2  | |worker0| |worker1|
|  .10   | |  .11  | |  .12  | |  .20  | |  .21  |
+--------+ +-------+ +-------+ +-------+ +---+---+
                                             |
                                        +----+-----+
                                        | worker2  |
                                        |   .22    |
                                        +----------+
```
The test needs to orchestrate this entire setup, wait for OKD to converge, and assert the cluster is healthy.
## Tasks
### 6.1 Start with `example_linux_vm` — the simplest KVM test
Before tackling the full OKD stack, validate the KVM module itself with the simplest possible test:
```rust
// tests/e2e/tests/kvm_linux_vm.rs
#[tokio::test]
#[ignore] // Requires libvirt access — run with: cargo test -- --ignored
async fn test_linux_vm_boots_from_iso() {
    let executor = KvmExecutor::from_env().unwrap();

    // Create isolated network
    let network = NetworkConfig {
        name: "e2e-test-net".to_string(),
        bridge: "virbr200".to_string(),
        // ...
    };
    executor.ensure_network(&network).await.unwrap();

    // Define and start VM
    let vm_config = VmConfig::builder("e2e-linux-test")
        .vcpus(1)
        .memory_gb(1)
        .disk(5)
        .network(NetworkRef::named("e2e-test-net"))
        .cdrom("https://releases.ubuntu.com/24.04/ubuntu-24.04-live-server-amd64.iso")
        .boot_order([BootDevice::Cdrom, BootDevice::Disk])
        .build();
    executor.ensure_vm(&vm_config).await.unwrap();
    executor.start_vm("e2e-linux-test").await.unwrap();

    // Assert VM is running
    let status = executor.vm_status("e2e-linux-test").await.unwrap();
    assert_eq!(status, VmStatus::Running);

    // Cleanup
    executor.destroy_vm("e2e-linux-test").await.unwrap();
    executor.undefine_vm("e2e-linux-test").await.unwrap();
    executor.delete_network("e2e-test-net").await.unwrap();
}
```
This test validates:
- ISO download works (via `harmony_assets` if refactored, or built-in KVM module download)
- libvirt XML generation is correct
- VM lifecycle (define → start → status → destroy → undefine)
- Network creation/deletion
### 6.2 OKD HA Cluster E2E test
The full integration test. This is long-running (30-60 minutes) and should only run nightly or on-demand.
```rust
// tests/e2e/tests/kvm_okd_ha.rs
#[tokio::test]
#[ignore] // Requires KVM + significant resources. Run nightly.
async fn test_okd_ha_cluster_on_kvm() {
    // 1. Create virtual infrastructure
    //    - OPNsense gateway VM
    //    - 3 control plane VMs
    //    - 3 worker VMs
    //    - Virtual network (harmonylan)
    // 2. Run OKD installation scores
    //    (the kvm_okd_ha_cluster example, but as a test)
    // 3. Wait for OKD API server to become reachable
    //    - Poll https://api.okd.harmonylan:6443 until it responds
    //    - Timeout: 30 minutes
    // 4. Assert cluster health
    //    - All nodes in Ready state
    //    - ClusterVersion reports Available=True
    //    - Sample workload (nginx) deploys and pod reaches Running
    // 5. Cleanup
    //    - Destroy all VMs
    //    - Delete virtual networks
    //    - Clean up disk images
}
```
### 6.3 CI runner requirements
The KVM E2E test needs a runner with:
- **Hardware**: 32GB+ RAM, 8+ CPU cores, 100GB+ disk
- **Software**: libvirt, QEMU/KVM, `virsh`, nested virtualization enabled
- **Network**: Outbound internet access (to download ISOs, OKD images)
- **Permissions**: User in `libvirt` group, or root access
Options:
- **Dedicated bare-metal machine** registered as a self-hosted GitHub Actions runner
- **Cloud VM with nested virt** (e.g., GCP n2-standard-8 with `--enable-nested-virtualization`)
- **Manual trigger only** — developer runs locally, CI just tracks pass/fail
### 6.4 Nightly CI job
```yaml
# .github/workflows/e2e-kvm.yml
name: E2E KVM Tests
on:
  schedule:
    - cron: '0 2 * * *' # 2 AM daily
  workflow_dispatch: # Manual trigger
jobs:
  kvm-tests:
    runs-on: [self-hosted, kvm] # Label for KVM-capable runners
    timeout-minutes: 90
    steps:
      - uses: actions/checkout@v4
      - name: Run KVM E2E tests
        run: cargo test -p harmony-e2e-tests -- --ignored --test-threads=1
        env:
          RUST_LOG: info
          HARMONY_KVM_URI: qemu:///system
      - name: Cleanup VMs on failure
        if: failure()
        run: |
          virsh list --all --name | grep e2e | xargs -I {} virsh destroy {} || true
          virsh list --all --name | grep e2e | xargs -I {} virsh undefine {} --remove-all-storage || true
```
### 6.5 Test resource management
KVM tests create real resources that must be cleaned up even on failure. Implement a test fixture pattern:
```rust
struct KvmTestFixture {
    executor: KvmExecutor,
    vms: Vec<String>,
    networks: Vec<String>,
}

impl KvmTestFixture {
    fn track_vm(&mut self, name: &str) { self.vms.push(name.to_string()); }
    fn track_network(&mut self, name: &str) { self.networks.push(name.to_string()); }
}

impl Drop for KvmTestFixture {
    fn drop(&mut self) {
        // Best-effort cleanup of all tracked resources
        for vm in &self.vms {
            let _ = std::process::Command::new("virsh")
                .args(["destroy", vm]).output();
            let _ = std::process::Command::new("virsh")
                .args(["undefine", vm, "--remove-all-storage"]).output();
        }
        for net in &self.networks {
            let _ = std::process::Command::new("virsh")
                .args(["net-destroy", net]).output();
            let _ = std::process::Command::new("virsh")
                .args(["net-undefine", net]).output();
        }
    }
}
```
## Deliverables
- [ ] `test_linux_vm_boots_from_iso` — passing KVM smoke test
- [ ] `test_okd_ha_cluster_on_kvm` — full OKD installation test
- [ ] `KvmTestFixture` with resource cleanup on test failure
- [ ] Nightly CI job on KVM-capable runner
- [ ] Force-cleanup script for leaked VMs/networks
- [ ] Documentation: how to set up a KVM runner for E2E tests


@@ -0,0 +1,57 @@
# Phase 7: OPNsense & Bare-Metal Network Automation
## Goal
Complete the OPNsense API coverage and Brocade switch integration to enable fully automated bare-metal HA cluster provisioning with LAGG, CARP VIP, multi-WAN, and BINAT.
## Status: In Progress
### Done
- opnsense-codegen pipeline: XML model parsing, IR generation, Rust code generation with serde helpers
- 11 generated API modules covering firewall, interfaces (VLAN, LAGG, VIP), HAProxy, DNSMasq, Caddy, WireGuard
- 9 OPNsense Scores: VlanScore, LaggScore, VipScore, DnatScore, FirewallRuleScore, OutboundNatScore, BinatScore, NodeExporterScore, OPNsenseShellCommandScore
- 13 opnsense-config modules with high-level Rust APIs
- E2E tests for DNSMasq CRUD, HAProxy service lifecycle, interface settings
- Brocade branch with VLAN CRUD, interface speed config, port-channel management
### Remaining
#### UpdateHostScore (new)
A Score that updates a host's configuration in the DHCP server and prepares it for PXE boot. Core responsibilities:
1. **Update MAC address in DHCP**: When hardware is replaced or NICs are swapped, update the DHCP static mapping with the new MAC address(es). This is the most critical function — without it, PXE boot targets the wrong hardware.
2. **Configure PXE boot options**: Set next-server, boot filename (BIOS/UEFI/iPXE) for the specific host.
3. **Host network setup for LAGG LACP 802.3ad**: Configure the host's network interfaces for link aggregation. This replaces the current `HostNetworkConfigurationScore` approach which only handles bond creation on the host side — the new approach must also create the corresponding LAGG interface on OPNsense and configure the Brocade switch port-channel with LACP.
The existing `DhcpHostBindingScore` handles bulk MAC-to-IP registration but lacks the ability to _update_ an existing mapping (the `remove_static_mapping` and `list_static_mappings` methods on `OPNSenseFirewall` are still `todo!()`).
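Once `list_static_mappings` and `remove_static_mapping` are filled in, the idempotent update could follow a list → remove → re-add pattern. A minimal sketch with stand-in types (`StaticMapping`, `DhcpStore`, and all method shapes below are hypothetical, not the real `OPNSenseFirewall` API):

```rust
// Hypothetical sketch of the idempotent MAC-update flow that
// remove_static_mapping/list_static_mappings would enable.
#[derive(Debug, Clone, PartialEq)]
pub struct StaticMapping {
    pub hostname: String,
    pub mac: String,
    pub ip: String,
}

#[derive(Default)]
pub struct DhcpStore {
    mappings: Vec<StaticMapping>,
}

impl DhcpStore {
    pub fn list_static_mappings(&self) -> &[StaticMapping] {
        &self.mappings
    }

    pub fn remove_static_mapping(&mut self, hostname: &str) {
        self.mappings.retain(|m| m.hostname != hostname);
    }

    pub fn add_static_mapping(&mut self, mapping: StaticMapping) {
        self.mappings.push(mapping);
    }

    /// Idempotent update: drop any stale mapping for the host, then
    /// re-add it with the new MAC. Safe to run repeatedly.
    pub fn update_host_mac(&mut self, hostname: &str, new_mac: &str, ip: &str) {
        self.remove_static_mapping(hostname);
        self.add_static_mapping(StaticMapping {
            hostname: hostname.to_string(),
            mac: new_mac.to_string(),
            ip: ip.to_string(),
        });
    }
}
```

The key property is that a second run with the same arguments leaves exactly one mapping in place, which is what PXE retargeting after a hardware swap requires.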
#### Merge Brocade branch
The `feat/brocade-client-add-vlans` branch has breaking API changes:
- `configure_interfaces` now takes `Vec<InterfaceConfig>` instead of `Vec<(String, PortOperatingMode)>`
- `InterfaceType` changed from `Ethernet(String)` to specific variants (TenGigabitEthernet, FortyGigabitEthernet)
- `harmony/src/infra/brocade.rs` needs adaptation to the new API
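The adaptation work can be sketched as a conversion from the old tuple-based call sites to the new typed config. The type shapes below mirror the names mentioned above but are assumptions, and the speed-selection rule is purely illustrative:

```rust
// Illustrative stand-ins for the new (breaking) Brocade API types.
#[derive(Debug, Clone, PartialEq)]
pub enum PortOperatingMode {
    Access,
    Trunk,
}

#[derive(Debug, Clone, PartialEq)]
pub enum InterfaceType {
    TenGigabitEthernet(String),
    FortyGigabitEthernet(String),
}

#[derive(Debug, Clone, PartialEq)]
pub struct InterfaceConfig {
    pub interface: InterfaceType,
    pub mode: PortOperatingMode,
}

/// Adapt an old-style `(name, mode)` pair to the new typed config.
/// Here a "40g" prefix selects the 40G variant; the real mapping would
/// come from the switch inventory, not the interface name.
pub fn adapt(old: (String, PortOperatingMode)) -> InterfaceConfig {
    let (name, mode) = old;
    let interface = if name.starts_with("40g") {
        InterfaceType::FortyGigabitEthernet(name)
    } else {
        InterfaceType::TenGigabitEthernet(name)
    };
    InterfaceConfig { interface, mode }
}
```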
#### HostNetworkConfigurationScore rework
The current implementation (`harmony/src/modules/okd/host_network.rs`) has documented limitations:
- Not idempotent (running twice may duplicate bond configs)
- No rollback logic
- Doesn't wait for switch config propagation
- All tests are `#[ignore]` due to requiring interactive TTY (inquire prompts)
- Doesn't create LAGG on OPNsense — only bonds on the host and port-channels on the switch
For LAGG LACP 802.3ad the flow needs to be:
1. Create LAGG interface on OPNsense (LaggScore already exists)
2. Create port-channel on Brocade switch (BrocadeSwitchConfigurationScore)
3. Configure bond on host via NMState (existing NetworkManager)
4. All three must be coordinated and idempotent
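The four steps above could be sketched with "ensure" semantics, where each step is a no-op when already applied, making the whole flow idempotent. All types and step names below are illustrative placeholders:

```rust
// Hypothetical sketch of the coordinated LAGG/LACP flow. A change counter
// distinguishes real changes from no-op re-runs.
#[derive(Default, Debug)]
pub struct LaggState {
    pub opnsense_lagg: bool,        // step 1: LaggScore on OPNsense
    pub brocade_port_channel: bool, // step 2: BrocadeSwitchConfigurationScore
    pub host_bond: bool,            // step 3: NMState bond on the host
    pub apply_count: u32,           // how many real changes were made
}

fn ensure(flag: &mut bool, count: &mut u32) {
    if !*flag {
        *flag = true;
        *count += 1; // only count actual changes, not re-runs
    }
}

pub fn ensure_lagg(state: &mut LaggState) {
    ensure(&mut state.opnsense_lagg, &mut state.apply_count);
    ensure(&mut state.brocade_port_channel, &mut state.apply_count);
    ensure(&mut state.host_bond, &mut state.apply_count);
}
```

Running the flow twice makes no additional changes, which is the coordination property step 4 asks for.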
#### Fill remaining OPNsense `todo!()` stubs
- `OPNSenseFirewall::remove_static_mapping` — needed by UpdateHostScore
- `OPNSenseFirewall::list_static_mappings` — needed for idempotent updates
- `OPNSenseFirewall::Firewall` trait (add_rule, remove_rule, list_rules) — stub only
- `OPNSenseFirewall::dns::register_dhcp_leases` — stub only


@@ -0,0 +1,56 @@
# Phase 8: HA OKD Production Deployment
## Goal
Deploy a production HAClusterTopology OKD cluster in UPI mode with full LAGG LACP 802.3ad, CARP VIP, multi-WAN, and BINAT for customer traffic — entirely automated through Harmony Scores.
## Status: Not Started
## Prerequisites
- Phase 7 (OPNsense & Bare-Metal) substantially complete
- Brocade branch merged and adapted
- UpdateHostScore implemented and tested
## Deployment Stack
### Network Layer (OPNsense)
- **LAGG interfaces** (802.3ad LACP) for all cluster hosts — redundant links via LaggScore
- **CARP VIPs** for high availability — failover IPs via VipScore
- **Multi-WAN** configuration — multiple uplinks with gateway groups
- **BINAT** for customer-facing IPs — 1:1 NAT via BinatScore
- **Firewall rules** per-customer with proper source/dest filtering via FirewallRuleScore
- **Outbound NAT** for cluster egress via OutboundNatScore
### Switch Layer (Brocade)
- **VLAN** per network segment (management, cluster, customer, storage)
- **Port-channels** (LACP) matching OPNsense LAGG interfaces
- **Interface speed** configuration for 10G/40G links
### Host Layer
- **PXE boot** via UpdateHostScore (MAC → DHCP → TFTP → iPXE → SCOS)
- **Network bonds** (LACP) via reworked HostNetworkConfigurationScore
- **NMState** for persistent bond configuration on OpenShift nodes
### Cluster Layer
- OKD UPI installation via existing OKDSetup01-04 Scores
- HAProxy load balancer for API and ingress via LoadBalancerScore
- DNS via OKDDnsScore
- Monitoring via NodeExporterScore + Prometheus stack
## New Scores Needed
1. **UpdateHostScore** — Update MAC in DHCP, configure PXE boot, prepare host network for LAGG LACP
2. **MultiWanScore** — Configure OPNsense gateway groups for multi-WAN failover
3. **CustomerBinatScore** (optional) — Higher-level Score combining BinatScore + FirewallRuleScore + DnatScore per customer
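The optional composition in item 3 could look roughly like the sketch below, assuming simplified stand-ins for the three inner scores (field layouts are hypothetical):

```rust
// Simplified stand-ins for the existing per-customer scores.
#[derive(Debug, Clone)]
pub struct BinatScore { pub external_ip: String, pub internal_ip: String }
#[derive(Debug, Clone)]
pub struct FirewallRuleScore { pub description: String }
#[derive(Debug, Clone)]
pub struct DnatScore { pub port: u16 }

/// Higher-level score bundling the per-customer network plumbing.
pub struct CustomerBinatScore {
    pub customer: String,
    pub binat: BinatScore,
    pub firewall_rule: FirewallRuleScore,
    pub dnat: Option<DnatScore>,
}

impl CustomerBinatScore {
    /// Expand into the individual score descriptions that would be applied.
    pub fn expand(&self) -> Vec<String> {
        let mut out = vec![
            format!("customer {}: binat {} -> {}",
                self.customer, self.binat.external_ip, self.binat.internal_ip),
            format!("customer {}: fw-rule {}",
                self.customer, self.firewall_rule.description),
        ];
        if let Some(d) = &self.dnat {
            out.push(format!("customer {}: dnat :{}", self.customer, d.port));
        }
        out
    }
}
```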
## Validation Checklist
- [ ] All hosts PXE boot successfully after MAC update
- [ ] LAGG/LACP active on all host links (verify via `teamdctl` or `nmcli`)
- [ ] CARP VIPs fail over within expected time window
- [ ] BINAT customers reachable from external networks
- [ ] Multi-WAN failover tested (pull one uplink, verify traffic shifts)
- [ ] Full OKD installation completes end-to-end
- [ ] Cluster API accessible via CARP VIP
- [ ] Customer workloads routable via BINAT


@@ -0,0 +1,125 @@
# Phase 9: SSO + Config System Hardening
## Goal
Make the Zitadel + OpenBao SSO config management stack production-ready, well-tested, and reusable across deployments. The `harmony_sso` example demonstrates the full loop: deploy infrastructure, authenticate via SSO, store and retrieve config -- all in one `cargo run`.
## Current State (as of `feat/opnsense-codegen`)
The SSO example works end-to-end:
- k3d cluster + OpenBao + Zitadel deployed via Scores
- `OpenbaoSetupScore`: init, unseal, policies, userpass, JWT auth
- `ZitadelSetupScore`: project + device-code app provisioning via Management API (PAT auth)
- JWT exchange: Zitadel id_token → OpenBao client token via `/v1/auth/jwt/login`
- Device flow triggers in terminal, user logs in via browser, config stored in OpenBao KV v2
- CoreDNS patched for in-cluster hostname resolution (K3sFamily only)
- Discovery cache invalidation after CRD installation
- Session caching with TTL
### What's solid
- **Score composition**: 4 Scores orchestrate the full stack in ~280 lines
- **Config trait**: clean `Serialize + Deserialize + JsonSchema`, developer doesn't see OpenBao or Zitadel
- **Auth chain transparency**: token → cached → OIDC device flow → userpass, right thing happens
- **Idempotency**: all Scores safe to re-run, cached sessions skip login
### What needs work
See tasks below.
## Tasks
### 9.1 Builder pattern for `OpenbaoSecretStore` — HIGH
**Problem**: `OpenbaoSecretStore::new()` has 11 positional arguments. Adding JWT params made it worse. Callers pass `None, None, None, None` for unused options.
**Fix**: Replace with a builder:
```rust
OpenbaoSecretStore::builder()
    .url("http://127.0.0.1:8200")
    .kv_mount("secret")
    .skip_tls(true)
    .zitadel_sso("http://sso.harmony.local:8080", "client-id-123")
    .jwt_auth("harmony-developer", "jwt")
    .build()
    .await?
```
**Impact**: All callers updated (lib.rs, openbao_chain example, harmony_sso example). Breaking API change.
**Files**: `harmony_secret/src/store/openbao.rs`, all callers
### 9.2 Fix ZitadelScore PG readiness — HIGH
**Problem**: `ZitadelScore` calls `topology.get_endpoint()` immediately after deploying the CNPG Cluster CR. The PG `-rw` service takes 15-30s to appear. This forces a retry loop in the caller (the example).
**Fix**: Add a wait loop inside `ZitadelScore`'s interpret, after `topology.deploy(&pg_config)`, that polls for the `-rw` service to exist before calling `get_endpoint()`. Use `K8sClient::get_resource::<Service>()` with a poll loop.
**Impact**: Eliminates the retry wrapper in the harmony_sso example and any other Zitadel consumer.
**Files**: `harmony/src/modules/zitadel/mod.rs`
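The wait loop can be sketched generically. This synchronous version uses only std and abstracts the probe as a closure; the real implementation would poll `K8sClient::get_resource::<Service>()` inside an async loop with `tokio::time::sleep`:

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Poll `probe` until it returns true or `timeout` elapses.
pub fn wait_for<F: FnMut() -> bool>(
    mut probe: F,
    timeout: Duration,
    interval: Duration,
) -> Result<(), String> {
    let deadline = Instant::now() + timeout;
    loop {
        if probe() {
            return Ok(()); // e.g. the `-rw` Service now exists
        }
        if Instant::now() >= deadline {
            return Err("timed out waiting for resource".to_string());
        }
        sleep(interval);
    }
}
```

For the PG readiness case the probe would check for the `-rw` service, the timeout would be on the order of a minute, and the interval a few seconds.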
### 9.3 `CoreDNSRewriteScore` — MEDIUM
**Problem**: CoreDNS patching logic lives in the harmony_sso example. It's a general pattern: any service with ingress-based Host routing needs in-cluster DNS resolution.
**Fix**: Extract into `harmony/src/modules/k8s/coredns.rs` as a proper Score:
```rust
pub struct CoreDNSRewriteScore {
    pub rewrites: Vec<(String, String)>, // (hostname, service FQDN)
}
impl<T: Topology + K8sclient> Score<T> for CoreDNSRewriteScore { ... }
```
K3sFamily only. No-op on OpenShift. Idempotent.
**Files**: `harmony/src/modules/k8s/coredns.rs` (new), `harmony/src/modules/k8s/mod.rs`
### 9.4 Integration tests for Scores — MEDIUM
**Problem**: Zero tests for `OpenbaoSetupScore`, `ZitadelSetupScore`, `CoreDNSRewriteScore`. The Scores are testable against a running k3d cluster.
**Fix**: Add `#[ignore]` integration tests that require a running cluster:
- `test_openbao_setup_score`: deploy OpenBao + run setup, verify KV works
- `test_zitadel_setup_score`: deploy Zitadel + run setup, verify project/app exist
- `test_config_round_trip`: store + retrieve config via SSO-authenticated OpenBao
Run with `cargo test -- --ignored` after deploying the example.
**Files**: `harmony/tests/integration/` (new directory)
### 9.5 Remove `resolve()` DNS hack — LOW
**Problem**: `ZitadelOidcAuth::http_client()` hardcodes `resolve(host, 127.0.0.1:port)`. This only works for local k3d development.
**Fix**: Make it configurable. Add an optional `resolve_to: Option<SocketAddr>` field to `ZitadelOidcAuth`. The example passes `Some(127.0.0.1:8080)` for k3d; production passes `None` (uses real DNS). Or better: detect whether the host resolves and only apply the override if it doesn't.
**Files**: `harmony_secret/src/store/zitadel.rs`
### 9.6 Typed Zitadel API client — LOW
**Problem**: `ZitadelSetupScore` uses hand-written JSON with string parsing for Management API calls. No type safety on request/response.
**Fix**: Create typed request/response structs for the Management API v1 endpoints used (projects, apps, users). Use `serde` for serialization. This doesn't need to be a full API client -- just the endpoints we use.
**Files**: `harmony/src/modules/zitadel/api.rs` (new)
### 9.7 Capability traits for secret vault + identity — FUTURE
**Problem**: `OpenbaoScore` and `ZitadelScore` are tool-specific. No capability abstraction for "I need a secret vault" or "I need an identity provider".
**Fix**: Design `SecretVault` and `IdentityProvider` capability traits on topologies. This is a significant architectural decision that needs an ADR.
**Blocked by**: Real-world use of a second implementation (e.g., HashiCorp Vault, Keycloak) to validate the abstraction boundary.
### 9.8 Auto-unseal for OpenBao — FUTURE
**Problem**: Every pod restart requires manual unseal. `OpenbaoSetupScore` handles this, but requires re-running the Score.
**Fix**: Configure Transit auto-unseal (using a second OpenBao/Vault instance) or cloud KMS auto-unseal. This is an operational concern that should be configurable in `OpenbaoSetupScore`.
## Relationship to Other Phases
- **Phase 1** (config crate): SSO flow builds directly on `harmony_config` + `StoreSource<OpenbaoSecretStore>`. Phase 1 task 1.4 is now **complete** via the harmony_sso example.
- **Phase 2** (migrate to harmony_config): The 19 `SecretManager` call sites should migrate to `ConfigManager` with the OpenbaoSecretStore backend. The SSO flow validates this pattern works.
- **Phase 5** (E2E tests): The harmony_sso example is a candidate for the first E2E test -- it deploys k3d, exercises multiple Scores, and verifies config storage.


@@ -0,0 +1,49 @@
# Phase 10: Firewall Pair Topology & HA Firewall Automation
## Goal
Provide first-class support for managing OPNsense (and future) HA firewall pairs through a higher-order topology, including CARP VIP orchestration, per-device config differentiation, and integration testing.
## Current State
`FirewallPairTopology` is implemented as a concrete wrapper around two `OPNSenseFirewall` instances. It applies uniform scores to both firewalls and differentiates CARP VIP advskew (primary=0, backup=configurable). All existing OPNsense scores (Lagg, Vlan, Firewall Rules, DNAT, BINAT, Outbound NAT, DHCP) work with the pair topology. QC1 uses it for its NT firewall pair.
## Tasks
### 10.1 Generic FirewallPair over a capability trait
**Priority**: MEDIUM
**Status**: Not started
`FirewallPairTopology` is currently concrete over `OPNSenseFirewall`. This breaks extensibility — a pfSense or VyOS firewall pair would need a separate type. Introduce a `FirewallAppliance` capability trait that `OPNSenseFirewall` implements, and make `FirewallPairTopology<T: FirewallAppliance>` generic. The blanket-impl pattern from ADR-015 then gives automatic pair support for any appliance type.
Key challenge: the trait needs to expose enough for `CarpVipScore` to configure VIPs with per-device advskew, without leaking OPNsense-specific APIs.
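A minimal sketch of what that boundary could look like, assuming a hypothetical `set_carp_advskew` method (the real trait surface is still to be designed and must stay appliance-agnostic):

```rust
/// Hypothetical capability trait for an HA-capable firewall appliance.
pub trait FirewallAppliance {
    fn set_carp_advskew(&mut self, vip: &str, advskew: u8);
}

/// Generic pair topology over any appliance implementing the trait.
pub struct FirewallPairTopology<T: FirewallAppliance> {
    pub primary: T,
    pub backup: T,
    pub backup_advskew: u8,
}

impl<T: FirewallAppliance> FirewallPairTopology<T> {
    /// Configure a CARP VIP on both devices with differentiated advskew
    /// (primary = 0, backup = configurable).
    pub fn configure_vip(&mut self, vip: &str) {
        self.primary.set_carp_advskew(vip, 0);
        self.backup.set_carp_advskew(vip, self.backup_advskew);
    }
}
```

With this shape, `OPNSenseFirewall` would implement `FirewallAppliance`, and a pfSense or VyOS implementation would get pair support without a new topology type.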
### 10.2 Delegation macro for higher-order topologies
**Priority**: MEDIUM
**Status**: Not started
The "delegate to both" pattern used by uniform pair scores is pure boilerplate. Every `Score<FirewallPairTopology>` impl for uniform scores follows the same structure: create the inner `Score<OPNSenseFirewall>` interpret, execute it against the primary, then against the backup.
Design a proc macro (e.g., `#[derive(DelegatePair)]` or `delegate_score_to_pair!`) that generates these impls automatically. This would also apply to `DecentralizedTopology` (delegate to all sites) and future higher-order topologies.
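A `macro_rules!` version can illustrate the shape such a macro would generate. The real design calls for a proc macro, and every name below (`Pair`, `delegate_score_to_pair!`, the `apply` method standing in for the Score interpret) is illustrative:

```rust
/// Stand-in for a two-device higher-order topology; each device is
/// modeled as a Vec<String> of applied config lines.
pub struct Pair<T> {
    pub primary: T,
    pub backup: T,
}

// Generates the "run on primary, then backup" delegation for a score type.
macro_rules! delegate_score_to_pair {
    ($score:ty, $apply:ident) => {
        impl $score {
            /// Generated impl: apply the single-appliance logic to the
            /// primary first, then to the backup.
            pub fn apply_to_pair(&self, pair: &mut Pair<Vec<String>>) {
                self.$apply(&mut pair.primary);
                self.$apply(&mut pair.backup);
            }
        }
    };
}

pub struct DnsScore {
    pub record: String,
}

impl DnsScore {
    fn apply(&self, device: &mut Vec<String>) {
        device.push(self.record.clone());
    }
}

delegate_score_to_pair!(DnsScore, apply);
```

The same expansion scheme would extend to `DecentralizedTopology` by iterating over all sites instead of a fixed primary/backup pair.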
### 10.3 XMLRPC sync support
**Priority**: LOW
**Status**: Not started
Add optional `FirewallPairTopology::sync_from_primary()` that triggers OPNsense XMLRPC config sync from primary to backup. Useful for settings that must be identical and don't need per-device differentiation. Not blocking — independent application to both firewalls achieves the same config state.
### 10.4 Integration test with CARP/LACP failover
**Priority**: LOW
**Status**: Not started
Extend the existing OPNsense example deployment to create a firewall pair test fixture:
- Two OPNsense VMs in CARP configuration
- A third VM as a client verifying connectivity
- Automated failover testing: disconnect primary's virtual NIC, verify CARP failover to backup, reconnect, verify failback
- LACP failover: disconnect one LAGG member, verify traffic continues on remaining member
This builds on the KVM test harness from Phase 6.


@@ -0,0 +1,77 @@
# Phase 11: Named Config Instances & Cross-Namespace Access
## Goal
Allow multiple instances of the same config type within a single namespace, identified by name. Also allow explicit namespace specification when retrieving config items, enabling cross-deployment orchestration.
## Context
The current `harmony_config` system identifies config items by type only (`T::KEY` from `#[derive(Config)]`). This works for singletons but breaks when you need multiple instances of the same type:
- **Firewall pair**: primary and backup need separate `OPNSenseApiCredentials` (different API keys for different devices)
- **Worker nodes**: each BMC has its own `IpmiCredentials` with different username/password
- **Firewall administrators**: multiple `OPNSenseApiCredentials` with different permission levels
- **Multi-tenant**: customer firewalls vs. NationTech infrastructure firewalls need separate credential sets
Using separate namespaces per device is not the answer — a firewall pair belongs to a single deployment, and forcing namespace switches for each device in a pair adds unnecessary friction.
Cross-namespace access is a separate but related need: the NT firewall pair and C1 customer firewall pair live in separate namespaces (the customer manages their own firewall), but NationTech needs read access to the C1 namespace for BINAT coordination.
## Tasks
### 11.1 Named config instances within a namespace
**Priority**: HIGH
**Status**: Not started
Extend the `Config` trait and `ConfigManager` to support an optional instance name:
```rust
// Current (singleton): gets "OPNSenseApiCredentials" from the active namespace
let creds = ConfigManager::get::<OPNSenseApiCredentials>().await?;
// New (named): gets "OPNSenseApiCredentials/fw-primary" from the active namespace
let primary_creds = ConfigManager::get_named::<OPNSenseApiCredentials>("fw-primary").await?;
let backup_creds = ConfigManager::get_named::<OPNSenseApiCredentials>("fw-backup").await?;
```
Storage key becomes `{T::KEY}/{instance_name}` (or similar). The unnamed `get()` remains unchanged for backward compatibility.
This needs to work across all config sources:
- `EnvSource`: `HARMONY_CONFIG_{KEY}_{NAME}` (e.g., `HARMONY_CONFIG_OPNSENSE_API_CREDENTIALS_FW_PRIMARY`)
- `SqliteSource`: composite key `{key}/{name}`
- `StoreSource` (OpenBao): path `{namespace}/{key}/{name}`
- `PromptSource`: prompt includes the instance name for clarity
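The key composition for the storage and env-var cases could be sketched as follows. The env-var scheme assumes the type key is already in upper-snake form; the exact mangling rules are an assumption:

```rust
/// Composite storage key: `{T::KEY}` for singletons,
/// `{T::KEY}/{instance_name}` for named instances.
pub fn storage_key(type_key: &str, instance: Option<&str>) -> String {
    match instance {
        Some(name) => format!("{type_key}/{name}"),
        None => type_key.to_string(),
    }
}

/// EnvSource variable name: HARMONY_CONFIG_{KEY} or
/// HARMONY_CONFIG_{KEY}_{NAME}, with hyphens mapped to underscores.
pub fn env_var_name(type_key: &str, instance: Option<&str>) -> String {
    let mangle = |s: &str| s.replace('-', "_").to_uppercase();
    match instance {
        Some(name) => format!("HARMONY_CONFIG_{}_{}", mangle(type_key), mangle(name)),
        None => format!("HARMONY_CONFIG_{}", mangle(type_key)),
    }
}
```

Keeping `instance` an `Option` preserves backward compatibility: the unnamed `get()` path produces exactly the keys it does today.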
### 11.2 Cross-namespace config access
**Priority**: MEDIUM
**Status**: Not started
Allow specifying an explicit namespace when retrieving a config item:
```rust
// Get from the active namespace (current behavior)
let nt_creds = ConfigManager::get::<OPNSenseApiCredentials>().await?;
// Get from a specific namespace
let c1_creds = ConfigManager::get_from_namespace::<OPNSenseApiCredentials>("c1").await?;
```
This enables orchestration across deployments: the NT deployment can read C1's firewall credentials for BINAT coordination without switching the global namespace.
For the `StoreSource` (OpenBao), this maps to reading from a different KV path prefix. For `SqliteSource`, it maps to a different database file or a namespace column. For `EnvSource`, it could use a different prefix (`HARMONY_CONFIG_C1_{KEY}`).
### 11.3 Update FirewallPairTopology to use named configs
**Priority**: MEDIUM
**Status**: Blocked by 11.1
Once named config instances are available, update `FirewallPairTopology::opnsense_from_config()` to use them:
```rust
let primary_creds = ConfigManager::get_named::<OPNSenseApiCredentials>("fw-primary").await?;
let backup_creds = ConfigManager::get_named::<OPNSenseApiCredentials>("fw-backup").await?;
```
This removes the current limitation of shared credentials between primary and backup.


@@ -1,318 +0,0 @@
# Architecture Decision Record: Monitoring and Alerting Architecture
Initial Author: Willem Rolleman, Jean-Gabriel Carrier
Initial Date: March 9, 2026
Last Updated Date: March 9, 2026
## Status
Accepted
Supersedes: [ADR-010](010-monitoring-and-alerting.md)
## Context
Harmony needs a unified approach to monitoring and alerting across different infrastructure targets:
1. **Cluster-level monitoring**: Administrators managing entire Kubernetes/OKD clusters need to define cluster-wide alerts, receivers, and scrape targets.
2. **Tenant-level monitoring**: Multi-tenant clusters where teams are confined to namespaces need monitoring scoped to their resources.
3. **Application-level monitoring**: Developers deploying applications want zero-config monitoring that "just works" for their services.
The monitoring landscape is fragmented:
- **OKD/OpenShift**: Built-in Prometheus with AlertmanagerConfig CRDs
- **KubePrometheus**: Helm-based stack with PrometheusRule CRDs
- **RHOB (Red Hat Observability)**: Operator-based with MonitoringStack CRDs
- **Standalone Prometheus**: Raw Prometheus deployments
Each system has different CRDs, different installation methods, and different configuration APIs.
## Decision
We implement a **trait-based architecture with compile-time capability verification** that provides:
1. **Type-safe abstractions** via parameterized traits: `AlertReceiver<S>`, `AlertRule<S>`, `ScrapeTarget<S>`
2. **Compile-time topology compatibility** via the `Observability<S>` capability bound
3. **Three levels of abstraction**: Cluster, Tenant, and Application monitoring
4. **Pre-built alert rules** as functions that return typed structs
### Core Traits
```rust
// domain/topology/monitoring.rs

/// Marker trait for systems that send alerts (Prometheus, etc.)
pub trait AlertSender: Send + Sync + std::fmt::Debug {
    fn name(&self) -> String;
}

/// Defines how a receiver (Discord, Slack, etc.) builds its configuration
/// for a specific sender type
pub trait AlertReceiver<S: AlertSender>: std::fmt::Debug + Send + Sync {
    fn build(&self) -> Result<ReceiverInstallPlan, InterpretError>;
    fn name(&self) -> String;
    fn clone_box(&self) -> Box<dyn AlertReceiver<S>>;
}

/// Defines how an alert rule builds its PrometheusRule configuration
pub trait AlertRule<S: AlertSender>: std::fmt::Debug + Send + Sync {
    fn build_rule(&self) -> Result<serde_json::Value, InterpretError>;
    fn name(&self) -> String;
    fn clone_box(&self) -> Box<dyn AlertRule<S>>;
}

/// Capability that topologies implement to support monitoring
pub trait Observability<S: AlertSender> {
    async fn install_alert_sender(&self, sender: &S, inventory: &Inventory)
        -> Result<PreparationOutcome, PreparationError>;
    async fn install_receivers(&self, sender: &S, inventory: &Inventory,
        receivers: Option<Vec<Box<dyn AlertReceiver<S>>>>) -> Result<...>;
    async fn install_rules(&self, sender: &S, inventory: &Inventory,
        rules: Option<Vec<Box<dyn AlertRule<S>>>>) -> Result<...>;
    async fn add_scrape_targets(&self, sender: &S, inventory: &Inventory,
        scrape_targets: Option<Vec<Box<dyn ScrapeTarget<S>>>>) -> Result<...>;
    async fn ensure_monitoring_installed(&self, sender: &S, inventory: &Inventory)
        -> Result<...>;
}
```
### Alert Sender Types
Each monitoring stack is a distinct `AlertSender`:
| Sender | Module | Use Case |
|--------|--------|----------|
| `OpenshiftClusterAlertSender` | `monitoring/okd/` | OKD/OpenShift built-in monitoring |
| `KubePrometheus` | `monitoring/kube_prometheus/` | Helm-deployed kube-prometheus-stack |
| `Prometheus` | `monitoring/prometheus/` | Standalone Prometheus via Helm |
| `RedHatClusterObservability` | `monitoring/red_hat_cluster_observability/` | RHOB operator |
| `Grafana` | `monitoring/grafana/` | Grafana-managed alerting |
### Three Levels of Monitoring
#### 1. Cluster-Level Monitoring
For cluster administrators. Full control over monitoring infrastructure.
```rust
// examples/okd_cluster_alerts/src/main.rs
OpenshiftClusterAlertScore {
    sender: OpenshiftClusterAlertSender,
    receivers: vec![Box::new(DiscordReceiver { ... })],
    rules: vec![Box::new(alert_rules)],
    scrape_targets: Some(vec![Box::new(external_exporters)]),
}
```
**Characteristics:**
- Cluster-scoped CRDs and resources
- Can add external scrape targets (outside cluster)
- Manages Alertmanager configuration
- Requires cluster-admin privileges
#### 2. Tenant-Level Monitoring
For teams confined to namespaces. The topology determines tenant context.
```rust
// The topology's Observability impl handles namespace scoping
impl Observability<KubePrometheus> for K8sAnywhereTopology {
    async fn install_rules(&self, sender: &KubePrometheus, ...) {
        // Topology knows if it's tenant-scoped
        let namespace = self.get_tenant_config().await
            .map(|t| t.name)
            .unwrap_or("default");
        // Install rules in tenant namespace
    }
}
```
**Characteristics:**
- Namespace-scoped resources
- Cannot modify cluster-level monitoring config
- May have restricted receiver types
- Runtime validation of permissions (cannot be fully compile-time)
#### 3. Application-Level Monitoring
For developers. Zero-config, opinionated monitoring.
```rust
// modules/application/features/monitoring.rs
pub struct Monitoring {
    pub application: Arc<dyn Application>,
    pub alert_receiver: Vec<Box<dyn AlertReceiver<Prometheus>>>,
}

impl<T: Topology + Observability<Prometheus> + TenantManager + ...>
    ApplicationFeature<T> for Monitoring
{
    async fn ensure_installed(&self, topology: &T) -> Result<...> {
        // Auto-creates ServiceMonitor
        // Auto-installs Ntfy for notifications
        // Handles tenant namespace automatically
        // Wires up sensible defaults
    }
}
```
**Characteristics:**
- Automatic ServiceMonitor creation
- Opinionated notification channel (Ntfy)
- Tenant-aware via topology
- Minimal configuration required
## Rationale
### Why Generic Traits Instead of Unified Types?
Each monitoring stack (OKD, KubePrometheus, RHOB) has fundamentally different CRDs:
```rust
// OKD uses AlertmanagerConfig with different structure
AlertmanagerConfig { spec: { receivers: [...] } }
// RHOB uses secret references for webhook URLs
MonitoringStack { spec: { alertmanagerConfig: { discordConfigs: [{ apiURL: { key: "..." } }] } } }
// KubePrometheus uses Alertmanager CRD with different field names
Alertmanager { spec: { config: { receivers: [...] } } }
```
A unified type would either:
1. Be a lowest-common-denominator (loses stack-specific features)
2. Be a complex union type (hard to use, easy to misconfigure)
Generic traits let each stack express its configuration naturally while providing a consistent interface.
### Why Compile-Time Capability Bounds?
```rust
impl<T: Topology + Observability<OpenshiftClusterAlertSender>> Score<T>
    for OpenshiftClusterAlertScore { ... }
```
This fails at compile time if you try to use `OpenshiftClusterAlertScore` with a topology that doesn't support OKD monitoring. This prevents the "config-is-valid-but-platform-is-wrong" errors that Harmony was designed to eliminate.
### Why Not a MonitoringStack Abstraction (V2 Approach)?
The V2 approach proposed a unified `MonitoringStack` that hides sender selection:
```rust
// V2 approach - rejected
MonitoringStack::new(MonitoringApiVersion::V2CRD)
    .add_alert_channel(discord)
```
**Problems:**
1. Hides which sender you're using, losing compile-time guarantees
2. "Version selection" actually chooses between fundamentally different systems
3. Would need to handle all stack-specific features through a generic interface
The current approach is explicit: you choose `OpenshiftClusterAlertSender` and the compiler verifies compatibility.
### Why Runtime Validation for Tenants?
Tenant confinement is determined at runtime by the topology and K8s RBAC. We cannot know at compile time whether a user has cluster-admin or namespace-only access.
Options considered:
1. **Compile-time tenant markers** - Would require modeling entire RBAC hierarchy in types. Over-engineering.
2. **Runtime validation** - Current approach. Fails with clear K8s permission errors if insufficient access.
3. **No tenant support** - Would exclude a major use case.
Runtime validation is the pragmatic choice. The failure mode is clear (K8s API error) and occurs early in execution.
> Note: we will eventually have compile-time validation for such things. Rust macros are powerful enough to discover the actual capabilities we're dealing with, similar to the sqlx approach in its `query!` macros.
## Consequences
### Pros
1. **Type Safety**: Invalid configurations are caught at compile time
2. **Extensibility**: Adding a new monitoring stack requires implementing traits, not modifying core code
3. **Clear Separation**: Cluster/Tenant/Application levels have distinct entry points
4. **Reusable Rules**: Pre-built alert rules as functions (`high_pvc_fill_rate_over_two_days()`)
5. **CRD Accuracy**: Type definitions match actual Kubernetes CRDs exactly
### Cons
1. **Implementation Explosion**: `DiscordReceiver` implements `AlertReceiver<S>` for each sender type (3+ implementations)
2. **Learning Curve**: Understanding the trait hierarchy takes time
3. **clone_box Boilerplate**: Required for trait object cloning (3 lines per impl)
### Mitigations
- Implementation explosion is contained: each receiver type has O(senders) implementations, but receivers are rare compared to rules
- Learning curve is documented with examples at each level
- clone_box boilerplate is minimal and copy-paste
## Alternatives Considered
### Unified MonitoringStack Type
See "Why Not a MonitoringStack Abstraction" above. Rejected for losing compile-time safety.
### Helm-Only Approach
Use `HelmScore` directly for each monitoring deployment. Rejected because:
- No type safety for alert rules
- Cannot compose with application features
- No tenant awareness
### Separate Modules Per Use Case
Have `cluster_monitoring/`, `tenant_monitoring/`, `app_monitoring/` as separate modules. Rejected because:
- Massive code duplication
- No shared abstraction for receivers/rules
- Adding a feature requires three implementations
## Implementation Notes
### Module Structure
```
modules/monitoring/
├── mod.rs                          # Public exports
├── alert_channel/                  # Receivers (Discord, Webhook)
├── alert_rule/                     # Rules and pre-built alerts
│   ├── prometheus_alert_rule.rs
│   └── alerts/                     # Library of pre-built rules
│       ├── k8s/                    # K8s-specific (pvc, pod, memory)
│       └── infra/                  # Infrastructure (opnsense, dell)
├── okd/                            # OpenshiftClusterAlertSender
├── kube_prometheus/                # KubePrometheus
├── prometheus/                     # Prometheus
├── red_hat_cluster_observability/  # RHOB
├── grafana/                        # Grafana
├── application_monitoring/         # Application-level scores
└── scrape_target/                  # External scrape targets
```
### Adding a New Alert Sender
1. Create sender type: `pub struct MySender; impl AlertSender for MySender { ... }`
2. Implement `Observability<MySender>` for topologies that support it
3. Create CRD types in `crd/` subdirectory
4. Implement `AlertReceiver<MySender>` for existing receivers
5. Implement `AlertRule<MySender>` for `AlertManagerRuleGroup`
### Adding a New Alert Rule
```rust
pub fn my_custom_alert() -> PrometheusAlertRule {
PrometheusAlertRule::new("MyAlert", "up == 0")
.for_duration("5m")
.label("severity", "critical")
.annotation("summary", "Service is down")
}
```
No trait implementation needed - `AlertManagerRuleGroup` already handles conversion.
## Related ADRs
- [ADR-013](013-monitoring-notifications.md): Notification channel selection (ntfy)
- [ADR-011](011-multi-tenant-cluster.md): Multi-tenant cluster architecture

View File

@@ -1,21 +0,0 @@
[package]
name = "example-monitoring-v2"
edition = "2024"
version.workspace = true
readme.workspace = true
license.workspace = true
[dependencies]
harmony = { path = "../../harmony" }
harmony_cli = { path = "../../harmony_cli" }
harmony-k8s = { path = "../../harmony-k8s" }
harmony_types = { path = "../../harmony_types" }
kube = { workspace = true }
schemars = "0.8"
serde = { workspace = true, features = ["derive"] }
serde_json = { workspace = true }
serde_yaml = { workspace = true }
url = { workspace = true }
log = { workspace = true }
async-trait = { workspace = true }
k8s-openapi = { workspace = true }

View File

@@ -1,91 +0,0 @@
# Monitoring v2 - Improved Architecture
This example demonstrates the improved monitoring architecture that addresses the "WTF/minute" issues in the original design.
## Key Improvements
### 1. **Single AlertChannel Trait with Generic Sender**
The original design required 9-12 implementations for each alert channel (Discord, Webhook, etc.) - one for each sender type. The new design uses a single trait with generic sender parameterization:
pub trait AlertChannel<Sender: AlertSender> {
async fn install_config(&self, sender: &Sender) -> Result<Outcome, InterpretError>;
fn name(&self) -> String;
fn as_any(&self) -> &dyn std::any::Any;
}
**Benefits:**
- One Discord implementation works with all sender types
- Type safety at compile time
- No runtime dispatch overhead
### 2. **MonitoringStack Abstraction**
Instead of manually selecting CRDPrometheus vs KubePrometheus vs RHOBObservability, you now have a unified MonitoringStack that handles versioning:
let monitoring_stack = MonitoringStack::new(MonitoringApiVersion::V2CRD)
.set_namespace("monitoring")
.add_alert_channel(discord_receiver)
.set_scrape_targets(vec![...]);
**Benefits:**
- Single source of truth for monitoring configuration
- Easy to switch between monitoring versions
- Automatic version-specific configuration
### 3. **TenantMonitoringScore - True Composition**
The original monitoring_with_tenant example just put tenant and monitoring as separate items in a vec. The new design truly composes them:
let tenant_score = TenantMonitoringScore::new("test-tenant", monitoring_stack);
This creates a single score that:
- Has tenant context
- Has monitoring configuration
- Automatically installs monitoring scoped to tenant namespace
**Benefits:**
- No more "two separate things" confusion
- Automatic tenant namespace scoping
- Clear ownership: tenant owns its monitoring
### 4. **Versioned Monitoring APIs**
Clear versioning makes it obvious which monitoring stack you're using:
pub enum MonitoringApiVersion {
V1Helm, // Old Helm charts
V2CRD, // Current CRDs
V3RHOB, // RHOB (future)
}
**Benefits:**
- No guessing which API version you're using
- Easy to migrate between versions
- Backward compatibility path
## Comparison
### Original Design (monitoring_with_tenant)
- Manual selection of each component
- Manual installation of both components
- Need to remember to pass both to harmony_cli::run
- Monitoring not scoped to tenant automatically
### New Design (monitoring_v2)
- Single composed score
- One score does it all
## Usage
cd examples/monitoring_v2
cargo run
## Migration Path
To migrate from the old design to the new:
1. Replace individual alert channel implementations with AlertChannel<Sender>
2. Use MonitoringStack instead of manual *Prometheus selection
3. Use TenantMonitoringScore instead of separate TenantScore + monitoring scores
4. Select monitoring version via MonitoringApiVersion

View File

@@ -1,343 +0,0 @@
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use log::debug;
use serde::{Deserialize, Serialize};
use serde_yaml::{Mapping, Value};
use harmony::data::Version;
use harmony::interpret::{Interpret, InterpretError, InterpretName, InterpretStatus, Outcome};
use harmony::inventory::Inventory;
use harmony::score::Score;
use harmony::topology::{Topology, tenant::TenantManager};
use harmony_k8s::K8sClient;
use harmony_types::k8s_name::K8sName;
use harmony_types::net::Url;
pub trait AlertSender: Send + Sync + std::fmt::Debug {
fn name(&self) -> String;
fn namespace(&self) -> String;
}
#[derive(Debug)]
pub struct CRDPrometheus {
pub namespace: String,
pub client: Arc<K8sClient>,
}
impl AlertSender for CRDPrometheus {
fn name(&self) -> String {
"CRDPrometheus".to_string()
}
fn namespace(&self) -> String {
self.namespace.clone()
}
}
#[derive(Debug)]
pub struct RHOBObservability {
pub namespace: String,
pub client: Arc<K8sClient>,
}
impl AlertSender for RHOBObservability {
fn name(&self) -> String {
"RHOBObservability".to_string()
}
fn namespace(&self) -> String {
self.namespace.clone()
}
}
#[derive(Debug)]
pub struct KubePrometheus {
pub config: Arc<Mutex<KubePrometheusConfig>>,
}
impl Default for KubePrometheus {
fn default() -> Self {
Self::new()
}
}
impl KubePrometheus {
pub fn new() -> Self {
Self {
config: Arc::new(Mutex::new(KubePrometheusConfig::new())),
}
}
}
impl AlertSender for KubePrometheus {
fn name(&self) -> String {
"KubePrometheus".to_string()
}
fn namespace(&self) -> String {
self.config.lock().unwrap().namespace.clone().unwrap_or_else(|| "monitoring".to_string())
}
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct KubePrometheusConfig {
pub namespace: Option<String>,
#[serde(skip)]
pub alert_receiver_configs: Vec<AlertManagerChannelConfig>,
}
impl KubePrometheusConfig {
pub fn new() -> Self {
Self {
namespace: None,
alert_receiver_configs: Vec::new(),
}
}
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AlertManagerChannelConfig {
pub channel_receiver: serde_yaml::Value,
pub channel_route: serde_yaml::Value,
}
impl Default for AlertManagerChannelConfig {
fn default() -> Self {
Self {
channel_receiver: serde_yaml::Value::Mapping(Default::default()),
channel_route: serde_yaml::Value::Mapping(Default::default()),
}
}
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ScrapeTargetConfig {
pub service_name: String,
pub port: String,
pub path: String,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum MonitoringApiVersion {
V1Helm,
V2CRD,
V3RHOB,
}
#[derive(Debug, Clone)]
pub struct MonitoringStack {
pub version: MonitoringApiVersion,
pub namespace: String,
pub alert_channels: Vec<Arc<dyn AlertSender>>,
pub scrape_targets: Vec<ScrapeTargetConfig>,
}
impl MonitoringStack {
pub fn new(version: MonitoringApiVersion) -> Self {
Self {
version,
namespace: "monitoring".to_string(),
alert_channels: Vec::new(),
scrape_targets: Vec::new(),
}
}
pub fn set_namespace(mut self, namespace: &str) -> Self {
self.namespace = namespace.to_string();
self
}
pub fn add_alert_channel(mut self, channel: impl AlertSender + 'static) -> Self {
self.alert_channels.push(Arc::new(channel));
self
}
pub fn set_scrape_targets(mut self, targets: Vec<(&str, &str, String)>) -> Self {
self.scrape_targets = targets
.into_iter()
.map(|(name, port, path)| ScrapeTargetConfig {
service_name: name.to_string(),
port: port.to_string(),
path,
})
.collect();
self
}
}
pub trait AlertChannel<Sender: AlertSender> {
fn install_config(&self, sender: &Sender);
fn name(&self) -> String;
}
#[derive(Debug, Clone)]
pub struct DiscordWebhook {
pub name: K8sName,
pub url: Url,
pub selectors: Vec<HashMap<String, String>>,
}
impl DiscordWebhook {
fn get_config(&self) -> AlertManagerChannelConfig {
let mut route = Mapping::new();
route.insert(
Value::String("receiver".to_string()),
Value::String(self.name.to_string()),
);
route.insert(
Value::String("matchers".to_string()),
Value::Sequence(vec![Value::String("alertname!=Watchdog".to_string())]),
);
let mut receiver = Mapping::new();
receiver.insert(
Value::String("name".to_string()),
Value::String(self.name.to_string()),
);
let mut discord_config = Mapping::new();
discord_config.insert(
Value::String("webhook_url".to_string()),
Value::String(self.url.to_string()),
);
receiver.insert(
Value::String("discord_configs".to_string()),
Value::Sequence(vec![Value::Mapping(discord_config)]),
);
AlertManagerChannelConfig {
channel_receiver: Value::Mapping(receiver),
channel_route: Value::Mapping(route),
}
}
}
impl AlertChannel<CRDPrometheus> for DiscordWebhook {
fn install_config(&self, sender: &CRDPrometheus) {
debug!("Installing Discord webhook for CRDPrometheus in namespace: {}", sender.namespace());
debug!("Config: {:?}", self.get_config());
debug!("Installed!");
}
fn name(&self) -> String {
"discord-webhook".to_string()
}
}
impl AlertChannel<RHOBObservability> for DiscordWebhook {
fn install_config(&self, sender: &RHOBObservability) {
debug!("Installing Discord webhook for RHOBObservability in namespace: {}", sender.namespace());
debug!("Config: {:?}", self.get_config());
debug!("Installed!");
}
fn name(&self) -> String {
"webhook-receiver".to_string()
}
}
impl AlertChannel<KubePrometheus> for DiscordWebhook {
fn install_config(&self, sender: &KubePrometheus) {
debug!("Installing Discord webhook for KubePrometheus in namespace: {}", sender.namespace());
let config = sender.config.lock().unwrap();
let ns = config.namespace.clone().unwrap_or_else(|| "monitoring".to_string());
debug!("Namespace: {}", ns);
let mut config = sender.config.lock().unwrap();
config.alert_receiver_configs.push(self.get_config());
debug!("Installed!");
}
fn name(&self) -> String {
"discord-webhook".to_string()
}
}
fn default_monitoring_stack() -> MonitoringStack {
MonitoringStack::new(MonitoringApiVersion::V2CRD)
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct TenantMonitoringScore {
pub tenant_id: harmony_types::id::Id,
pub tenant_name: String,
#[serde(skip)]
#[serde(default = "default_monitoring_stack")]
pub monitoring_stack: MonitoringStack,
}
impl TenantMonitoringScore {
pub fn new(tenant_name: &str, monitoring_stack: MonitoringStack) -> Self {
Self {
tenant_id: harmony_types::id::Id::default(),
tenant_name: tenant_name.to_string(),
monitoring_stack,
}
}
}
impl<T: Topology + TenantManager> Score<T> for TenantMonitoringScore {
fn create_interpret(&self) -> Box<dyn Interpret<T>> {
Box::new(TenantMonitoringInterpret {
score: self.clone(),
})
}
fn name(&self) -> String {
format!("{} monitoring [TenantMonitoringScore]", self.tenant_name)
}
}
#[derive(Debug)]
pub struct TenantMonitoringInterpret {
pub score: TenantMonitoringScore,
}
#[async_trait::async_trait]
impl<T: Topology + TenantManager> Interpret<T> for TenantMonitoringInterpret {
async fn execute(
&self,
_inventory: &Inventory,
topology: &T,
) -> Result<Outcome, InterpretError> {
let tenant_config = topology.get_tenant_config().await.unwrap();
let tenant_ns = tenant_config.name.clone();
match self.score.monitoring_stack.version {
MonitoringApiVersion::V1Helm => {
debug!("Installing Helm monitoring for tenant {}", tenant_ns);
}
MonitoringApiVersion::V2CRD => {
debug!("Installing CRD monitoring for tenant {}", tenant_ns);
}
MonitoringApiVersion::V3RHOB => {
debug!("Installing RHOB monitoring for tenant {}", tenant_ns);
}
}
Ok(Outcome::success(format!(
"Installed monitoring stack for tenant {} with version {:?}",
self.score.tenant_name,
self.score.monitoring_stack.version
)))
}
fn get_name(&self) -> InterpretName {
InterpretName::Custom("TenantMonitoringInterpret")
}
fn get_version(&self) -> Version {
Version::from("1.0.0").unwrap()
}
fn get_status(&self) -> InterpretStatus {
InterpretStatus::SUCCESS
}
fn get_children(&self) -> Vec<harmony_types::id::Id> {
Vec::new()
}
}

9
book.toml Normal file
View File

@@ -0,0 +1,9 @@
[book]
title = "Harmony"
description = "Infrastructure orchestration that treats your platform like first-class code"
src = "docs"
build-dir = "book"
authors = ["NationTech"]
[output.html]
mathjax-support = false

11
build/book.sh Executable file
View File

@@ -0,0 +1,11 @@
#!/bin/sh
set -e
cd "$(dirname "$0")/.."
cargo install mdbook --locked
mdbook build
test -f book/index.html || (echo "ERROR: book/index.html not found" && exit 1)
test -f book/concepts.html || (echo "ERROR: book/concepts.html not found" && exit 1)
test -f book/guides/getting-started.html || (echo "ERROR: book/guides/getting-started.html not found" && exit 1)

View File

@@ -1,6 +1,8 @@
#!/bin/sh
set -e
cd "$(dirname "$0")/.."
rustc --version
cargo check --all-targets --all-features --keep-going
cargo fmt --check

16
build/ci.sh Executable file
View File

@@ -0,0 +1,16 @@
#!/bin/sh
set -e
cd "$(dirname "$0")/.."
BRANCH="${1:-main}"
echo "=== Running CI for branch: $BRANCH ==="
echo "--- Checking code ---"
./build/check.sh
echo "--- Building book ---"
./build/book.sh
echo "=== CI passed ==="

View File

@@ -13,8 +13,9 @@ If you're new to Harmony, start here:
See how to use Harmony to solve real-world problems.
- [**OPNsense VM Integration**](./use-cases/opnsense-vm-integration.md): Boot a real OPNsense firewall in a local KVM VM and configure it entirely through Harmony. Fully automated, zero manual steps — the flashiest demo. Requires Linux with KVM.
- [**PostgreSQL on Local K3D**](./use-cases/postgresql-on-local-k3d.md): Deploy a production-grade PostgreSQL cluster on a local K3D cluster. The fastest way to get started.
- [**OKD on Bare Metal**](./use-cases/okd-on-bare-metal.md): A detailed walkthrough of bootstrapping a high-availability OKD cluster from physical hardware.
- [**Deploy a Rust Web App**](./use-cases/deploy-rust-webapp.md): A quick guide to deploying a monitored, containerized web application to a Kubernetes cluster.
## 3. Component Catalogs
@@ -31,16 +32,7 @@ Ready to build your own components? These guides show you how.
- [**Writing a Score**](./guides/writing-a-score.md): Learn how to create your own `Score` and `Interpret` logic to define a new desired state.
- [**Writing a Topology**](./guides/writing-a-topology.md): Learn how to model a new environment (like AWS, GCP, or custom hardware) as a `Topology`.
- [**Adding Capabilities**](./guides/adding-capabilities.md): See how to add a `Capability` to your custom `Topology`.
- [**Coding Guide**](./coding-guide.md): Conventions and best practices for writing Harmony code.
## 5. Module Documentation
## 5. Architecture Decision Records
Deep dives into specific Harmony modules and features.
- [**Monitoring and Alerting**](./monitoring.md): Comprehensive guide to cluster, tenant, and application-level monitoring with support for OKD, KubePrometheus, RHOB, and more.
## 6. Architecture Decision Records
Important architectural decisions are documented in the `adr/` directory:
- [Full ADR Index](../adr/)
Harmony's design is documented through Architecture Decision Records (ADRs). See the [ADR Overview](./adr/README.md) for a complete index of all decisions.

54
docs/SUMMARY.md Normal file
View File

@@ -0,0 +1,54 @@
# Summary
[Harmony Documentation](./README.md)
- [Core Concepts](./concepts.md)
- [Getting Started Guide](./guides/getting-started.md)
## Use Cases
- [PostgreSQL on Local K3D](./use-cases/postgresql-on-local-k3d.md)
- [OPNsense VM Integration](./use-cases/opnsense-vm-integration.md)
- [OKD on Bare Metal](./use-cases/okd-on-bare-metal.md)
## Component Catalogs
- [Scores Catalog](./catalogs/scores.md)
- [Topologies Catalog](./catalogs/topologies.md)
- [Capabilities Catalog](./catalogs/capabilities.md)
## Developer Guides
- [Developer Guide](./guides/developer-guide.md)
- [Writing a Score](./guides/writing-a-score.md)
- [Writing a Topology](./guides/writing-a-topology.md)
- [Adding Capabilities](./guides/adding-capabilities.md)
## Configuration
- [Configuration](./concepts/configuration.md)
## Architecture Decision Records
- [ADR Overview](./adr/README.md)
- [000 · ADR Template](./adr/000-ADR-Template.md)
- [001 · Why Rust](./adr/001-rust.md)
- [002 · Hexagonal Architecture](./adr/002-hexagonal-architecture.md)
- [003 · Infrastructure Abstractions](./adr/003-infrastructure-abstractions.md)
- [004 · iPXE](./adr/004-ipxe.md)
- [005 · Interactive Project](./adr/005-interactive-project.md)
- [006 · Secret Management](./adr/006-secret-management.md)
- [007 · Default Runtime](./adr/007-default-runtime.md)
- [008 · Score Display Formatting](./adr/008-score-display-formatting.md)
- [009 · Helm and Kustomize Handling](./adr/009-helm-and-kustomize-handling.md)
- [010 · Monitoring and Alerting](./adr/010-monitoring-and-alerting.md)
- [011 · Multi-Tenant Cluster](./adr/011-multi-tenant-cluster.md)
- [012 · Project Delivery Automation](./adr/012-project-delivery-automation.md)
- [013 · Monitoring Notifications](./adr/013-monitoring-notifications.md)
- [015 · Higher Order Topologies](./adr/015-higher-order-topologies.md)
- [016 · Harmony Agent and Global Mesh](./adr/016-Harmony-Agent-And-Global-Mesh-For-Decentralized-Workload-Management.md)
- [017-1 · NATS Clusters Interconnection](./adr/017-1-Nats-Clusters-Interconnection-Topology.md)
- [018 · Template Hydration for Workload Deployment](./adr/018-Template-Hydration-For-Workload-Deployment.md)
- [019 · Network Bond Setup](./adr/019-Network-bond-setup.md)
- [020 · Interactive Configuration Crate](./adr/020-interactive-configuration-crate.md)
- [020-1 · Zitadel + OpenBao Secure Config Store](./adr/020-1-zitadel-openbao-secure-config-store.md)

View File

@@ -2,7 +2,7 @@
## Status
Proposed
Rejected : See ADR 020 ./020-interactive-configuration-crate.md
### TODO [#3](https://git.nationtech.io/NationTech/harmony/issues/3):

View File

@@ -0,0 +1,238 @@
Here are some rough notes on the previous design :
- We found an issue where there could be primary flapping when network latency is larger than the primary self fencing timeout.
- e.g. network latency to get nats ack is 30 seconds (extreme but can happen), and self-fencing happens after 50 seconds. Then at second 50 self-fencing would occur, and then at second 60 ack comes in. At this point we reject the ack as already failed because of timeout. Self fencing happens. But then network latency comes back down to 5 seconds and lets one successful heartbeat through, this means the primary comes back to healthy, and the same thing repeats, so the primary flaps.
- At least this does not cause split brain since the replica never times out and wins the leadership write since we validate strict write ordering and we force consensus on writes.
Also, we were seeing that the implementation became more complex. There is a lot of timers to handle and that becomes hard to reason about for edge cases.
So, we came up with a slightly different approach, inspired by k8s liveness probes.
We now want to use a failure and success threshold counter . However, on the replica side, all we can do is use a timer. The timer we can use is time since last primary heartbeat jetstream metadata timestamp. We could also try and mitigate clock skew by measuring time between internal clock and jetstream metadata timestamp when writing our own heartbeat (not for now, but worth thinking about, though I feel like it is useless).
So the current working design is this :
configure :
- number of consecutive success to mark the node as UP
- number of consecutive failures to mark the node as DOWN
- note that success/failure must be consecutive. One success in a row of failures is enough to keep service up. This allows for various configuration profiles, from very stict availability to very lenient depending on the number of failure tolerated and success required to keep the service up.
- failure_threshold at 100 will let a service fail (or timeout) 99/100 and stay up
- success_threshold at 100 will not bring back up a service until it has succeeded 100 heartbeat in a row
- failure threshold at 1 will fail the service at the slightest network latency spike/packet loss
- success threshold at 1 will bring the service up very quickly and may cause flapping in unstable network conditions
```
# heartbeat session log
# failure threshold : 3
# success threshold : 2
STATUS UP :
t=1 probe : fail f=1 s=0
t=2 probe : fail : f=2 s=0
t=3 probe : ok f=0 s=1
t=4 probe : fail f=1 s=0
```
Scenario :
failure threshold = 2
heartbeat timeout = 1s
total before fencing = 2 * 1 = 2s
staleness detection timer = 2*total before fencing
can we do this simple multiplication that staleness detection timer (time the replica waits since the last primary heartbeat before promoting itself) is double the time the replica will take before starting the fencing process.
---
### Context
We are designing a **Staleness-Based Failover Algorithm** for the Harmony Agent. The goal is to manage High Availability (HA) for stateful workloads (like PostgreSQL) across decentralized, variable-quality networks ("Micro Data Centers").
We are moving away from complex, synchronized clocks in favor of a **Counter-Based Liveness** approach (inspired by Kubernetes probes) for the Primary, and a **Time-Based Watchdog** for the Replica.
### 1. The Algorithm
#### The Primary (Self-Health & Fencing)
The Primary validates its own "License to Operate" via a heartbeat loop.
* **Loop:** Every `heartbeat_interval` (e.g., 1s), it attempts to write a heartbeat to NATS and check the local DB.
* **Counters:** It maintains `consecutive_failures` and `consecutive_successes`.
* **State Transition:**
* **To UNHEALTHY:** If `consecutive_failures >= failure_threshold`, the Primary **Fences Self** (stops DB, releases locks).
* **To HEALTHY:** If `consecutive_successes >= success_threshold`, the Primary **Un-fences** (starts DB, acquires locks).
* **Reset Logic:** A single success resets the failure counter to 0, and vice versa.
#### The Replica (Staleness Detection)
The Replica acts as a passive watchdog observing the NATS stream.
* **Calculation:** It calculates a `MaxStaleness` timeout.
$$ \text{MaxStaleness} = (\text{failure\_threshold} \times \text{heartbeat\_interval}) \times \text{SafetyMultiplier} $$
*(We use a SafetyMultiplier of 2 to ensure the Primary has definitely fenced itself before we take over).*
* **Action:** If `Time.now() - LastPrimaryHeartbeat > MaxStaleness`, the Replica assumes the Primary is dead and **Promotes Self**.
---
### 2. Configuration Trade-offs
The separation of `success` and `failure` thresholds allows us to tune the "personality" of the cluster.
#### Scenario A: The "Nervous" Cluster (High Sensitivity)
* **Config:** `failure_threshold: 1`, `success_threshold: 1`
* **Behavior:** Fails over immediately upon a single missed packet or slow disk write.
* **Pros:** Maximum availability for perfect networks.
* **Cons:** **High Flapping Risk.** In a residential network, a microwave turning on might cause a failover.
#### Scenario B: The "Tank" Cluster (High Stability)
* **Config:** `failure_threshold: 10`, `success_threshold: 1`
* **Behavior:** The node must be consistently broken for 10 seconds (assuming 1s interval) to give up.
* **Pros:** Extremely stable on bad networks (e.g., Starlink, 4G). Ignores transient spikes.
* **Cons:** **Slow Failover.** Users experience 10+ seconds of downtime before the Replica even *thinks* about taking over.
#### Scenario C: The "Sticky" Cluster (Hysteresis)
* **Config:** `failure_threshold: 5`, `success_threshold: 5`
* **Behavior:** Hard to kill, hard to bring back.
* **Pros:** Prevents "Yo-Yo" effects. If a node fails, it must prove it is *really* stable (5 clean checks in a row) before re-joining the cluster.
---
### 3. Failure Modes & Behavior Analysis
Here is how the algorithm handles specific edge cases:
#### Case 1: Immediate Outage (Power Cut / Kernel Panic)
* **Event:** Primary vanishes instantly. No more writes to NATS.
* **Primary:** Does nothing (it's dead).
* **Replica:** Sees the `LastPrimaryHeartbeat` timestamp age. Once it crosses `MaxStaleness`, it promotes itself.
* **Outcome:** Clean failover after the timeout duration.
#### Case 2: Network Instability (Packet Loss / Jitter)
* **Event:** The Primary fails to write to NATS for 2 cycles due to Wi-Fi interference, then succeeds on the 3rd.
* **Config:** `failure_threshold: 5`.
* **Primary:**
* $t=1$: Fail (Counter=1)
* $t=2$: Fail (Counter=2)
* $t=3$: Success (Counter resets to 0). **State remains HEALTHY.**
* **Replica:** Sees a gap in heartbeats but the timestamp never exceeds `MaxStaleness`.
* **Outcome:** No downtime, no failover. The system correctly identified this as noise, not failure.
#### Case 3: High Latency (The "Slow Death")
* **Event:** Primary is under heavy load; heartbeats take 1.5s to complete (interval is 1s).
* **Primary:** The `timeout` on the heartbeat logic triggers. `consecutive_failures` rises. Eventually, it hits `failure_threshold` and fences itself to prevent data corruption.
* **Replica:** Sees the heartbeats stop (or arrive too late). The timestamp ages out.
* **Outcome:** Primary fences self -> Replica waits for safety buffer -> Replica promotes. **Split-brain is avoided** because the Primary killed itself *before* the Replica acted (due to the SafetyMultiplier).
#### Case 4: Replica Network Partition
* **Event:** Replica loses internet connection; Primary is fine.
* **Replica:** Sees `LastPrimaryHeartbeat` age out (because it can't reach NATS). It *wants* to promote itself.
* **Constraint:** To promote, the Replica must write to NATS. Since it is partitioned, the NATS write fails.
* **Outcome:** The Replica remains in Standby (or fails to promote). The Primary continues serving traffic. **Cluster integrity is preserved.**
----
### Context & Use Case
We are implementing a High Availability (HA) Failover Strategy for decentralized "Micro Data Centers." The core challenge is managing stateful workloads (PostgreSQL) over unreliable networks.
We solve this using a **Local Fencing First** approach, backed by **NATS JetStream Strict Ordering** for the final promotion authority.
In CAP theorem terms, we are developing a CP system, intentionally sacrificing availability. In practical terms, we expect an average of two primary outages per year, with a failover delay of around 2 minutes. This translates to an uptime of over five nines. To be precise, 2 outages * 2 minutes = 4 minutes per year = 99.99924% uptime.
### The Algorithm: Local Fencing & Remote Promotion
The safety (data consistency) of the system relies on the time gap between the **Primary giving up (Fencing)** and the **Replica taking over (Promotion)**.
To avoid clock skew issues between agents and datastore (nats), all timestamps comparisons will be done using jetstream metadata. I.E. a harmony agent will never use `Instant::now()` to get a timestamp, it will use `my_last_heartbeat.metadata.timestamp` (conceptually).
#### 1. Configuration
* `heartbeat_timeout` (e.g., 1s): Max time allowed for a NATS write/DB check.
* `failure_threshold` (e.g., 2): Consecutive failures before self-fencing.
* `failover_timeout` (e.g., 5s): Time since last NATS update of Primary heartbeat before Replica promotes.
* This timeout must be carefully configured to allow enough time for the primary to fence itself (after `heartbeat_timeout * failure_threshold`) BEFORE the replica gets promoted to avoid a split brain with two primaries.
* Implementing this will rely on the actual deployment configuration. For example, a CNPG based PostgreSQL cluster might require a longer gap (such as 30s) than other technologies.
* Expires when `replica_heartbeat.metadata.timestamp - primary_heartbeat.metadata.timestamp > failover_timeout`
#### 2. The Primary (Self-Preservation)
The Primary is aggressive about killing itself.
* It attempts a heartbeat.
* If the network latency > `heartbeat_timeout`, the attempt is **cancelled locally** because the heartbeat did not make it back in time.
* This counts as a failure and increments the `consecutive_failures` counter.
* If `consecutive_failures` hit the threshold, **FENCING occurs immediately**. The database is stopped.
This means that the Primary will fence itself after `heartbeat_timeout * failure_threshold`.
#### 3. The Replica (The Watchdog)
The Replica is patient.
* It watches the NATS stream to measure if `replica_heartbeat.metadata.timestamp - primary_heartbeat.metadata.timestamp > failover_timeout`
* It only attempts promotion if the `failover_timeout` (5s) has passed.
* **Crucial:** Careful configuration of the failover_timeout is required. This is the only way to avoid a split brain in case of a network partition where the Primary cannot write its heartbeats in time anymore.
* In short, `failover_timeout` should be tuned to be `heartbeat_timeout * failure_threshold + safety_margin`. This `safety_margin` will vary by use case. For example, a CNPG cluster may need 30 seconds to demote a Primary to Replica when fencing is triggered, so `safety_margin` should be at least 30s in that setup.
Since we forcibly fail timeouts after `heartbeat_timeout`, we are guaranteed that the primary will have **started** the fencing process after `heartbeat_timeout * failure_threshold`.
But, in a network split scenario where the failed primary is still accessible by clients but cannot write its heartbeat successfully, there is no way to know if the demotion has actually **completed**.
For example, in a CNPG cluster, the failed Primary agent will attempt to change the CNPG cluster state to read-only. But if anything fails after that attempt (permission error, k8s api failure, CNPG bug, etc) it is possible that the PostgreSQL instance keeps accepting writes.
While this is not a theoretical failure of the agent's algorithm, this is a practical failure where data corruption occurs.
This can be fixed by detecting the demotion failure and escalating the fencing procedure aggressiveness. Harmony being an infrastructure orchestrator, it can easily exert radical measures if given the proper credentials, such as forcibly powering off a server, disconnecting its network in the switch configuration, forcibly kill a pod/container/process, etc.
However, these details are out of scope of this algorithm, as they simply fall under the "fencing procedure".
The implementation of the fencing procedure itself is not relevant. This algorithm's responsibility stops at calling the fencing procedure in the appropriate situation.
#### 4. The Demotion Handshake (Return to Normalcy)
When the original Primary recovers:
1. It becomes healthy locally but sees `current_primary = Replica`. It waits.
2. The Replica (current leader) detects the Original Primary is back (via NATS heartbeats).
3. Replica performs a **Clean Demotion**:
* Stops DB.
* Writes `current_primary = None` to NATS.
4. Original Primary sees `current_primary = None` and can launch the promotion procedure.
Depending on the implementation, the promotion procedure may require a transition phase. Typically, for a PostgreSQL use case the promoting primary will make sure it has caught up on WAL replication before starting to accept writes.
---
### Failure Modes & Behavior Analysis
#### Case 1: Immediate Outage (Power Cut)
* **Primary:** Dies instantly. Fencing is implicit (machine is off).
* **Replica:** Waits for `failover_timeout` (5s). Sees staleness. Promotes self.
* **Outcome:** Clean failover after 5s.
> **TODO:** Detail what happens when the primary comes back up. We will likely have to tie PostgreSQL's lifecycle (liveness/readiness probes) to the agent to ensure it does not come back up as primary.
#### Case 2: High Network Latency on the Primary (The "Split Brain" Trap)
* **Scenario:** Network latency spikes to 5s on the Primary's side, while the Replica's own connection to NATS remains healthy.
* **T=0 to T=2 (Primary):** Tries to write. Latency (5s) > Timeout (1s). Fails twice.
* **T=2 (Primary):** `consecutive_failures` = 2. **Primary Fences Self.** (Service is DOWN).
* **T=2 to T=5 (Cluster):** **Read-Only Phase.** No Primary exists.
* **T=5 (Replica):** `failover_timeout` reached. Replica promotes self.
* **Outcome:** Safe failover. The "Read-Only Gap" (T=2 to T=5) ensures no Split Brain occurred.
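The arithmetic behind this timeline can be made explicit: the replica must never promote before the primary's worst-case fencing deadline. A hypothetical helper (the function names are illustrative, not part of the agent):

```rust
use std::time::Duration;

// The primary self-fences no later than heartbeat_timeout * failure_threshold.
fn fence_deadline(heartbeat_timeout: Duration, failure_threshold: u32) -> Duration {
    heartbeat_timeout * failure_threshold
}

// The configuration is safe only if failover_timeout leaves a read-only gap
// after the primary's worst-case fencing time; None means unsafe.
fn read_only_gap(
    heartbeat_timeout: Duration,
    failure_threshold: u32,
    failover_timeout: Duration,
) -> Option<Duration> {
    failover_timeout.checked_sub(fence_deadline(heartbeat_timeout, failure_threshold))
}

fn main() {
    // Case 2 numbers: timeout 1s, threshold 2, failover 5s -> a 3s gap (T=2 to T=5).
    let gap = read_only_gap(Duration::from_secs(1), 2, Duration::from_secs(5));
    assert_eq!(gap, Some(Duration::from_secs(3)));
    println!("read-only gap: {:?}", gap);
}
```

Any CNPG-style demotion delay would further shrink this gap, which is why the safety margin discussed above must be added on top.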
#### Case 3: Replica Network Lag (False Positive)
* **Scenario:** Replica has high latency, greater than `failover_timeout`; Primary is fine.
* **Replica:** Thinks Primary is dead. Tries to promote by setting `cluster_state.current_primary = replica_id`.
* **NATS:** Rejects the write: the Primary is still updating the key successfully, so the Replica's compare-and-set against the last sequence number it observed fails.
* **Outcome:** Promotion denied. Primary stays leader.
#### Case 4: Network Instability (Flapping)
* **Scenario:** Intermittent packet loss.
* **Primary:** Fails 1 heartbeat, succeeds the next. `consecutive_failures` resets.
* **Replica:** Sees a slight delay in updates, but never reaches `failover_timeout`.
* **Outcome:** No Fencing, No Promotion. System rides out the noise.
### Contextual notes
* Clock skew: Tokio relies on monotonic clocks, which means `tokio::time::sleep(...)` is not affected by system clock corrections (such as NTP). But monotonic clocks are known to jump forward in some cases, such as VM live migrations. This could cause a false timeout of a single heartbeat. If `failure_threshold = 1`, this can mean a false negative on a node's health, and a potentially unnecessary demotion.

* `heartbeat_timeout == heartbeat_interval`: We intentionally do not provide separate settings for the timeout after which a heartbeat is considered failed and the interval between heartbeats. It could make sense, in configurations requiring low network latency, to have a small `heartbeat_timeout = 50ms` and a larger `heartbeat_interval = 2s`, but we do not have a practical use case for that yet, and a timeout larger than the interval does not make sense in any situation we can think of at the moment. So we decided on a single value for both, which makes the algorithm easier to reason about and implement.


@@ -0,0 +1,95 @@
# Architecture Decision Record: Staleness-Based Failover Mechanism & Observability
**Status:** Proposed
**Date:** 2026-01-09
**Preceded by:** [016-Harmony-Agent-And-Global-Mesh-For-Decentralized-Workload-Management.md](https://git.nationtech.io/NationTech/harmony/raw/branch/master/adr/016-Harmony-Agent-And-Global-Mesh-For-Decentralized-Workload-Management.md)
## Context
In ADR 016, we established the **Harmony Agent** and the **Global Orchestration Mesh** (powered by NATS JetStream) as the foundation for our decentralized infrastructure. We defined the high-level need for a `FailoverStrategy` that can support both financial consistency (CP) and AI availability (AP).
However, a specific implementation challenge remains: **How do we reliably detect node failure without losing the ability to debug the event later?**
Standard distributed systems often use "Key Expiration" (TTL) for heartbeats. If a key disappears, the node is presumed dead. While simple, this approach is catastrophic for post-mortem analysis. When the key expires, the evidence of *when* and *how* the failure occurred evaporates.
For NationTech's vision of **Humane Computing**—where micro datacenters might be heating a family home or running a local business—reliability and diagnosability are paramount. If a cluster fails over, we owe it to the user to provide a clear, historical log of exactly what happened. We cannot build a "wonderful future for computers" on ephemeral, untraceable errors.
## Decision
We will implement a **Staleness Detection** mechanism rather than a Key Expiration mechanism. We will leverage NATS JetStream Key-Value (KV) stores with **History Enabled** to create an immutable audit trail of cluster health.
### 1. The "Black Box" Flight Recorder (NATS Configuration)
We will utilize a persistent NATS KV bucket named `harmony_failover`.
* **Storage:** File (Persistent).
* **History:** Set to `64` (or higher). This allows us to query the last 64 heartbeat entries to visualize the exact degradation of the primary node before failure.
* **TTL:** None. Data never disappears; it only becomes "stale."
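Assuming the standard `nats` CLI, a bucket matching this description could be created along these lines (flag names should be verified against the installed CLI version):

```bash
# Persistent KV bucket, 64 revisions of history per key, no TTL
nats kv add harmony_failover --storage=file --history=64
```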
### 2. Data Structures
We will define two primary schemas to manage the state.
**A. The Rules of Engagement (`cluster_config`)**
This persistent key defines the behavior of the mesh. It allows us to tune failover sensitivity dynamically without redeploying the Agent binary.
```json
{
  "primary_site_id": "site-a-basement",
  "replica_site_id": "site-b-cloud",
  "failover_timeout_ms": 5000,   // Time before Replica takes over
  "heartbeat_interval_ms": 1000  // Frequency of Primary updates
}
```
> **Note:** The location for this configuration data structure is TBD. See https://git.nationtech.io/NationTech/harmony/issues/206
**B. The Heartbeat (`primary_heartbeat`)**
The Primary writes this; the Replica watches it.
```json
{
  "site_id": "site-a-basement",
  "status": "HEALTHY",
  "counter": 10452,
  "timestamp": 1704661549000
}
```
### 3. The Failover Algorithm
**The Primary (Site A) Logic:**
The Primary's ability to write to the mesh is its "License to Operate."
1. **Write Loop:** Attempts to write `primary_heartbeat` every `heartbeat_interval_ms`.
2. **Self-Preservation (Fencing):** If the write fails (NATS Ack timeout or NATS unreachable), the Primary **immediately self-demotes**. It assumes it is network-isolated. This prevents Split Brain scenarios where a partitioned Primary continues to accept writes while the Replica promotes itself.
**The Replica (Site B) Logic:**
The Replica acts as the watchdog.
1. **Watch:** Subscribes to updates on `primary_heartbeat`.
2. **Staleness Check:** Maintains a local timer. Every time a heartbeat arrives, the timer resets.
3. **Promotion:** If the timer exceeds `failover_timeout_ms`, the Replica declares the Primary dead and promotes itself to Leader.
4. **Yielding:** If the Replica is Leader, but suddenly receives a valid, new heartbeat from the configured `primary_site_id` (indicating the Primary has recovered), the Replica will voluntarily **demote** itself to restore the preferred topology.
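Both loops reduce to a failure counter on the Primary side and a staleness comparison on the Replica side. A condensed, synchronous sketch (the real agent is async and reads these values from NATS; the names here are illustrative):

```rust
// Primary side: consecutive failed heartbeat writes trigger self-fencing.
struct Primary {
    consecutive_failures: u32,
    failure_threshold: u32,
}

impl Primary {
    // Returns true when the primary must fence (self-demote).
    fn record_heartbeat(&mut self, write_succeeded: bool) -> bool {
        if write_succeeded {
            self.consecutive_failures = 0; // flapping resets the counter
        } else {
            self.consecutive_failures += 1;
        }
        self.consecutive_failures >= self.failure_threshold
    }
}

// Replica side: promote only once the last heartbeat is stale.
fn primary_is_stale(last_heartbeat_ms: u64, now_ms: u64, failover_timeout_ms: u64) -> bool {
    now_ms.saturating_sub(last_heartbeat_ms) > failover_timeout_ms
}

fn main() {
    let mut p = Primary { consecutive_failures: 0, failure_threshold: 2 };
    assert!(!p.record_heartbeat(false)); // one failure: keep running
    assert!(!p.record_heartbeat(true));  // a success resets the counter
    assert!(!p.record_heartbeat(false));
    assert!(p.record_heartbeat(false));  // second consecutive failure: fence

    assert!(!primary_is_stale(1_000, 5_000, 5_000)); // 4s old: keep waiting
    assert!(primary_is_stale(1_000, 7_000, 5_000));  // 6s old: promote
    println!("fencing and staleness checks behave as described");
}
```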
## Rationale
**Observability as a First-Class Citizen**
By keeping the last 64 heartbeats, we can run `nats kv history` to see the exact timeline. Did the Primary stop suddenly (a crash)? Or did the heartbeats become erratic and slow before stopping (network congestion)? This data is critical for optimizing the "Micro Data Centers" described in our vision, where internet connections in residential areas may vary in quality.
**Energy Efficiency & Resource Optimization**
NationTech aims to "maximize the value of our energy." A "flapping" cluster (constantly failing over and back) wastes immense energy in data re-synchronization and startup costs. By making the `failover_timeout_ms` configurable via `cluster_config`, we can tune a cluster heating a greenhouse to be less sensitive (slower failover is fine) compared to a cluster running a payment gateway.
**Decentralized Trust**
This architecture relies on NATS as the consensus engine. If the Primary is part of the NATS majority, it lives. If it isn't, it dies. This removes ambiguity and allows us to scale to thousands of independent sites without a central "God mode" controller managing every single failover.
## Consequences
**Positive**
* **Auditability:** Every failover event leaves a permanent trace in the KV history.
* **Safety:** The "Write Ack" check on the Primary provides a strong guarantee against Split Brain in `AbsoluteConsistency` mode.
* **Dynamic Tuning:** We can adjust timeouts for specific environments (e.g., high-latency satellite links) by updating a JSON key, requiring no downtime.
**Negative**
* **Storage Overhead:** Keeping history requires marginally more disk space on the NATS servers, though for 64 small JSON payloads, this is negligible.
* **Clock Skew:** While we rely on NATS server-side timestamps for ordering, extreme clock skew on the client side could confuse the debug logs (though not the failover logic itself).
## Alignment with Vision
This architecture supports the NationTech goal of a **"Beautifully Integrated Design."** It takes the complex, high-stakes problem of distributed consensus and wraps it in a mechanism that is robust enough for enterprise banking yet flexible enough to manage a basement server heating a swimming pool. It bridges the gap between the reliability of Web2 clouds and the decentralized nature of Web3 infrastructure.


@@ -0,0 +1,233 @@
# ADR 020-1: Zitadel OIDC and OpenBao Integration for the Config Store
Author: Jean-Gabriel Gill-Couture
Date: 2026-03-18
## Status
Proposed
## Context
ADR 020 defines a unified `harmony_config` crate with a `ConfigStore` trait. The default team-oriented backend is OpenBao, which provides encrypted storage, versioned KV, audit logging, and fine-grained access control.
OpenBao requires authentication. The question is how developers authenticate without introducing new credentials to manage.
The goals are:
- **Zero new credentials.** Developers log in with their existing corporate identity (Google Workspace, GitHub, or Microsoft Entra ID / Azure AD).
- **Headless compatibility.** The flow must work over SSH, inside containers, and in CI — environments with no browser or localhost listener.
- **Minimal friction.** After a one-time login, authentication should be invisible for weeks of active use.
- **Centralized offboarding.** Revoking a user in the identity provider must immediately revoke their access to the config store.
## Decision
Developers authenticate to OpenBao through a two-step process: first, they obtain an OIDC token from Zitadel (`sso.nationtech.io`) using the OAuth 2.0 Device Authorization Grant (RFC 8628); then, they exchange that token for a short-lived OpenBao client token via OpenBao's JWT auth method.
### The authentication flow
#### Step 1: Trigger
The `ConfigManager` attempts to resolve a value via the `StoreSource`. The `StoreSource` checks for a cached OpenBao token in `~/.local/share/harmony/session.json`. If the token is missing or expired, authentication begins.
#### Step 2: Device Authorization Request
Harmony sends a `POST` to Zitadel's device authorization endpoint:
```
POST https://sso.nationtech.io/oauth/v2/device_authorization
Content-Type: application/x-www-form-urlencoded

client_id=<harmony_client_id>&scope=openid email profile offline_access
```
Zitadel responds with:
```json
{
  "device_code": "dOcbPeysDhT26ZatRh9n7Q",
  "user_code": "GQWC-FWFK",
  "verification_uri": "https://sso.nationtech.io/device",
  "verification_uri_complete": "https://sso.nationtech.io/device?user_code=GQWC-FWFK",
  "expires_in": 300,
  "interval": 5
}
```
#### Step 3: User prompt
Harmony prints the code and URL to the terminal:
```
[Harmony] To authenticate, open your browser to:
https://sso.nationtech.io/device
and enter code: GQWC-FWFK
Or visit: https://sso.nationtech.io/device?user_code=GQWC-FWFK
```
If a desktop environment is detected, Harmony also calls `open` / `xdg-open` to launch the browser automatically. The `verification_uri_complete` URL pre-fills the code, so the user only needs to click "Confirm" after logging in.
There is no localhost HTTP listener. The CLI does not need to bind a port or receive a callback. This is what makes the device flow work over SSH, in containers, and through corporate firewalls — unlike the `oc login` approach, which spins up a temporary web server to catch a redirect.
#### Step 4: User login
The developer logs in through Zitadel's web UI using one of the configured identity providers:
- **Google Workspace** — for teams using Google as their corporate identity.
- **GitHub** — for open-source or GitHub-centric teams.
- **Microsoft Entra ID (Azure AD)** — for enterprise clients, particularly common in Quebec and the broader Canadian public sector.
Zitadel federates the login to the chosen provider. The developer authenticates with their existing corporate credentials. No new password is created.
#### Step 5: Polling
While the user is authenticating in the browser, Harmony polls Zitadel's token endpoint at the interval specified in the device authorization response (typically 5 seconds):
```
POST https://sso.nationtech.io/oauth/v2/token
Content-Type: application/x-www-form-urlencoded

grant_type=urn:ietf:params:oauth:grant-type:device_code
&device_code=dOcbPeysDhT26ZatRh9n7Q
&client_id=<harmony_client_id>
```
Before the user completes login, Zitadel responds with `authorization_pending`. Once the user consents, Zitadel returns:
```json
{
  "access_token": "...",
  "token_type": "Bearer",
  "expires_in": 3600,
  "refresh_token": "...",
  "id_token": "eyJhbGciOiJSUzI1NiIs..."
}
```
The `scope=offline_access` in the initial request is what causes Zitadel to issue a `refresh_token`.
#### Step 6: OpenBao JWT exchange
Harmony sends the `id_token` (a JWT signed by Zitadel) to OpenBao's JWT auth method:
```
POST https://secrets.nationtech.io/v1/auth/jwt/login
Content-Type: application/json

{
  "role": "harmony-developer",
  "jwt": "eyJhbGciOiJSUzI1NiIs..."
}
```
OpenBao validates the JWT:
1. It fetches Zitadel's public keys from `https://sso.nationtech.io/oauth/v2/keys` (the JWKS endpoint).
2. It verifies the JWT signature.
3. It reads the claims (`email`, `groups`, and any custom claims mapped from the upstream identity provider, such as Azure AD tenant or Google Workspace org).
4. It evaluates the claims against the `bound_claims` and `bound_audiences` configured on the `harmony-developer` role.
5. If validation passes, OpenBao returns a client token:
```json
{
  "auth": {
    "client_token": "hvs.CAES...",
    "policies": ["harmony-dev"],
    "metadata": { "role": "harmony-developer" },
    "lease_duration": 14400,
    "renewable": true
  }
}
```
Harmony caches the OpenBao token, the OIDC refresh token, and the token expiry timestamps to `~/.local/share/harmony/session.json` with `0600` file permissions.
### OpenBao storage structure
All configuration and secret state is stored in an OpenBao Versioned KV v2 engine.
Path taxonomy:
```
harmony/<organization>/<project>/<environment>/<key>
```
Examples:
```
harmony/nationtech/my-app/staging/PostgresConfig
harmony/nationtech/my-app/production/PostgresConfig
harmony/nationtech/my-app/local-shared/PostgresConfig
```
The `ConfigClass` (Standard vs. Secret) can influence OpenBao policy structure — for example, `Secret`-class paths could require stricter ACLs or additional audit backends — but the path taxonomy itself does not change. This is an operational concern configured in OpenBao policies, not a structural one enforced by path naming.
### Token lifecycle and silent refresh
The system manages three tokens with different lifetimes:
| Token | TTL | Max TTL | Purpose |
|---|---|---|---|
| OpenBao client token | 4 hours | 24 hours | Read/write config store |
| OIDC ID token | 1 hour | — | Exchange for OpenBao token |
| OIDC refresh token | 90 days absolute, 30 days inactivity | — | Obtain new ID tokens silently |
The refresh flow, from the developer's perspective:
1. **Same session (< 4 hours since last use).** The cached OpenBao token is still valid. No network call to Zitadel. Fastest path.
2. **Next day (OpenBao token expired, refresh token valid).** Harmony uses the OIDC `refresh_token` to request a new `id_token` from Zitadel's token endpoint (`grant_type=refresh_token`). It then exchanges the new `id_token` for a fresh OpenBao token. This happens silently. The developer sees no prompt.
3. **OpenBao token near max TTL (approaching 24 hours of cumulative renewals).** Instead of renewing, Harmony re-authenticates using the refresh token to get a completely fresh OpenBao token. Transparent to the user.
4. **After 30 days of inactivity.** The OIDC refresh token expires. Harmony falls back to the device flow (Step 2 above) and prompts the user to re-authenticate in the browser. This is the only scenario where a returning developer sees a login prompt.
5. **User offboarded.** An administrator revokes the user's account or group membership in Zitadel. The next time the refresh token is used, Zitadel rejects it. The device flow also fails because the user can no longer authenticate. Access is terminated without any action needed on the OpenBao side.
OpenBao token renewal uses the `/auth/token/renew-self` endpoint with the `X-Vault-Token` header. Harmony renews proactively at ~75% of the TTL to avoid race conditions.
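The decision logic behind these scenarios is a simple cascade. A hypothetical sketch (the timestamps would come from `session.json`; the enum and function names are invented for illustration):

```rust
use std::time::{Duration, SystemTime};

#[derive(Debug, PartialEq)]
enum AuthAction {
    UseCachedToken, // scenario 1: OpenBao token still valid
    SilentRefresh,  // scenarios 2-3: trade the refresh token for a new one
    DeviceFlow,     // scenario 4: refresh token expired, prompt the user
}

fn next_auth_action(
    now: SystemTime,
    bao_token_expiry: SystemTime,   // shifted to ~75% of TTL by the caller
    bao_token_max_ttl_reached: bool,
    refresh_token_expiry: SystemTime,
) -> AuthAction {
    if now < bao_token_expiry && !bao_token_max_ttl_reached {
        AuthAction::UseCachedToken
    } else if now < refresh_token_expiry {
        AuthAction::SilentRefresh
    } else {
        AuthAction::DeviceFlow
    }
}

fn main() {
    let now = SystemTime::UNIX_EPOCH + Duration::from_secs(1_000_000);
    let later = now + Duration::from_secs(3_600);
    let earlier = now - Duration::from_secs(3_600);

    // Scenario 1: everything valid -> cached token, no network call.
    assert_eq!(next_auth_action(now, later, false, later), AuthAction::UseCachedToken);
    // Scenario 2: OpenBao token expired, refresh token valid -> silent refresh.
    assert_eq!(next_auth_action(now, earlier, false, later), AuthAction::SilentRefresh);
    // Scenario 3: max TTL reached -> re-authenticate via refresh token.
    assert_eq!(next_auth_action(now, later, true, later), AuthAction::SilentRefresh);
    // Scenario 4: refresh token expired -> fall back to the device flow.
    assert_eq!(next_auth_action(now, earlier, false, earlier), AuthAction::DeviceFlow);
}
```

Scenario 5 (offboarding) needs no branch of its own: the silent refresh or device flow simply fails at Zitadel.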
### OpenBao role configuration
The OpenBao JWT auth role for Harmony developers:
```bash
bao write auth/jwt/config \
  oidc_discovery_url="https://sso.nationtech.io" \
  bound_issuer="https://sso.nationtech.io"

bao write auth/jwt/role/harmony-developer \
  role_type="jwt" \
  bound_audiences="<harmony_client_id>" \
  user_claim="email" \
  groups_claim="urn:zitadel:iam:org:project:roles" \
  policies="harmony-dev" \
  ttl="4h" \
  max_ttl="24h" \
  token_type="service"
```
The `bound_audiences` parameter ties the role to the specific Harmony Zitadel application. The `groups_claim` allows mapping Zitadel project roles to OpenBao policies for per-team or per-project access control.
### Self-hosted deployments
For organizations running their own infrastructure, the same architecture applies. The operator deploys Zitadel and OpenBao using Harmony's existing `ZitadelScore` and `OpenbaoScore`. The only configuration needed is three environment variables (or their equivalents in the bootstrap config):
- `HARMONY_SSO_URL` — the Zitadel instance URL.
- `HARMONY_SECRETS_URL` — the OpenBao instance URL.
- `HARMONY_SSO_CLIENT_ID` — the Zitadel application client ID.
None of these are secrets. They can be committed to an infrastructure repository or distributed via any convenient channel.
## Consequences
### Positive
- Developers authenticate with existing corporate credentials. No new passwords, no static tokens to distribute.
- The device flow works in every environment: local terminal, SSH, containers, CI runners, corporate VPNs.
- Silent token refresh keeps developers authenticated for weeks without any manual intervention.
- User offboarding is a single action in Zitadel. No OpenBao token rotation or manual revocation required.
- Azure AD / Microsoft Entra ID support addresses the enterprise and public sector market.
### Negative
- The OAuth state machine (device code polling, token refresh, error handling) adds implementation complexity compared to a static token approach.
- Developers must have network access to `sso.nationtech.io` and `secrets.nationtech.io` to pull or push configuration state. True offline work falls back to the local file store, which does not sync with the team.
- The first login per machine requires a browser interaction. Fully headless first-run scenarios (e.g., a fresh CI runner with no pre-seeded tokens) must use `EnvSource` overrides or a service account JWT.


@@ -0,0 +1,177 @@
# ADR 020: Unified Configuration and Secret Management
Author: Jean-Gabriel Gill-Couture
Date: 2026-03-18
## Status
Proposed
## Context
Harmony's orchestration logic depends on runtime data that falls into two categories:
1. **Secrets** — credentials, tokens, private keys.
2. **Operational configuration** — deployment targets, host selections, port assignments, reboot decisions, and similar contextual choices.
Both categories share the same fundamental lifecycle: a value must be acquired before execution can proceed, it may come from several backends (environment variable, remote store, interactive prompt), and it must be shareable across a team without polluting the Git repository.
Treating these categories as separate subsystems forces developers to choose between a "config API" and a "secret API" at every call site. The only meaningful difference between the two is how the storage backend handles the data (plaintext vs. encrypted, audited vs. unaudited) and how the CLI displays it (visible vs. masked). That difference belongs in the backend, not in the application code.
Three concrete problems drive this change:
- **Async terminal corruption.** `inquire` prompts assume exclusive terminal ownership. Background tokio tasks emitting log output during a prompt corrupt the terminal state. This is inherent to Harmony's concurrent orchestration model.
- **Untestable code paths.** Any function containing an inline `inquire` call requires a real TTY to execute. Unit testing is impossible without ignoring the test entirely.
- **No backend integration.** Inline prompts cannot be answered from a remote store, an environment variable, or a CI pipeline. Every automated deployment that passes through a prompting code path requires a human operator at a terminal.
## Decision
A single workspace crate, `harmony_config`, provides all configuration and secret acquisition for Harmony. It replaces both `harmony_secret` and all inline `inquire` usage.
### Schema in Git, state in the store
The Rust type system serves as the configuration schema. Developers declare what configuration is needed by defining structs:
```rust
#[derive(Config, Serialize, Deserialize, JsonSchema, InteractiveParse)]
struct PostgresConfig {
    pub host: String,
    pub port: u16,
    #[config(secret)]
    pub password: String,
}
```
These structs live in Git and evolve with the code. When a branch introduces a new field, Git tracks that schema change. The actual values live in an external store — OpenBao by default. No `.env` files, no JSON config files, no YAML in the repository.
### Data classification
```rust
/// Tells the storage backend how to handle the data.
pub enum ConfigClass {
    /// Plaintext storage is acceptable.
    Standard,
    /// Must be encrypted at rest, masked in UI, subject to audit logging.
    Secret,
}
```
Classification is determined at the struct level. A struct with no `#[config(secret)]` fields has `ConfigClass::Standard`. A struct with one or more `#[config(secret)]` fields is elevated to `ConfigClass::Secret`. The struct is always stored as a single cohesive JSON blob; field-level splitting across backends is not a concern of the trait.
The `#[config(secret)]` attribute also instructs the `PromptSource` to mask terminal input for that field during interactive prompting.
### The Config trait
```rust
pub trait Config: Serialize + DeserializeOwned + JsonSchema + InteractiveParseObj + Sized {
    /// Stable lookup key. By default, the struct name.
    const KEY: &'static str;
    /// How the backend should treat this data.
    const CLASS: ConfigClass;
}
```
A `#[derive(Config)]` proc macro generates the implementation. The macro inspects field attributes to determine `CLASS`.
### The ConfigStore trait
```rust
#[async_trait]
pub trait ConfigStore: Send + Sync {
    async fn get(
        &self,
        class: ConfigClass,
        namespace: &str,
        key: &str,
    ) -> Result<Option<serde_json::Value>, ConfigError>;

    async fn set(
        &self,
        class: ConfigClass,
        namespace: &str,
        key: &str,
        value: &serde_json::Value,
    ) -> Result<(), ConfigError>;
}
```
The `class` parameter is a hint. The store implementation decides what to do with it. An OpenBao store may route `Secret` data to a different path prefix or apply stricter ACLs. A future store could split fields across backends — that is an implementation concern, not a trait concern.
### Resolution chain
The `ConfigManager` tries sources in priority order:
1. **`EnvSource`** — reads `HARMONY_CONFIG_{KEY}` as a JSON string. Override hatch for CI/CD pipelines and containerized environments.
2. **`StoreSource`** — wraps a `ConfigStore` implementation. For teams, this is the OpenBao backend authenticated via Zitadel OIDC (see ADR 020-1).
3. **`PromptSource`** — presents an `interactive-parse` prompt on the terminal. Acquires a process-wide async mutex before rendering to prevent log output corruption.
When `PromptSource` obtains a value, the `ConfigManager` persists it back to the `StoreSource` so that subsequent runs — by the same developer or any teammate — resolve without prompting.
Callers that do not include `PromptSource` in their source list never block on a TTY. Test code passes empty source lists and constructs config structs directly.
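The priority order and the write-back behavior can be sketched with a synchronous stand-in. The `Source` trait and `get_or_prompt` signature here are assumptions for illustration; only the source names (`EnvSource`, `StoreSource`, `PromptSource`) come from the ADR:

```rust
use std::cell::RefCell;
use std::collections::HashMap;

trait Source {
    fn resolve(&self, key: &str) -> Option<String>;
}

// Stand-in for the OpenBao-backed StoreSource.
struct StoreSource {
    data: RefCell<HashMap<String, String>>,
}

impl Source for StoreSource {
    fn resolve(&self, key: &str) -> Option<String> {
        self.data.borrow().get(key).cloned()
    }
}

// Stand-in for PromptSource; a real one would render a TTY prompt.
struct PromptSource;

impl Source for PromptSource {
    fn resolve(&self, _key: &str) -> Option<String> {
        Some("value-from-terminal".to_string())
    }
}

fn get_or_prompt(store: &StoreSource, prompt: &PromptSource, key: &str) -> Option<String> {
    // 1. EnvSource: highest priority, never persisted.
    if let Ok(v) = std::env::var(format!("HARMONY_CONFIG_{key}")) {
        return Some(v);
    }
    // 2. StoreSource: shared team state.
    if let Some(v) = store.resolve(key) {
        return Some(v);
    }
    // 3. PromptSource: last resort; the answer is written back to the
    //    store so teammates are never prompted.
    let v = prompt.resolve(key)?;
    store.data.borrow_mut().insert(key.to_string(), v.clone());
    Some(v)
}

fn main() {
    let store = StoreSource { data: RefCell::new(HashMap::new()) };
    // First run: store miss, prompt fires and persists.
    let first = get_or_prompt(&store, &PromptSource, "PostgresConfig");
    assert_eq!(first.as_deref(), Some("value-from-terminal"));
    // Subsequent runs resolve from the store without prompting.
    assert!(store.resolve("PostgresConfig").is_some());
}
```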
### Schema versioning
The Rust struct is the schema. When a developer renames a field, removes a field, or changes a type on a branch, the store may still contain data shaped for a previous version of the struct. If another team member who does not yet have that commit runs the code, `serde_json::from_value` will fail on the stale entry.
In the initial implementation, the resolution chain handles this gracefully: a deserialization failure is treated as a cache miss, and the `PromptSource` fires. The prompted value overwrites the stale entry in the store.
This is sufficient for small teams working on short-lived branches. It is not sufficient at scale, where silent re-prompting could mask real configuration drift.
A future iteration will introduce a compile-time schema migration mechanism, similar to how `sqlx` verifies queries against a live database at compile time. The mechanism will:
- Detect schema drift between the Rust struct and the stored JSON.
- Apply named, ordered migration functions to transform stored data forward.
- Reject ambiguous migrations at compile time rather than silently corrupting state.
Until that mechanism exists, teams should treat store entries as soft caches: the struct definition is always authoritative, and the store is best-effort.
## Rationale
**Why merge secrets and config into one crate?** Separate crates with nearly identical trait shapes (`Secret` vs `Config`, `SecretStore` vs `ConfigStore`) force developers to make a classification decision at every call site. A unified crate with a `ConfigClass` discriminator moves that decision to the struct definition, where it belongs.
**Why OpenBao as the default backend?** OpenBao is a fully open-source Vault fork under the Linux Foundation. It runs on-premises with no phone-home requirement — a hard constraint for private cloud and regulated environments. Harmony already deploys OpenBao for clients (`OpenbaoScore`), so no new infrastructure is introduced.
**Why not store values in Git (e.g., encrypted YAML)?** Git-tracked config files create merge conflicts, require re-encryption on team membership changes, and leak metadata (file names, key names) even when values are encrypted. Storing state in OpenBao avoids all of these issues and provides audit logging, access control, and versioned KV out of the box.
**Why keep `PromptSource`?** Removing interactive prompts entirely would break the zero-infrastructure bootstrapping path and eliminate human-confirmation safety gates for destructive operations (interface reconfiguration, node reboot). The problem was never that prompts exist — it is that they were unavoidable and untestable. Making `PromptSource` an explicit, opt-in entry in the source list restores control.
## Consequences
### Positive
- A single API surface for all runtime data acquisition.
- All currently-ignored tests become runnable without TTY access.
- Async terminal corruption is eliminated by the process-wide prompt mutex.
- The bootstrapping path requires no infrastructure for a first run; `PromptSource` alone is sufficient.
- The team path (OpenBao + Zitadel) reuses infrastructure Harmony already deploys.
- User offboarding is a single Zitadel action.
### Negative
- Migrating all inline `inquire` and `harmony_secret` call sites is a significant refactoring effort.
- Until the schema migration mechanism is built, store entries for renamed or removed fields become stale and must be re-prompted.
- The Zitadel device flow introduces a browser step on first login per machine.
## Implementation Plan
### Phase 1: Trait design and crate restructure
Refactor `harmony_config` to define the final `Config`, `ConfigClass`, and `ConfigStore` traits. Update the derive macro to support `#[config(secret)]` and generate the correct `CLASS` constant. Implement `EnvSource` and `PromptSource` against the new traits. Write comprehensive unit tests using mock stores.
### Phase 2: Absorb `harmony_secret`
Migrate the `OpenbaoSecretStore`, `InfisicalSecretStore`, and `LocalFileSecretStore` implementations from `harmony_secret` into `harmony_config` as `ConfigStore` backends. Update all call sites that use `SecretManager::get`, `SecretManager::get_or_prompt`, or `SecretManager::set` to use `harmony_config` equivalents.
### Phase 3: Migrate inline prompts
Replace all inline `inquire` call sites in the `harmony` crate (`infra/brocade.rs`, `infra/network_manager.rs`, `modules/okd/host_network.rs`, and others) with `harmony_config` structs and `get_or_prompt` calls. Un-ignore the affected tests.
### Phase 4: Zitadel and OpenBao integration
Implement the authentication flow described in ADR 020-1. Wire `StoreSource` to use Zitadel OIDC tokens for OpenBao access. Implement token caching and silent refresh.
### Phase 5: Remove `harmony_secret`
Delete the `harmony_secret` and `harmony_secret_derive` crates from the workspace. All functionality now lives in `harmony_config`.

docs/adr/README.md Normal file

@@ -0,0 +1,63 @@
# Architecture Decision Records
An Architecture Decision Record (ADR) documents a significant architectural decision made during the development of Harmony — along with its context, rationale, and consequences.
## Why We Use ADRs
As a platform engineering framework used by a team, Harmony accumulates technical decisions over time. ADRs help us:
- **Track rationale** — understand _why_ a decision was made, not just _what_ was decided
- **Onboard new contributors** — the "why" is preserved even when team membership changes
- **Avoid repeating past mistakes** — previous decisions and their context are searchable
- **Manage technical debt** — ADRs make it easier to revisit and revise past choices
An ADR captures a decision at a point in time. It is not a specification — it is a record of reasoning.
## ADR Format
Every ADR follows this structure:
| Section | Purpose |
|---------|---------|
| **Status** | Proposed / Pending / Accepted / Implemented / Deprecated |
| **Context** | The problem or background — the "why" behind this decision |
| **Decision** | The chosen solution or direction |
| **Rationale** | Reasoning behind the decision |
| **Consequences** | Both positive and negative outcomes |
| **Alternatives considered** | Other options that were evaluated |
| **Additional Notes** | Supplementary context, links, or open questions |
## ADR Index
| Number | Title | Status |
|--------|-------|--------|
| [000](./000-ADR-Template.md) | ADR Template | Reference |
| [001](./001-rust.md) | Why Rust | Accepted |
| [002](./002-hexagonal-architecture.md) | Hexagonal Architecture | Accepted |
| [003](./003-infrastructure-abstractions.md) | Infrastructure Abstractions | Accepted |
| [004](./004-ipxe.md) | iPXE | Accepted |
| [005](./005-interactive-project.md) | Interactive Project | Proposed |
| [006](./006-secret-management.md) | Secret Management | Accepted |
| [007](./007-default-runtime.md) | Default Runtime | Accepted |
| [008](./008-score-display-formatting.md) | Score Display Formatting | Proposed |
| [009](./009-helm-and-kustomize-handling.md) | Helm and Kustomize Handling | Accepted |
| [010](./010-monitoring-and-alerting.md) | Monitoring and Alerting | Accepted |
| [011](./011-multi-tenant-cluster.md) | Multi-Tenant Cluster | Accepted |
| [012](./012-project-delivery-automation.md) | Project Delivery Automation | Proposed |
| [013](./013-monitoring-notifications.md) | Monitoring Notifications | Accepted |
| [015](./015-higher-order-topologies.md) | Higher Order Topologies | Proposed |
| [016](./016-Harmony-Agent-And-Global-Mesh-For-Decentralized-Workload-Management.md) | Harmony Agent and Global Mesh | Proposed |
| [017-1](./017-1-Nats-Clusters-Interconnection-Topology.md) | NATS Clusters Interconnection Topology | Proposed |
| [018](./018-Template-Hydration-For-Workload-Deployment.md) | Template Hydration for Workload Deployment | Proposed |
| [019](./019-Network-bond-setup.md) | Network Bond Setup | Proposed |
| [020-1](./020-1-zitadel-openbao-secure-config-store.md) | Zitadel + OpenBao Secure Config Store | Accepted |
| [020](./020-interactive-configuration-crate.md) | Interactive Configuration Crate | Proposed |
## Contributing
When making a significant technical change:
1. **Check existing ADRs** — the decision may already be documented
2. **Create a new ADR** using the [template](./000-ADR-Template.md) if the change warrants architectural discussion
3. **Set status to Proposed** and open it for team review
4. Once accepted and implemented, update the status accordingly


@@ -0,0 +1,181 @@
# Harmony Architecture — Three Open Challenges
Three problems that, if solved well, would make Harmony the most capable infrastructure automation framework in existence.
## 1. Topology Evolution During Deployment
### The problem
A bare-metal OKD deployment is a multi-hour process where the infrastructure's capabilities change as the deployment progresses:
```
Phase 0: Network only → OPNsense reachable, Brocade reachable, no hosts
Phase 1: Discovery → PXE boots work, hosts appear via mDNS, no k8s
Phase 2: Bootstrap → openshift-install running, API partially available
Phase 3: Control plane → k8s API available, operators converging, no workers
Phase 4: Workers → Full cluster, apps can be deployed
Phase 5: Day-2 → Monitoring, alerting, tenant onboarding
```
Today, `HAClusterTopology` implements _all_ capability traits from the start. If a Score calls `k8s_client()` during Phase 0, it hits `DummyInfra` which panics. The type system says "this is valid" but the runtime says "this will crash."
### Why it matters
- Scores that require k8s compile and register happily at Phase 0, then panic if accidentally executed too early
- The pipeline is ordered by convention (Stage 01 → 02 → 03 → ...) but nothing enforces that Stage 04 can't run before Stage 02
- Adding new capabilities (like "cluster has monitoring installed") requires editing the topology struct, not declaring the capability was acquired
### Design direction
The topology should evolve through **phases** where capabilities are _acquired_, not assumed. Two possible approaches:
**A. Phase-gated topology (runtime)**
The topology tracks which phase it's in. Capability methods check the phase before executing and return a meaningful error instead of panicking:
```rust
impl K8sclient for HAClusterTopology {
    async fn k8s_client(&self) -> Result<Arc<K8sClient>, String> {
        if self.phase < Phase::ControlPlaneReady {
            return Err(format!(
                "k8s API not available yet (current phase: {:?})",
                self.phase
            ));
        }
        // ... actual implementation
    }
}
```
Scores that fail due to phase mismatch get a clear error message, not a panic. The Maestro can validate phase requirements before executing a Score.
**B. Typestate topology (compile-time)**
Use Rust's type system to make invalid phase transitions unrepresentable:
```rust
struct Topology<P: Phase> { ... }

impl Topology<NetworkReady> {
    fn bootstrap(self) -> Topology<Bootstrapping> { ... }
}

impl Topology<Bootstrapping> {
    fn promote(self) -> Topology<ClusterReady> { ... }
}

// Only ClusterReady implements K8sclient
impl K8sclient for Topology<ClusterReady> { ... }
```
This is the "correct" Rust approach but requires significant refactoring and may be too rigid for real deployments where phases overlap.
**Recommendation**: Start with (A) — runtime phase tracking. It's additive (no breaking changes), catches the DummyInfra panic problem immediately, and provides the data needed for (B) later.
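The runtime check in option (A) relies on phases having a total order, which a derived `Ord` on an enum provides for free. The enum below is an assumption modeled on the phase table at the top of this section:

```rust
// Variant order defines the ordering: earlier phases compare as less
// than later ones, so capability checks are a single comparison.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Phase {
    NetworkOnly,
    Discovery,
    Bootstrap,
    ControlPlaneReady,
    WorkersReady,
    DayTwo,
}

fn main() {
    let current = Phase::Discovery;
    // A Score requiring the k8s API is rejected before ControlPlaneReady.
    assert!(current < Phase::ControlPlaneReady);
    // Later phases retain earlier capabilities.
    assert!(Phase::WorkersReady >= Phase::ControlPlaneReady);
}
```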
---
## 2. Runtime Plan & Validation Phase
### The problem
Harmony validates Scores at compile time: if a Score requires `DhcpServer + TftpServer`, the topology must implement both traits or the program won't compile. This is powerful but insufficient.
What compile-time _cannot_ check:
- Is the OPNsense API actually reachable right now?
- Does VLAN 100 already exist (so we can skip creating it)?
- Is there already a DHCP entry for this MAC address?
- Will this firewall rule conflict with an existing one?
- Is there enough disk space on the TFTP server for the boot images?
Today, these are discovered at execution time, deep inside an Interpret's `execute()` method. A failure at minute 45 of a deployment is expensive.
### Why it matters
- No way to preview what Harmony will do before it does it
- No way to detect conflicts or precondition failures early
- Operators must read logs to understand what happened — there's no structured "here's what I did" report
- Re-running a deployment is scary because you don't know what will be re-applied vs skipped
### Design direction
Add a **validate** phase to the Score/Interpret lifecycle:
```rust
#[async_trait]
pub trait Interpret<T>: Debug + Send {
    /// Check preconditions and return what this interpret WOULD do.
    /// Default implementation returns "will execute" (opt-in validation).
    async fn validate(
        &self,
        inventory: &Inventory,
        topology: &T,
    ) -> Result<ValidationReport, InterpretError> {
        Ok(ValidationReport::will_execute(self.get_name()))
    }

    /// Execute the interpret (existing method, unchanged).
    async fn execute(
        &self,
        inventory: &Inventory,
        topology: &T,
    ) -> Result<Outcome, InterpretError>;

    // ... existing methods
}
```
A `ValidationReport` would contain:
- **Status**: `WillCreate`, `WillUpdate`, `WillDelete`, `AlreadyApplied`, `Blocked(reason)`
- **Details**: human-readable description of planned changes
- **Preconditions**: list of checks performed and their results
The Maestro would run validation for all registered Scores before executing any of them, producing a plan that the operator reviews.
This is opt-in: Scores that don't implement `validate()` get a default "will execute" report. Over time, each Score adds validation logic. The OPNsense Scores are ideal first candidates since they can query current state via the API.
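A `ValidationReport` matching the bullet list above could be shaped as follows. The field names, the `will_execute` constructor, and the precondition representation are assumptions for illustration:

```rust
#[derive(Debug)]
enum ValidationStatus {
    WillCreate,
    WillUpdate,
    WillDelete,
    AlreadyApplied,
    Blocked(String),
}

#[allow(dead_code)]
#[derive(Debug)]
struct ValidationReport {
    interpret: String,
    status: ValidationStatus,
    details: String,
    // (check description, passed?) for each precondition queried.
    preconditions: Vec<(String, bool)>,
}

impl ValidationReport {
    /// Default report for Interprets that opt out of validation:
    /// no checks performed, assumed to execute.
    fn will_execute(name: String) -> Self {
        ValidationReport {
            interpret: name,
            status: ValidationStatus::WillCreate,
            details: "no validation implemented; will execute".into(),
            preconditions: vec![],
        }
    }
}

fn main() {
    let report = ValidationReport::will_execute("OpnsenseDhcpInterpret".into());
    assert!(matches!(report.status, ValidationStatus::WillCreate));
    assert!(report.preconditions.is_empty());
}
```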
### Relationship to state
This approach does _not_ require a state file. Validation queries the infrastructure directly — the same philosophy Harmony already follows. The "plan" is computed fresh every time by asking the infrastructure what exists right now.
---
## 3. TUI as Primary Interface
### The problem
The TUI (`harmony_tui`) exists with ratatui, crossterm, and tui-logger, but it's underused. The CLI (`harmony_cli`) is the primary interface. During a multi-hour deployment, operators watch scrolling log output with no structure, no ability to drill into a specific Score's progress, and no overview of where they are in the pipeline.
### Why it matters
- Log output during interactive prompts corrupts the terminal
- No way to see "I'm on Stage 3 of 7, 2 hours elapsed, 3 Scores completed successfully"
- No way to inspect a Score's configuration or outcome without reading logs
- The pipeline feels like a black box during execution
### Design direction
The TUI should provide three views:
**Pipeline view** — the default. Shows the ordered list of Scores with their status:
```
OKD HA Cluster Deployment [Stage 3/7 — 1h 42m elapsed]
──────────────────────────────────────────────────────────────────
✅ OKDIpxeScore 2m 14s
✅ OKDSetup01InventoryScore 8m 03s
✅ OKDSetup02BootstrapScore 34m 21s
▶ OKDSetup03ControlPlaneScore ... running
⏳ OKDSetupPersistNetworkBondScore
⏳ OKDSetup04WorkersScore
⏳ OKDSetup06InstallationReportScore
```
**Detail view** — press Enter on a Score to see its Outcome details, sub-score executions, and logs.
**Log view** — the current tui-logger panel, filtered to the selected Score.
The TUI already has the Score widget and log integration. What's missing is the pipeline-level orchestration view and the duration/status data — which the `Score::interpret` timing we just added now provides.
### Immediate enablers
The instrumentation event system (`HarmonyEvent`) already captures start/finish with execution IDs. The TUI subscriber just needs to:
1. Track the ordered list of Scores from the Maestro
2. Update status as `InterpretExecutionStarted`/`Finished` events arrive
3. Render the pipeline view using ratatui
This doesn't require architectural changes — it's a TUI feature built on existing infrastructure.
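The bookkeeping steps above can be sketched as a small subscriber. The event variant names follow the `InterpretExecutionStarted`/`Finished` events mentioned in the text; everything else (struct shape, status enum) is an assumption:

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Pending,
    Running,
    Done,
}

// Simplified stand-in for the real HarmonyEvent enum.
enum HarmonyEvent {
    InterpretExecutionStarted { execution_id: u64, score: String },
    InterpretExecutionFinished { execution_id: u64 },
}

struct PipelineView {
    status: HashMap<String, Status>,
    running: HashMap<u64, String>, // execution_id -> score name
}

impl PipelineView {
    // Step 1: track the ordered list of Scores from the Maestro.
    fn new(order: &[&str]) -> Self {
        let status: HashMap<String, Status> =
            order.iter().map(|s| (s.to_string(), Status::Pending)).collect();
        PipelineView { status, running: HashMap::new() }
    }

    // Step 2: update status as events arrive. Step 3 (rendering with
    // ratatui) would read `self.status` each frame.
    fn on_event(&mut self, event: HarmonyEvent) {
        match event {
            HarmonyEvent::InterpretExecutionStarted { execution_id, score } => {
                self.status.insert(score.clone(), Status::Running);
                self.running.insert(execution_id, score);
            }
            HarmonyEvent::InterpretExecutionFinished { execution_id } => {
                if let Some(score) = self.running.remove(&execution_id) {
                    self.status.insert(score, Status::Done);
                }
            }
        }
    }
}

fn main() {
    let mut view = PipelineView::new(&["OKDIpxeScore", "OKDSetup01InventoryScore"]);
    view.on_event(HarmonyEvent::InterpretExecutionStarted {
        execution_id: 1,
        score: "OKDIpxeScore".into(),
    });
    view.on_event(HarmonyEvent::InterpretExecutionFinished { execution_id: 1 });
    assert_eq!(view.status["OKDIpxeScore"], Status::Done);
    assert_eq!(view.status["OKDSetup01InventoryScore"], Status::Pending);
}
```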


@@ -84,7 +84,7 @@ Network services that run inside the cluster or as part of the topology.
- **OKDLoadBalancerScore**: Configures the high-availability load balancers for the OKD API and ingress.
- **OKDBootstrapLoadBalancerScore**: Configures the load balancer specifically for the bootstrap-time API endpoint.
- **K8sIngressScore**: Configures an Ingress controller or resource.
- [HighAvailabilityHostNetworkScore](../../harmony/src/modules/okd/host_network.rs): Configures network bonds on a host and the corresponding port-channels on the switch stack for high-availability.
- **HighAvailabilityHostNetworkScore**: Configures network bonds on a host and the corresponding port-channels on the switch stack for high-availability.
## Tenant Management


@@ -8,16 +8,6 @@ We use here the context of the KVM module to explain the coding style. This will
## Core Philosophy
### The Careful Craftsman Principle
Harmony is a powerful framework that does a lot. With that power comes responsibility. Every abstraction, every trait, every module must earn its place. Before adding anything, ask:
1. **Does this solve a real problem users have?** Not a theoretical problem, an actual one encountered in production.
2. **Is this the simplest solution that works?** Complexity is a cost that compounds over time.
3. **Will this make the next developer's life easier or harder?** Code is read far more often than written.
When in doubt, don't abstract. Wait for the pattern to emerge from real usage. A little duplication is better than the wrong abstraction.
### High-level functions over raw primitives
Callers should not need to know about underlying protocols, XML schemas, or API quirks. A function that deploys a VM should accept meaningful parameters like CPU count, memory, and network name — not XML strings.
@@ -237,63 +227,3 @@ docs: add coding guide
```
Keep pull requests small and single-purpose (under ~200 lines excluding generated code). Do not mix refactoring, bug fixes, and new features in one PR.
---
## When to Add Abstractions
Harmony provides powerful abstraction mechanisms: traits, generics, the Score/Interpret pattern, and capabilities. Use them judiciously.
### Add an abstraction when:
- **You have three or more concrete implementations** doing the same thing. Two is often coincidence; three is a pattern.
- **The abstraction provides compile-time safety** that prevents real bugs (e.g., capability bounds on topologies).
- **The abstraction hides genuine complexity** that callers shouldn't need to understand (e.g., XML schema generation for libvirt).
### Don't add an abstraction when:
- **It's just to avoid a few lines of boilerplate**. Copy-paste is sometimes better than a trait hierarchy.
- **You're anticipating future flexibility** that isn't needed today. YAGNI (You Aren't Gonna Need It).
- **The abstraction makes the code harder to understand** for someone unfamiliar with the codebase.
- **You're wrapping a single implementation**. A trait with one implementation is usually over-engineering.
### Signs you've over-abstracted:
- You need to explain the type system to a competent Rust developer for them to understand how to add a simple feature.
- Adding a new concrete type requires changes in multiple trait definitions.
- The word "factory" or "manager" appears in your type names.
- You have more trait definitions than concrete implementations.
### The Rule of Three for Traits
Before creating a new trait, ensure you have:
1. A clear, real use case (not hypothetical)
2. At least one concrete implementation
3. A plan for how callers will use it
Only generalize when the pattern is proven. The monitoring module is a good example: we had multiple alert senders (OKD, KubePrometheus, RHOB) before we introduced the `AlertSender` and `AlertReceiver<S>` traits. The traits emerged from real needs, not design sessions.
---
## Documentation
### Document the "why", not the "what"
Code should be self-explanatory for the "what". Comments and documentation should explain intent, rationale, and gotchas.
```rust
// Bad: restates the code
// Returns the number of VMs
fn vm_count(&self) -> usize { self.vms.len() }
// Good: explains the why
// Returns 0 if connection is lost, rather than erroring,
// because monitoring code uses this for health checks
fn vm_count(&self) -> usize { self.vms.len() }
```
### Keep examples in the `examples/` directory
Working code beats documentation. Every major feature should have a runnable example that demonstrates real usage.


@@ -28,6 +28,11 @@ Harmony's design is based on a few key concepts. Understanding them is the key t
- **What it is:** An **Inventory** is the physical material (the "what") used in a cluster. This is most relevant for bare-metal or on-premise topologies.
- **Example:** A list of nodes with their roles (control plane, worker), CPU, RAM, and network interfaces. For the `K8sAnywhereTopology`, the inventory might be empty or autoloaded, as the infrastructure is more abstract.
### 6. Configuration & Secrets
- **What it is:** Configuration represents the runtime data required to deploy your `Scores`. This includes both non-sensitive state (like cluster hostnames, deployment profiles) and sensitive secrets (like API keys, database passwords).
- **How it works:** See the [Configuration Concept Guide](./concepts/configuration.md) to understand Harmony's unified approach to managing schema in Git and state in OpenBao.
---
### How They Work Together (The Compile-Time Check)


@@ -0,0 +1,107 @@
# Configuration and Secrets
Harmony treats configuration and secrets as a single concern. Developers use one crate, `harmony_config`, to declare, store, and retrieve all runtime data — whether it is a public hostname or a database password.
## The mental model: schema in Git, state in the store
### Schema
In Harmony, the Rust code is the configuration schema. You declare what your module needs by defining a struct:
```rust
#[derive(Config, Serialize, Deserialize, JsonSchema, InteractiveParse)]
struct PostgresConfig {
    pub host: String,
    pub port: u16,
    #[config(secret)]
    pub password: String,
}
```
This struct is tracked in Git. When a branch adds a new field, Git tracks that the branch requires a new value. When a branch removes a field, the old value in the store becomes irrelevant. The struct is always authoritative.
### State
The actual values live in a config store — by default, OpenBao. No `.env` files, no JSON, no YAML in the repository.
When you run your code, Harmony reads the struct (schema) and resolves values from the store (state):
- If the store has the value, it is injected seamlessly.
- If the store does not have it, Harmony prompts you in the terminal. Your answer is pushed back to the store automatically.
- When a teammate runs the same code, they are not prompted — you already provided the value.
### How branch switching works
Because the schema is just Rust code tracked in Git, branch switching works naturally:
1. You check out `feat/redis`. The code now requires `RedisConfig`.
2. You run `cargo run`. Harmony detects that `RedisConfig` has no value in the store. It prompts you.
3. You provide the values. Harmony pushes them to OpenBao.
4. Your teammate checks out `feat/redis` and runs `cargo run`. No prompt — the values are already in the store.
5. You switch back to `main`. `RedisConfig` does not exist in that branch's code. The store entry is ignored.
## Secrets vs. standard configuration
From your application code, there is no difference. You always call `harmony_config::get_or_prompt::<T>()`.
The difference is in the struct definition:
```rust
// Standard config — stored in plaintext, displayed during prompting.
#[derive(Config)]
struct ClusterConfig {
    pub api_url: String,
    pub namespace: String,
}

// Contains a secret field — the entire struct is stored encrypted,
// and the password field is masked during terminal prompting.
#[derive(Config)]
struct DatabaseConfig {
    pub host: String,
    #[config(secret)]
    pub password: String,
}
```
If a struct contains any `#[config(secret)]` field, Harmony elevates the entire struct to `ConfigClass::Secret`. The storage backend decides what that means in practice — in the case of OpenBao, it may route the data to a path with stricter ACLs or audit policies.
## Authentication and team sharing
Harmony uses Zitadel (hosted at `sso.nationtech.io`) for identity and OpenBao (hosted at `secrets.nationtech.io`) for storage.
**First run on a new machine:**
1. Harmony detects that you are not logged in.
2. It prints a short code and URL to your terminal, and opens your browser if possible.
3. You log in with your corporate identity (Google, GitHub, or Microsoft Entra ID / Azure AD).
4. Harmony receives an OIDC token, exchanges it for an OpenBao token, and caches the session locally.
**Subsequent runs:**
- Harmony silently refreshes your tokens in the background. You do not need to log in again for up to 90 days of active use.
- If you are inactive for 30 days, or if an administrator revokes your access in Zitadel, you will be prompted to re-authenticate.
**Offboarding:**
Revoking a user in Zitadel immediately invalidates their ability to refresh tokens or obtain new ones. No manual secret rotation is required.
## Resolution chain
When Harmony resolves a config value, it tries sources in order:
1. **Environment variable** (`HARMONY_CONFIG_{KEY}`) — highest priority. Use this in CI/CD to override any value without touching the store.
2. **Config store** (OpenBao for teams, local file for solo/offline use) — the primary source for shared team state.
3. **Interactive prompt** — last resort. Prompts the developer and persists the answer back to the store.
## Schema versioning
The Rust struct is the single source of truth for what configuration looks like. If a developer renames or removes a field on a branch, the store may still contain data shaped for the old version of the struct. When another developer who does not have that change runs the code, deserialization will fail.
In the current implementation, this is handled gracefully: a deserialization failure is treated as a miss, and Harmony re-prompts. The new answer overwrites the stale entry.
A compile-time migration mechanism is planned for a future release to handle this more rigorously at scale.
## Offline and local development
If you are working offline or evaluating Harmony without a team OpenBao instance, the `StoreSource` falls back to a local file store at `~/.local/share/harmony/config/`. The developer experience is identical — prompting, caching, and resolution all work the same way. The only difference is that the state is local to your machine and not shared with teammates.


@@ -0,0 +1,135 @@
# Adding Capabilities
`Capabilities` are trait methods that a `Topology` exposes to Scores. They are the "how" — the specific APIs and features that let a Score translate intent into infrastructure actions.
## How Capabilities Work
When a Score declares it needs certain Capabilities:
```rust
impl<T: Topology + K8sclient + HelmCommand> Score<T> for MyScore {
    // ...
}
```
The compiler verifies that the target `Topology` implements both `K8sclient` and `HelmCommand`. If it doesn't, compilation fails. This is the compile-time safety check that prevents invalid configurations from reaching production.
## Built-in Capabilities
Harmony provides a set of standard Capabilities:
| Capability | What it provides |
|------------|------------------|
| `K8sclient` | A Kubernetes API client |
| `HelmCommand` | A configured `helm` CLI invocation |
| `TlsRouter` | TLS certificate management |
| `NetworkManager` | Host network configuration |
| `SwitchClient` | Network switch configuration |
| `CertificateManagement` | Certificate issuance via cert-manager |
## Implementing a Capability
Capabilities are implemented as trait methods on your Topology:
```rust
use std::sync::Arc;

use async_trait::async_trait;
use harmony::topology::K8sclient;
use harmony_k8s::K8sClient;

pub struct MyTopology {
    kubeconfig: Option<String>,
}

#[async_trait]
impl K8sclient for MyTopology {
    async fn k8s_client(&self) -> Result<Arc<K8sClient>, String> {
        let client = match &self.kubeconfig {
            Some(path) => K8sClient::from_kubeconfig(path).await?,
            None => K8sClient::try_default().await?,
        };
        Ok(Arc::new(client))
    }
}
```
## Adding a Custom Capability
For specialized infrastructure needs, add your own Capability as a trait:
```rust
use async_trait::async_trait;
use crate::executors::ExecutorError;
/// A capability for configuring network switches
#[async_trait]
pub trait SwitchClient: Send + Sync {
    async fn configure_port(
        &self,
        switch: &str,
        port: &str,
        vlan: u16,
    ) -> Result<(), ExecutorError>;

    async fn configure_port_channel(
        &self,
        switch: &str,
        name: &str,
        ports: &[&str],
    ) -> Result<(), ExecutorError>;
}
```
Then implement it on your Topology:
```rust
use harmony_infra::brocade::BrocadeClient;

pub struct MyTopology {
    switch_client: Arc<dyn SwitchClient>,
}

#[async_trait]
impl SwitchClient for MyTopology {
    async fn configure_port(&self, switch: &str, port: &str, vlan: u16) -> Result<(), ExecutorError> {
        self.switch_client.configure_port(switch, port, vlan).await
    }

    async fn configure_port_channel(&self, switch: &str, name: &str, ports: &[&str]) -> Result<(), ExecutorError> {
        self.switch_client.configure_port_channel(switch, name, ports).await
    }
}
```
Now Scores that need `SwitchClient` can run on `MyTopology`.
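From the Score side, the bound alone is the contract. The simplified (non-async) trait signatures below are assumptions used to keep the sketch self-contained; `TrunkPortScore` and `MockSwitchTopology` are hypothetical names:

```rust
trait Topology {}

trait SwitchClient {
    fn configure_port(&self, switch: &str, port: &str, vlan: u16);
}

trait Score<T: Topology> {
    fn name(&self) -> String;
}

struct TrunkPortScore {
    vlan: u16,
}

// This impl compiles only against topologies that provide SwitchClient.
impl<T: Topology + SwitchClient> Score<T> for TrunkPortScore {
    fn name(&self) -> String {
        format!("TrunkPortScore(vlan {})", self.vlan)
    }
}

struct MockSwitchTopology;
impl Topology for MockSwitchTopology {}
impl SwitchClient for MockSwitchTopology {
    fn configure_port(&self, _switch: &str, _port: &str, _vlan: u16) {}
}

fn main() {
    let score = TrunkPortScore { vlan: 100 };
    // The turbofish proves the capability bound is satisfied here;
    // a topology without SwitchClient would fail to compile.
    assert_eq!(
        <TrunkPortScore as Score<MockSwitchTopology>>::name(&score),
        "TrunkPortScore(vlan 100)"
    );
}
```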
## Capability Composition
Topologies often compose multiple Capabilities to support complex Scores:
```rust
pub struct HAClusterTopology {
    pub kubeconfig: Option<String>,
    pub router: Arc<dyn Router>,
    pub load_balancer: Arc<dyn LoadBalancer>,
    pub switch_client: Arc<dyn SwitchClient>,
    pub dhcp_server: Arc<dyn DhcpServer>,
    pub dns_server: Arc<dyn DnsServer>,
    // ...
}
impl K8sclient for HAClusterTopology { ... }
impl HelmCommand for HAClusterTopology { ... }
impl SwitchClient for HAClusterTopology { ... }
impl DhcpServer for HAClusterTopology { ... }
impl DnsServer for HAClusterTopology { ... }
impl Router for HAClusterTopology { ... }
impl LoadBalancer for HAClusterTopology { ... }
```
A Score that needs all of these can run on `HAClusterTopology` because the Topology provides all of them.
## Best Practices
- **Keep Capabilities focused** — one Capability per concern (Kubernetes client, Helm, switch config)
- **Return meaningful errors** — use specific error types so Scores can handle failures appropriately
- **Make Capabilities optional where sensible** — not every Topology needs every Capability; use `Option<T>` or a separate trait for optional features
- **Document preconditions** — if a Capability requires the infrastructure to be in a specific state, document it in the trait doc comments


@@ -0,0 +1,40 @@
# Developer Guide
This section covers how to extend Harmony by building your own `Score`, `Topology`, and `Capability` implementations.
## Writing a Score
A `Score` is a declarative description of desired state. To create your own:
1. Define a struct that represents your desired state
2. Implement the `Score<T>` trait, where `T` is your target `Topology`
3. Implement the `Interpret<T>` trait to define how the Score translates to infrastructure actions
See the [Writing a Score](./writing-a-score.md) guide for a step-by-step walkthrough.
## Writing a Topology
A `Topology` models your infrastructure environment. To create your own:
1. Define a struct that holds your infrastructure configuration
2. Implement the `Topology` trait
3. Implement the `Capability` traits your Score needs
See the [Writing a Topology](./writing-a-topology.md) guide for details.
## Adding Capabilities
`Capabilities` are the specific APIs or features a `Topology` exposes. They are the bridge between Scores and the actual infrastructure.
See the [Adding Capabilities](./adding-capabilities.md) guide for details on implementing and exposing Capabilities.
## Core Traits Reference
| Trait | Purpose |
|-------|---------|
| `Score<T>` | Declares desired state ("what") |
| `Topology` | Represents infrastructure ("where") |
| `Interpret<T>` | Execution logic ("how") |
| `Capability` | A feature exposed by a Topology |
See [Core Concepts](../concepts.md) for the conceptual foundation.


@@ -1,42 +1,230 @@
# Getting Started Guide
This guide walks you through deploying your first application with Harmony — a PostgreSQL cluster on a local Kubernetes cluster (K3D). By the end, you'll understand the core workflow: compile a Score, run it through the Harmony CLI, and verify the result.
## What you'll deploy
A fully functional PostgreSQL cluster running in a local K3D cluster, managed by the CloudNativePG operator. This demonstrates the full Harmony pattern:
1. Provision a local Kubernetes cluster (K3D)
2. Install the required operator (CloudNativePG)
3. Create a PostgreSQL cluster
4. Expose it as a Kubernetes Service
## Prerequisites
Before you begin, install the following tools:
- **Rust & Cargo:** [Install Rust](https://www.rust-lang.org/tools/install) (edition 2024)
- **Docker:** [Install Docker](https://docs.docker.com/get-docker/) (required for the local K3D cluster)
- **kubectl:** [Install kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/) (optional, for inspecting the cluster)
## Step 1: Clone and build
```bash
# Clone the repository
git clone https://git.nationtech.io/nationtech/harmony
cd harmony
# Build the project (this may take a few minutes on first run)
cargo build --release
```
...
## Step 2: Run the PostgreSQL example
```bash
cargo run -p example-postgresql
```
Harmony will output its progress as it:
1. **Creates a K3D cluster** named `harmony-postgres-example` (first run only)
2. **Installs the CloudNativePG operator** into the cluster
3. **Creates a PostgreSQL cluster** with 1 instance and 1 GiB of storage
4. **Prints connection details** for your new database
Expected output (abbreviated):
```
[+] Cluster created
[+] Installing CloudNativePG operator
[+] Creating PostgreSQL cluster
[+] PostgreSQL cluster is ready
Namespace: harmony-postgres-example
Service: harmony-postgres-example-rw
Username: postgres
Password: <stored in secret harmony-postgres-example-db-user>
```
## Step 3: Verify the deployment
Check that the PostgreSQL pods are running:
```bash
kubectl get pods -n harmony-postgres-example
```
You should see something like:
```
NAME READY STATUS RESTARTS AGE
harmony-postgres-example-1 1/1 Running 0 2m
```
Get the database password:
```bash
kubectl get secret -n harmony-postgres-example harmony-postgres-example-db-user -o jsonpath='{.data.password}' | base64 -d
```
## Step 4: Connect to the database
Forward the PostgreSQL port to your local machine:
```bash
kubectl port-forward -n harmony-postgres-example svc/harmony-postgres-example-rw 5432:5432
```
In another terminal, connect with `psql`:
```bash
psql -h localhost -p 5432 -U postgres
# Enter the password from Step 3 when prompted
```
Try a simple query:
```sql
SELECT version();
```
## Step 5: Clean up
To delete the PostgreSQL cluster and the local K3D cluster:
```bash
k3d cluster delete harmony-postgres-example
```
Alternatively, just delete the PostgreSQL cluster without removing K3D:
```bash
kubectl delete namespace harmony-postgres-example
```
## How it works
The example code (`examples/postgresql/src/main.rs`) is straightforward:
```rust
use harmony::{
inventory::Inventory,
modules::postgresql::{PostgreSQLScore, capability::PostgreSQLConfig},
topology::K8sAnywhereTopology,
};
#[tokio::main]
async fn main() {
let postgres = PostgreSQLScore {
config: PostgreSQLConfig {
cluster_name: "harmony-postgres-example".to_string(),
namespace: "harmony-postgres-example".to_string(),
..Default::default()
},
};
harmony_cli::run(
Inventory::autoload(),
K8sAnywhereTopology::from_env(),
vec![Box::new(postgres)],
None,
)
.await
.unwrap();
}
```
- **`Inventory::autoload()`** discovers the local environment (or uses an existing inventory)
- **`K8sAnywhereTopology::from_env()`** connects to K3D if `HARMONY_AUTOINSTALL=true` (the default), or to any Kubernetes cluster via `KUBECONFIG`
- **`harmony_cli::run(...)`** executes the Score against the Topology, managing the full lifecycle
## Connecting to an existing cluster
By default, Harmony provisions a local K3D cluster. To use an existing Kubernetes cluster instead:
```bash
export KUBECONFIG=/path/to/your/kubeconfig
export HARMONY_USE_LOCAL_K3D=false
export HARMONY_AUTOINSTALL=false
cargo run -p example-postgresql
```
## Troubleshooting
### Docker is not running
```
Error: could not create cluster: docker is not running
```
Start Docker and try again.
### K3D cluster creation fails
```
Error: failed to create k3d cluster
```
Ensure you have at least 2 CPU cores and 4 GiB of RAM available for Docker.
### `kubectl` cannot connect to the cluster
```
error: unable to connect to a kubernetes cluster
```
After Harmony creates the cluster, it writes the kubeconfig to `~/.kube/config` or to the path in `KUBECONFIG`. Verify:
```bash
kubectl cluster-info --context k3d-harmony-postgres-example
```
### Port forward fails
```
error: unable to forward port
```
Make sure no other process is using port 5432, or use a different local port:
```bash
kubectl port-forward -n harmony-postgres-example svc/harmony-postgres-example-rw 15432:5432
psql -h localhost -p 15432 -U postgres
```
## Next steps
- [Explore the Scores Catalog](../catalogs/scores.md): See what other Scores are available
- [Explore the Topologies Catalog](../catalogs/topologies.md): See what infrastructure Topologies are supported
- [Read the Core Concepts](../concepts.md): Understand the Score / Topology / Interpret pattern in depth
- [OKD on Bare Metal](../use-cases/okd-on-bare-metal.md): See a complete bare-metal deployment example
## Advanced examples
Once you're comfortable with the basics, these examples demonstrate more advanced use cases. Note that some require specific infrastructure (existing Kubernetes clusters, bare-metal hardware, or multi-cluster environments):
| Example | Description | Prerequisites |
|---------|-------------|---------------|
| `monitoring` | Deploy Prometheus alerting with Discord webhooks | Existing K8s cluster |
| `ntfy` | Deploy ntfy notification server | Existing K8s cluster |
| `tenant` | Create a multi-tenant namespace with quotas | Existing K8s cluster |
| `cert_manager` | Provision TLS certificates | Existing K8s cluster |
| `validate_ceph_cluster_health` | Check Ceph cluster health | Existing Rook/Ceph cluster |
| `okd_pxe` / `okd_installation` | Provision OKD on bare metal | HAClusterTopology, bare-metal hardware |
To run any example:
```bash
cargo run -p example-<example_name>
```


@@ -0,0 +1,158 @@
# Ingress Resources in Harmony
Harmony generates standard Kubernetes `networking.k8s.io/v1` Ingress resources. This ensures your deployments are portable across any Kubernetes distribution (vanilla K8s, OKD/OpenShift, K3s, etc.) without requiring vendor-specific configurations.
By default, Harmony does **not** set `spec.ingressClassName`. This allows the cluster's default ingress controller to automatically claim the resource, which is the correct approach for most single-controller clusters.
---
## TLS Configurations
There are two portable TLS modes for Ingress resources. Use only these in your Harmony deployments.
### 1. Plain HTTP (No TLS)
Omit the `tls` block entirely. The Ingress serves traffic over plain HTTP. Use this for local development or when TLS is terminated elsewhere (e.g., by a service mesh or external load balancer).
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: my-app
namespace: my-ns
spec:
rules:
- host: app.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: my-app
port:
number: 8080
```
### 2. HTTPS with a Named TLS Secret
Provide a `tls` block with both `hosts` and a `secretName`. The ingress controller will use that Secret for TLS termination. The Secret must be a `kubernetes.io/tls` type in the same namespace as the Ingress.
There are two ways to provide this Secret.
#### Option A: Manual Secret
Create the TLS Secret yourself before deploying the Ingress. This is suitable when certificates are issued outside the cluster or managed by another system.
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: my-app
namespace: my-ns
spec:
rules:
- host: app.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: my-app
port:
number: 8080
tls:
- hosts:
- app.example.com
secretName: app-example-com-tls
```
#### Option B: Automated via cert-manager (Recommended)
Add the `cert-manager.io/cluster-issuer` annotation to the Ingress. cert-manager will automatically perform the ACME challenge, generate the certificate, store it in the named Secret, and handle renewal. You do not create the Secret yourself.
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: my-app
namespace: my-ns
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
rules:
- host: app.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: my-app
port:
number: 8080
tls:
- hosts:
- app.example.com
secretName: app-example-com-tls
```
If you use a namespace-scoped `Issuer` instead of a `ClusterIssuer`, replace the annotation with `cert-manager.io/issuer: <name>`.
---
## Do Not Use: TLS Without `secretName`
Avoid TLS entries that omit `secretName`:
```yaml
# ⚠️ Non-portable — do not use
tls:
- hosts:
- app.example.com
```
Behavior for this pattern is **controller-specific and not portable**. On OKD/OpenShift, the ingress-to-route translation rejects it as incomplete. On other controllers, it may silently serve a self-signed fallback or fail in unpredictable ways. Harmony does not support this pattern.
---
## Prerequisites for cert-manager
To use automated certificates (Option B above):
1. **cert-manager** must be installed on the cluster.
2. A `ClusterIssuer` or `Issuer` must exist. A typical Let's Encrypt production issuer:
```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: letsencrypt-prod
spec:
acme:
server: https://acme-v02.api.letsencrypt.org/directory
email: team@example.com
privateKeySecretRef:
name: letsencrypt-prod-account-key
solvers:
- http01:
ingress: {}
```
3. **DNS must already resolve** to the cluster's ingress endpoint before the Ingress is created. The HTTP01 challenge requires this routing to be active.
For wildcard certificates (e.g. `*.example.com`), HTTP01 cannot be used — configure a DNS01 solver with credentials for your DNS provider instead.
---
## OKD / OpenShift Notes
On OKD, standard Ingress resources are automatically translated into OpenShift `Route` objects. The default TLS termination mode is `edge`, which is correct for most HTTP applications. To control this explicitly, add:
```yaml
annotations:
route.openshift.io/termination: edge # or passthrough / reencrypt
```
This annotation is ignored on non-OpenShift clusters and is safe to include unconditionally.


@@ -0,0 +1,211 @@
# Writing a Score
A `Score` declares _what_ you want to achieve. It is decoupled from _how_ it is achieved — that logic lives in an `Interpret`.
## The Pattern
A Score consists of two parts:
1. **A struct** — holds the configuration for your desired state
2. **A `Score<T>` implementation** — returns an `Interpret` that knows how to execute
An `Interpret` contains the actual execution logic and connects your Score to the capabilities exposed by a `Topology`.
## Example: A Simple Score
Here's a simplified version of `NtfyScore` from the `ntfy` module:
```rust
use async_trait::async_trait;
use harmony::{
interpret::{Interpret, InterpretError, Outcome},
inventory::Inventory,
score::Score,
topology::{HelmCommand, K8sclient, Topology},
};
/// MyScore declares "I want to install the ntfy server"
#[derive(Debug, Clone)]
pub struct MyScore {
pub namespace: String,
pub host: String,
}
impl<T: Topology + HelmCommand + K8sclient> Score<T> for MyScore {
fn create_interpret(&self) -> Box<dyn Interpret<T>> {
Box::new(MyInterpret { score: self.clone() })
}
fn name(&self) -> String {
"ntfy [MyScore]".into()
}
}
/// MyInterpret knows _how_ to install ntfy using the Topology's capabilities
#[derive(Debug)]
pub struct MyInterpret {
pub score: MyScore,
}
#[async_trait]
impl<T: Topology + HelmCommand + K8sclient> Interpret<T> for MyInterpret {
async fn execute(
&self,
inventory: &Inventory,
topology: &T,
) -> Result<Outcome, InterpretError> {
// 1. Get a Kubernetes client from the Topology
let client = topology.k8s_client().await?;
// 2. Use Helm to install the ntfy chart
// (via topology's HelmCommand capability)
// 3. Wait for the deployment to be ready
client
.wait_until_deployment_ready("ntfy", Some(&self.score.namespace), None)
.await?;
Ok(Outcome::success("ntfy installed".to_string()))
}
}
```
## The Compile-Time Safety Check
The generic `Score<T>` trait is bounded by `T: Topology`. This means the compiler enforces that your Score only runs on Topologies that expose the capabilities your Interpret needs:
```rust
// This only compiles if K8sAnywhereTopology (or any T)
// implements HelmCommand and K8sclient
impl<T: Topology + HelmCommand + K8sclient> Score<T> for MyScore { ... }
```
If you try to run this Score against a Topology that doesn't expose `HelmCommand`, you get a compile error — before any code runs.
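In miniature, with toy traits standing in for the real Harmony ones, the enforcement works like this:

```rust
// Simplified stand-ins for the Harmony traits, to show the mechanism only.
trait Topology { fn name(&self) -> String; }
trait HelmCommand { fn helm(&self) -> String; }

trait Score<T: Topology> { fn describe(&self) -> String; }

struct MyScore;
// MyScore only implements Score<T> when T also provides HelmCommand.
impl<T: Topology + HelmCommand> Score<T> for MyScore {
    fn describe(&self) -> String { "installs a chart".into() }
}

struct K8sTopo;       // provides Helm
struct BareMetalTopo; // does not

impl Topology for K8sTopo { fn name(&self) -> String { "k8s".into() } }
impl HelmCommand for K8sTopo { fn helm(&self) -> String { "helm".into() } }
impl Topology for BareMetalTopo { fn name(&self) -> String { "bare-metal".into() } }

fn run<T: Topology>(score: &dyn Score<T>, topo: &T) -> String {
    format!("{} on {}", score.describe(), topo.name())
}

fn main() {
    println!("{}", run(&MyScore, &K8sTopo)); // compiles
    // run(&MyScore, &BareMetalTopo);        // compile error: HelmCommand not satisfied
}
```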
## Using Your Score
Once defined, your Score integrates with the Harmony CLI:
```rust
use harmony::{
inventory::Inventory,
topology::K8sAnywhereTopology,
};
#[tokio::main]
async fn main() {
let my_score = MyScore {
namespace: "monitoring".to_string(),
host: "ntfy.example.com".to_string(),
};
harmony_cli::run(
Inventory::autoload(),
K8sAnywhereTopology::from_env(),
vec![Box::new(my_score)],
None,
)
.await
.unwrap();
}
```
## Key Patterns
### Composing Scores
Scores can include other Scores via features:
```rust
// Assuming the application handle is shareable (e.g. an Arc) across features
let application = Arc::new(my_webapp);
let app = ApplicationScore {
    features: vec![
        Box::new(PackagingDeployment { application: application.clone() }),
        Box::new(Monitoring { application: application.clone(), alert_receiver: vec![] }),
    ],
    application,
};
```
### Reusing Interpret Logic
Many Scores delegate to shared `Interpret` implementations. For example, `HelmChartScore` provides a reusable Interpret for any Helm-based deployment. Your Score can wrap it:
```rust
impl<T: Topology + HelmCommand> Score<T> for MyScore {
fn create_interpret(&self) -> Box<dyn Interpret<T>> {
Box::new(HelmChartInterpret { /* your config */ })
}
}
```
### Accessing Topology Capabilities
Your Interpret accesses infrastructure through Capabilities exposed by the Topology:
```rust
// Via the Topology trait directly
let k8s_client = topology.k8s_client().await?;
let helm = topology.get_helm_command();
// Or via Capability traits
impl<T: Topology + K8sclient> Interpret<T> for MyInterpret {
async fn execute(...) {
let client = topology.k8s_client().await?;
// use client...
}
}
```
## Design Principles
### Capabilities are industry concepts, not tools
A capability trait must represent a **standard infrastructure need** that could be fulfilled by multiple tools. The developer who writes a Score should not need to know which product provides the capability.
Good capabilities: `DnsServer`, `LoadBalancer`, `DhcpServer`, `CertificateManagement`, `Router`
These are industry-standard concepts. OPNsense provides `DnsServer` via Unbound; a future topology could provide it via CoreDNS or AWS Route53. The Score doesn't care.
The one exception is when the developer fundamentally needs to know the implementation: `PostgreSQL` is a capability (not `Database`) because the developer writes PostgreSQL-specific SQL, replication configs, and connection strings. Swapping it for MariaDB would break the application, not just the infrastructure.
**Test:** If you could swap the underlying tool without breaking any Score that uses the capability, you've drawn the boundary correctly. If swapping would require rewriting Scores, the capability is too tool-specific.
### One Score per concern, one capability per concern
A Score should express a single infrastructure intent. A capability should expose a single infrastructure concept.
If you're building a deployment that combines multiple concerns (e.g., "deploy Zitadel" requires PostgreSQL + Helm + K8s + Ingress), the Score **declares all of them as trait bounds** and the Topology provides them:
```rust
impl<T: Topology + K8sclient + HelmCommand + PostgreSQL> Score<T> for ZitadelScore
```
If you're building a tool that provides multiple capabilities (e.g., OpenBao provides secret storage, KV versioning, JWT auth, policy management), each capability should be a **separate trait** that can be implemented independently. This way, a Score that only needs secret storage doesn't pull in JWT auth machinery.
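A std-only sketch of that split, with hypothetical trait and type names: one backend implements several capability traits, but a Score bounds only on the one it needs.

```rust
// Two independent capability traits that one tool happens to provide.
trait SecretStorage { fn get_secret(&self, key: &str) -> Option<String>; }
trait JwtAuth { fn issue_token(&self, subject: &str) -> String; }

// A single backend (think "OpenBao") can implement both...
struct VaultBackend;
impl SecretStorage for VaultBackend {
    fn get_secret(&self, key: &str) -> Option<String> {
        Some(format!("secret-for-{key}"))
    }
}
impl JwtAuth for VaultBackend {
    fn issue_token(&self, subject: &str) -> String {
        format!("jwt-{subject}")
    }
}

// ...but a Score that only needs secrets bounds on SecretStorage alone,
// so it also works with backends that never implement JwtAuth.
fn fetch_db_password<T: SecretStorage>(t: &T) -> Option<String> {
    t.get_secret("db-password")
}

fn main() {
    println!("{:?}", fetch_db_password(&VaultBackend));
}
```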
### Scores encapsulate operational complexity
The value of a Score is turning tribal knowledge into compiled, type-checked infrastructure. The `ZitadelScore` knows that you need to create a namespace, deploy a PostgreSQL cluster via CNPG, wait for the cluster to be ready, create a masterkey secret, generate a secure admin password, detect the K8s distribution, build distribution-specific Helm values, and deploy the chart. A developer using it writes:
```rust
let zitadel = ZitadelScore { host: "sso.example.com".to_string(), ..Default::default() };
```
Move procedural complexity into opinionated Scores. This makes them easy to test against various topologies (k3d, OpenShift, kubeadm, bare metal) and easy to compose in high-level examples.
### Scores must be idempotent
Running a Score twice should produce the same result as running it once. Use create-or-update semantics, check for existing state before acting, and handle "already exists" responses gracefully.
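The create-or-update idea in miniature, with a `HashMap` standing in for cluster state (everything here is illustrative, not Harmony API):

```rust
use std::collections::HashMap;

// "apply" converges on the desired state instead of failing when the
// resource already exists — running it twice is safe.
fn apply(cluster: &mut HashMap<String, String>, name: &str, spec: &str) -> &'static str {
    match cluster.insert(name.to_string(), spec.to_string()) {
        None => "created",
        Some(prev) if prev == spec => "unchanged",
        Some(_) => "updated",
    }
}

fn main() {
    let mut cluster = HashMap::new();
    assert_eq!(apply(&mut cluster, "ntfy", "v1"), "created");
    // Running the same Score again is a no-op, not an error.
    assert_eq!(apply(&mut cluster, "ntfy", "v1"), "unchanged");
    assert_eq!(apply(&mut cluster, "ntfy", "v2"), "updated");
}
```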
### Scores must not depend on other Scores running first
A Score declares its capability requirements via trait bounds. It does **not** assume that another Score has run before it. If your Score needs PostgreSQL, it declares `T: PostgreSQL` and lets the Topology handle whether PostgreSQL needs to be installed first.
If you find yourself writing "run Score A, then run Score B", consider whether Score B should declare the capability that Score A provides, or whether both should be orchestrated by a higher-level Score that composes them.
## Best Practices
- **Keep Scores focused** — one Score per concern (deployment, monitoring, networking)
- **Use `..Default::default()`** for optional fields so callers only need to specify what they care about
- **Return `Outcome`** — use `Outcome::success`, `Outcome::failure`, or `Outcome::success_with_details` to communicate results clearly
- **Handle errors gracefully** — return meaningful `InterpretError` messages that help operators debug issues
- **Design capabilities around the developer's need** — not around the tool that fulfills it. Ask: "what is the core need that leads a developer to use this tool?"
- **Don't name capabilities after tools** — `SecretVault` not `OpenbaoStore`, `IdentityProvider` not `ZitadelAuth`


@@ -0,0 +1,176 @@
# Writing a Topology
A `Topology` models your infrastructure environment and exposes `Capability` traits that Scores use to interact with it. Where a Score declares _what_ you want, a Topology exposes _what_ it can do.
## The Minimum Implementation
At minimum, a Topology needs:
```rust
use async_trait::async_trait;
use harmony::{
topology::{PreparationError, PreparationOutcome, Topology},
};
#[derive(Debug, Clone)]
pub struct MyTopology {
pub name: String,
}
#[async_trait]
impl Topology for MyTopology {
fn name(&self) -> &str {
"MyTopology"
}
async fn ensure_ready(&self) -> Result<PreparationOutcome, PreparationError> {
// Verify the infrastructure is accessible and ready
Ok(PreparationOutcome::Success { details: "ready".to_string() })
}
}
```
## Implementing Capabilities
Scores express dependencies on Capabilities through trait bounds. For example, if your Topology should support Scores that deploy Helm charts, implement `HelmCommand`:
```rust
use std::process::Command;
use harmony::topology::HelmCommand;
impl HelmCommand for MyTopology {
fn get_helm_command(&self) -> Command {
let mut cmd = Command::new("helm");
if let Some(kubeconfig) = &self.kubeconfig {
cmd.arg("--kubeconfig").arg(kubeconfig);
}
cmd
}
}
```
For Scores that need a Kubernetes client, implement `K8sclient`:
```rust
use std::sync::Arc;
use harmony_k8s::K8sClient;
use harmony::topology::K8sclient;
#[async_trait]
impl K8sclient for MyTopology {
async fn k8s_client(&self) -> Result<Arc<K8sClient>, String> {
let client = if let Some(kubeconfig) = &self.kubeconfig {
K8sClient::from_kubeconfig(kubeconfig).await?
} else {
K8sClient::try_default().await?
};
Ok(Arc::new(client))
}
}
```
## Loading Topology from Environment
For flexibility, implement `from_env()` to read configuration from environment variables:
```rust
impl MyTopology {
pub fn from_env() -> Self {
Self {
name: std::env::var("MY_TOPOLOGY_NAME")
.unwrap_or_else(|_| "default".to_string()),
kubeconfig: std::env::var("KUBECONFIG").ok(),
}
}
}
```
This pattern lets operators switch between environments without recompiling:
```bash
export KUBECONFIG=/path/to/prod-cluster.kubeconfig
cargo run --example my_example
```
## Complete Example: K8sAnywhereTopology
The `K8sAnywhereTopology` is the most commonly used Topology and handles both local (K3D) and remote Kubernetes clusters:
```rust
pub struct K8sAnywhereTopology {
pub k8s_state: Arc<OnceCell<K8sState>>,
pub tenant_manager: Arc<OnceCell<TenantManager>>,
pub config: Arc<K8sAnywhereConfig>,
}
#[async_trait]
impl Topology for K8sAnywhereTopology {
fn name(&self) -> &str {
"K8sAnywhereTopology"
}
async fn ensure_ready(&self) -> Result<PreparationOutcome, PreparationError> {
// 1. If autoinstall is enabled and no cluster exists, provision K3D
// 2. Verify kubectl connectivity
// 3. Optionally wait for cluster operators to be ready
Ok(PreparationOutcome::Success { details: "cluster ready".to_string() })
}
}
```
## Key Patterns
### Lazy Initialization
Use `OnceCell` for expensive resources like Kubernetes clients:
```rust
pub struct K8sAnywhereTopology {
k8s_state: Arc<OnceCell<K8sState>>,
}
```
### Multi-Target Topologies
For Scores that span multiple clusters (like NATS supercluster), implement `MultiTargetTopology`:
```rust
pub trait MultiTargetTopology: Topology {
fn current_target(&self) -> &str;
fn set_target(&mut self, target: &str);
}
```
### Composing Topologies
Complex topologies combine multiple infrastructure components:
```rust
pub struct HAClusterTopology {
pub router: Arc<dyn Router>,
pub load_balancer: Arc<dyn LoadBalancer>,
pub firewall: Arc<dyn Firewall>,
pub dhcp_server: Arc<dyn DhcpServer>,
pub dns_server: Arc<dyn DnsServer>,
pub kubeconfig: Option<String>,
// ...
}
```
## Testing Your Topology
Test Topologies in isolation by implementing them against mock infrastructure:
```rust
#[cfg(test)]
mod tests {
use super::*;
#[tokio::test]
async fn test_topology_ensure_ready() {
let topo = MyTopology::from_env();
let result = topo.ensure_ready().await;
assert!(result.is_ok());
}
}
```


@@ -1,443 +0,0 @@
# Monitoring and Alerting in Harmony
Harmony provides a unified, type-safe approach to monitoring and alerting across Kubernetes, OpenShift, and bare-metal infrastructure. This guide explains the architecture and how to use it at different levels of abstraction.
## Overview
Harmony's monitoring module supports three distinct use cases:
| Level | Who Uses It | What It Provides |
|-------|-------------|------------------|
| **Cluster** | Cluster administrators | Full control over monitoring stack, cluster-wide alerts, external scrape targets |
| **Tenant** | Platform teams | Namespace-scoped monitoring in multi-tenant environments |
| **Application** | Application developers | Zero-config monitoring that "just works" |
Each level builds on the same underlying abstractions, ensuring consistency while providing appropriate complexity for each audience.
## Core Concepts
### AlertSender
An `AlertSender` represents the system that evaluates alert rules and sends notifications. Harmony supports multiple monitoring stacks:
| Sender | Description | Use When |
|--------|-------------|----------|
| `OpenshiftClusterAlertSender` | OKD/OpenShift built-in monitoring | Running on OKD/OpenShift |
| `KubePrometheus` | kube-prometheus-stack via Helm | Standard Kubernetes, need full stack |
| `Prometheus` | Standalone Prometheus | Custom Prometheus deployment |
| `RedHatClusterObservability` | RHOB operator | Red Hat managed clusters |
| `Grafana` | Grafana-managed alerting | Grafana as primary alerting layer |
### AlertReceiver
An `AlertReceiver` defines where alerts are sent (Discord, Slack, email, webhook, etc.). Receivers are parameterized by sender type because each monitoring stack has different configuration formats.
```rust
pub trait AlertReceiver<S: AlertSender> {
fn build(&self) -> Result<ReceiverInstallPlan, InterpretError>;
fn name(&self) -> String;
}
```
Built-in receivers:
- `DiscordReceiver` - Discord webhooks
- `WebhookReceiver` - Generic HTTP webhooks
### AlertRule
An `AlertRule` defines a Prometheus alert expression. Rules are also parameterized by sender to handle different CRD formats.
```rust
pub trait AlertRule<S: AlertSender> {
fn build_rule(&self) -> Result<serde_json::Value, InterpretError>;
fn name(&self) -> String;
}
```
### Observability Capability
Topologies implement `Observability<S>` to indicate they support a specific alert sender:
```rust
impl Observability<OpenshiftClusterAlertSender> for K8sAnywhereTopology {
async fn install_receivers(&self, sender, inventory, receivers) { ... }
async fn install_rules(&self, sender, inventory, rules) { ... }
// ...
}
```
This provides **compile-time verification**: if you try to use `OpenshiftClusterAlertScore` with a topology that doesn't implement `Observability<OpenshiftClusterAlertSender>`, the code won't compile.
---
## Level 1: Cluster Monitoring
Cluster monitoring is for administrators who need full control over the monitoring infrastructure. This includes:
- Installing/managing the monitoring stack
- Configuring cluster-wide alert receivers
- Defining cluster-level alert rules
- Adding external scrape targets (e.g., bare-metal servers, firewalls)
### Example: OKD Cluster Alerts
```rust
use harmony::{
modules::monitoring::{
alert_channel::discord_alert_channel::DiscordReceiver,
alert_rule::{alerts::k8s::pvc::high_pvc_fill_rate_over_two_days, prometheus_alert_rule::AlertManagerRuleGroup},
okd::openshift_cluster_alerting_score::OpenshiftClusterAlertScore,
scrape_target::prometheus_node_exporter::PrometheusNodeExporter,
},
topology::{K8sAnywhereTopology, monitoring::{AlertMatcher, AlertRoute, MatchOp}},
};
let severity_matcher = AlertMatcher {
label: "severity".to_string(),
operator: MatchOp::Eq,
value: "critical".to_string(),
};
let rule_group = AlertManagerRuleGroup::new(
"cluster-rules",
vec![high_pvc_fill_rate_over_two_days()],
);
let external_exporter = PrometheusNodeExporter {
job_name: "firewall".to_string(),
metrics_path: "/metrics".to_string(),
listen_address: ip!("192.168.1.1"),
port: 9100,
..Default::default()
};
harmony_cli::run(
Inventory::autoload(),
K8sAnywhereTopology::from_env(),
vec![Box::new(OpenshiftClusterAlertScore {
sender: OpenshiftClusterAlertSender,
receivers: vec![Box::new(DiscordReceiver {
name: "critical-alerts".to_string(),
url: hurl!("https://discord.com/api/webhooks/..."),
route: AlertRoute {
matchers: vec![severity_matcher],
..AlertRoute::default("critical-alerts".to_string())
},
})],
rules: vec![Box::new(rule_group)],
scrape_targets: Some(vec![Box::new(external_exporter)]),
})],
None,
).await?;
```
### What This Does
1. **Enables cluster monitoring** - Activates OKD's built-in Prometheus
2. **Enables user workload monitoring** - Allows namespace-scoped rules
3. **Configures Alertmanager** - Adds Discord receiver with route matching
4. **Deploys alert rules** - Creates `AlertingRule` CRD with PVC fill rate alert
5. **Adds external scrape target** - Configures Prometheus to scrape the firewall
### Compile-Time Safety
The `OpenshiftClusterAlertScore` requires:
```rust
impl<T: Topology + Observability<OpenshiftClusterAlertSender>> Score<T>
for OpenshiftClusterAlertScore
```
If `K8sAnywhereTopology` didn't implement `Observability<OpenshiftClusterAlertSender>`, this code would fail to compile. You cannot accidentally deploy OKD alerts to a cluster that doesn't support them.
---
## Level 2: Tenant Monitoring
In multi-tenant clusters, teams are often confined to specific namespaces. Tenant monitoring adapts to this constraint:
- Resources are deployed in the tenant's namespace
- Cannot modify cluster-level monitoring configuration
- The topology determines namespace context at runtime
### How It Works
The topology's `Observability` implementation handles tenant scoping:
```rust
impl Observability<KubePrometheus> for K8sAnywhereTopology {
async fn install_rules(&self, sender, inventory, rules) {
// Topology knows if it's tenant-scoped
let namespace = self.get_tenant_config().await
.map(|t| t.name)
.unwrap_or_else(|| "monitoring".to_string());
// Rules are installed in the appropriate namespace
for rule in rules.unwrap_or_default() {
let score = KubePrometheusRuleScore {
sender: sender.clone(),
rule,
namespace: namespace.clone(), // Tenant namespace
};
score.create_interpret().execute(inventory, self).await?;
}
}
}
```
### Tenant vs Cluster Resources
| Resource | Cluster-Level | Tenant-Level |
|----------|---------------|--------------|
| Alertmanager config | Global receivers | Namespaced receivers (where supported) |
| PrometheusRules | Cluster-wide alerts | Namespace alerts only |
| ServiceMonitors | Any namespace | Own namespace only |
| External scrape targets | Can add | Cannot add (cluster config) |
### Runtime Validation
Tenant constraints are validated at runtime via Kubernetes RBAC. If a tenant-scoped deployment attempts cluster-level operations, it fails with a clear permission error from the Kubernetes API.
This cannot be fully compile-time because tenant context is determined by who's running the code and what permissions they have—information only available at runtime.
---
## Level 3: Application Monitoring
Application monitoring provides zero-config, opinionated monitoring for developers. Just add the `Monitoring` feature to your application and it works.
### Example
```rust
use harmony::modules::{
    application::{Application, ApplicationFeature},
    monitoring::alert_channel::webhook_receiver::WebhookReceiver,
};

// Define your application (wrapped in Arc so it can be shared with the feature)
let my_app = Arc::new(MyApplication::new());

// Add monitoring as a feature
let monitoring = Monitoring {
    application: my_app.clone(),
    alert_receiver: vec![], // Uses defaults
};

// Install with the application
my_app.add_feature(monitoring);
```
### What Application Monitoring Provides
1. **Automatic ServiceMonitor** - Creates a ServiceMonitor for your application's pods
2. **Ntfy Notification Channel** - Auto-installs and configures Ntfy for push notifications
3. **Tenant Awareness** - Automatically scopes to the correct namespace
4. **Sensible Defaults** - Pre-configured alert routes and receivers
### Under the Hood
```rust
impl<T: Topology + Observability<Prometheus> + TenantManager>
    ApplicationFeature<T> for Monitoring
{
    async fn ensure_installed(&self, topology: &T) -> Result<...> {
        // 1. Get tenant namespace (or use app name)
        let namespace = topology.get_tenant_config().await
            .map(|ns| ns.name.clone())
            .unwrap_or_else(|| self.application.name());

        // 2. Create ServiceMonitor for the app
        let app_service_monitor = ServiceMonitor {
            metadata: ObjectMeta {
                name: Some(self.application.name()),
                namespace: Some(namespace.clone()),
                ..Default::default()
            },
            spec: ServiceMonitorSpec::default(),
        };

        // 3. Install Ntfy for notifications
        let ntfy = NtfyScore { namespace, host };
        ntfy.interpret(&Inventory::empty(), topology).await?;

        // 4. Wire up webhook receiver to Ntfy
        let ntfy_receiver = WebhookReceiver { ... };

        // 5. Execute monitoring score
        alerting_score.interpret(&Inventory::empty(), topology).await?;
    }
}
```
---
## Pre-Built Alert Rules
Harmony provides a library of common alert rules in `modules/monitoring/alert_rule/alerts/`:
### Kubernetes Alerts (`alerts/k8s/`)
```rust
use harmony::modules::monitoring::alert_rule::alerts::k8s::{
    pod::pod_failed,
    pvc::high_pvc_fill_rate_over_two_days,
    memory_usage::alert_high_memory_usage,
};

let rules = AlertManagerRuleGroup::new("k8s-rules", vec![
    pod_failed(),
    high_pvc_fill_rate_over_two_days(),
    alert_high_memory_usage(),
]);
```
Available rules:
- `pod_failed()` - Pod in failed state
- `alert_container_restarting()` - Container restart loop
- `alert_pod_not_ready()` - Pod not ready for extended period
- `high_pvc_fill_rate_over_two_days()` - PVC will fill within 2 days
- `alert_high_memory_usage()` - Memory usage above threshold
- `alert_high_cpu_usage()` - CPU usage above threshold
### Infrastructure Alerts (`alerts/infra/`)
```rust
use harmony::modules::monitoring::alert_rule::alerts::infra::opnsense::high_http_error_rate;
let rules = AlertManagerRuleGroup::new("infra-rules", vec![
    high_http_error_rate(),
]);
```
### Creating Custom Rules
```rust
use harmony::modules::monitoring::alert_rule::prometheus_alert_rule::PrometheusAlertRule;
pub fn my_custom_alert() -> PrometheusAlertRule {
    PrometheusAlertRule::new("MyServiceDown", "up{job=\"my-service\"} == 0")
        .for_duration("5m")
        .label("severity", "critical")
        .annotation("summary", "My service is down")
        .annotation("description", "The my-service job has been down for more than 5 minutes")
}
```
---
## Alert Receivers
### Discord Webhook
```rust
use harmony::modules::monitoring::alert_channel::discord_alert_channel::DiscordReceiver;
use harmony::topology::monitoring::{AlertRoute, AlertMatcher, MatchOp};
let discord = DiscordReceiver {
    name: "ops-alerts".to_string(),
    url: hurl!("https://discord.com/api/webhooks/123456/abcdef"),
    route: AlertRoute {
        receiver: "ops-alerts".to_string(),
        matchers: vec![AlertMatcher {
            label: "severity".to_string(),
            operator: MatchOp::Eq,
            value: "critical".to_string(),
        }],
        group_by: vec!["alertname".to_string()],
        repeat_interval: Some("30m".to_string()),
        continue_matching: false,
        children: vec![],
    },
};
```
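The `matchers` on a route behave like Alertmanager label matchers: an alert is handed to this receiver only when every matcher holds against the alert's labels. A self-contained sketch of that semantics (simplified stand-in types; the real routing is executed by the monitoring stack, not by Harmony):

```rust
use std::collections::HashMap;

// Simplified stand-ins for AlertMatcher/MatchOp to illustrate the semantics.
enum MatchOp { Eq, Ne }
struct AlertMatcher { label: String, operator: MatchOp, value: String }

/// An alert matches a route only if every matcher holds against its labels.
fn route_matches(matchers: &[AlertMatcher], labels: &HashMap<String, String>) -> bool {
    matchers.iter().all(|m| match (&m.operator, labels.get(&m.label)) {
        (MatchOp::Eq, Some(v)) => v == &m.value,
        (MatchOp::Eq, None) => false,
        (MatchOp::Ne, Some(v)) => v != &m.value,
        (MatchOp::Ne, None) => true,
    })
}
```

With the route above, an alert labelled `severity=critical` is delivered to `ops-alerts`, while `severity=warning` falls through to other routes.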
### Generic Webhook
```rust
use harmony::modules::monitoring::alert_channel::webhook_receiver::WebhookReceiver;
let webhook = WebhookReceiver {
    name: "custom-webhook".to_string(),
    url: hurl!("https://api.example.com/alerts"),
    route: AlertRoute::default("custom-webhook".to_string()),
};
```
---
## Adding a New Monitoring Stack
To add support for a new monitoring stack:
1. **Create the sender type** in `modules/monitoring/my_sender/mod.rs`:
```rust
#[derive(Debug, Clone)]
pub struct MySender;

impl AlertSender for MySender {
    fn name(&self) -> String { "MySender".to_string() }
}
```
2. **Define CRD types** in `modules/monitoring/my_sender/crd/`:
```rust
#[derive(CustomResource, Debug, Serialize, Deserialize, Clone)]
#[kube(group = "monitoring.example.com", version = "v1", kind = "MyAlertRule")]
pub struct MyAlertRuleSpec { ... }
```
3. **Implement Observability** in `domain/topology/k8s_anywhere/observability/my_sender.rs`:
```rust
impl Observability<MySender> for K8sAnywhereTopology {
    async fn install_receivers(&self, sender, inventory, receivers) { ... }
    async fn install_rules(&self, sender, inventory, rules) { ... }
    // ...
}
```
4. **Implement receiver conversions** for existing receivers:
```rust
impl AlertReceiver<MySender> for DiscordReceiver {
    fn build(&self) -> Result<ReceiverInstallPlan, InterpretError> {
        // Convert DiscordReceiver to MySender's format
    }
}
```
5. **Create score types**:
```rust
pub struct MySenderAlertScore {
    pub sender: MySender,
    pub receivers: Vec<Box<dyn AlertReceiver<MySender>>>,
    pub rules: Vec<Box<dyn AlertRule<MySender>>>,
}
```
---
## Architecture Principles
### Type Safety Over Flexibility
Each monitoring stack has distinct CRDs and configuration formats. Rather than a unified "MonitoringStack" type that loses stack-specific features, we use generic traits that provide type safety while allowing each stack to express its unique configuration.
### Compile-Time Capability Verification
The `Observability<S>` bound ensures you can't deploy OKD alerts to a KubePrometheus cluster. The compiler catches platform mismatches before deployment.
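A minimal sketch of the mechanism (illustrative names, heavily simplified from Harmony's real traits, which are async and carry inventories and receivers): the deploy function only accepts topologies that implement the capability for the chosen sender, so a missing `Observability` impl is a compile error, not a runtime failure.

```rust
// Simplified model of the Observability<S> bound.
trait AlertSender { fn name() -> &'static str; }
struct KubePrometheus;
impl AlertSender for KubePrometheus { fn name() -> &'static str { "KubePrometheus" } }

trait Observability<S: AlertSender> { fn stack(&self) -> String; }

struct LocalTopology;
impl Observability<KubePrometheus> for LocalTopology {
    fn stack(&self) -> String { "kube-prometheus".to_string() }
}

// Only callable when T supports sender S; otherwise the call site fails to compile.
fn deploy_alerts<S: AlertSender, T: Observability<S>>(topology: &T) -> String {
    format!("deploying {} rules to {}", S::name(), topology.stack())
}
```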
### Explicit Over Implicit
Monitoring stacks are chosen explicitly (`OpenshiftClusterAlertSender` vs `KubePrometheus`). There's no "auto-detection" that could lead to surprising behavior.
### Three Levels, One Foundation
Cluster, tenant, and application monitoring all use the same traits (`AlertSender`, `AlertReceiver`, `AlertRule`). The difference is in how scores are constructed and how topologies interpret them.
---
## Related Documentation
- [ADR-020: Monitoring and Alerting Architecture](../adr/020-monitoring-alerting-architecture.md)
- [ADR-013: Monitoring Notifications (ntfy)](../adr/013-monitoring-notifications.md)
- [ADR-011: Multi-Tenant Cluster Architecture](../adr/011-multi-tenant-cluster.md)
- [Coding Guide](coding-guide.md)
- [Core Concepts](concepts.md)

docs/one_liners.md

# Handy one liners for infrastructure management
### Delete all evicted pods from a cluster
```sh
kubectl get po -A | grep Evic | awk '{ print "-n " $1 " " $2 }' | xargs -L 1 kubectl delete po
```
> Pods are evicted when the node they are running on lacks the resources to keep them running. The most common case is ephemeral storage filling up because of something like a log file growing too large.
>
> It could also happen because of memory or cpu pressure due to unpredictable workloads.
>
> This means it is generally ok to delete them.
>
> However, in a perfectly configured deployment and cluster, pods should rarely, if ever, get evicted. For example, logging should be configured not to consume too much space, or the deployment should reserve the correct amount of ephemeral storage.
>
> Note that deleting evicted pods does not solve the underlying issue; make sure to understand why the pod was evicted in the first place and put the proper fix in place.
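For reference, the `awk` stage above rewrites each `kubectl get po -A` row (namespace in column 1, pod name in column 2) into the `-n <namespace> <pod>` arguments that `kubectl delete po` expects. You can preview the transformation on a sample row without touching a cluster:

```sh
# What the awk stage produces for one sample row of "kubectl get po -A"
printf 'default my-pod-abc 0/1 Evicted 0 2d\n' \
  | awk '{ print "-n " $1 " " $2 }'
# -n default my-pod-abc
```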

docs/use-cases/README.md

# Use Cases
Real-world scenarios demonstrating Harmony in action.
## Available Use Cases
### [OPNsense VM Integration](./opnsense-vm-integration.md)
Boot a real OPNsense firewall in a local KVM VM and configure it entirely through Harmony — load balancer, DHCP, TFTP, VLANs, firewall rules, NAT, VIPs, and link aggregation. Fully automated, zero manual steps. The best way to see Harmony in action.
### [PostgreSQL on Local K3D](./postgresql-on-local-k3d.md)
Deploy a fully functional PostgreSQL cluster on a local K3D cluster in under 10 minutes. The quickest way to see Harmony's Kubernetes capabilities.
### [OKD on Bare Metal](./okd-on-bare-metal.md)
A complete walkthrough of bootstrapping a high-availability OKD cluster from physical hardware. Covers inventory discovery, bootstrap, control plane, and worker provisioning.
---
_These use cases are community-tested scenarios. For questions or contributions, open an issue on the [Harmony repository](https://git.nationtech.io/NationTech/harmony/issues)._

# Use Case: OKD on Bare Metal
Provision a production-grade OKD (OpenShift Kubernetes Distribution) cluster from physical hardware using Harmony. This use case covers the full lifecycle: hardware discovery, bootstrap, control plane, workers, and post-install validation.
## What you'll have at the end
A highly-available OKD cluster with:
- 3 control plane nodes
- 2+ worker nodes
- Network bonding configured on nodes and switches
- Load balancer routing API and ingress traffic
- DNS and DHCP services for the cluster
- Post-install health validation
## Target hardware model
This setup assumes a typical lab environment:
```
┌───────────────────────────────────────────────────────┐
│  Network 192.168.x.0/24 (flat, DHCP + PXE capable)    │
│                                                       │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐             │
│  │   cp0    │  │   cp1    │  │   cp2    │  (control)  │
│  └──────────┘  └──────────┘  └──────────┘             │
│  ┌──────────┐  ┌──────────┐                           │
│  │   wk0    │  │   wk1    │   ...  (workers)          │
│  └──────────┘  └──────────┘                           │
│  ┌──────────┐                                         │
│  │ bootstrap│  (temporary, can be repurposed)         │
│  └──────────┘                                         │
│                                                       │
│  ┌──────────┐  ┌──────────┐                           │
│  │ firewall │  │  switch  │  (OPNsense + Brocade)     │
│  └──────────┘  └──────────┘                           │
└───────────────────────────────────────────────────────┘
```
## Required infrastructure
Harmony models this as an `HAClusterTopology`, which requires these capabilities:
| Capability | Implementation |
|------------|---------------|
| **Router** | OPNsense firewall |
| **Load Balancer** | OPNsense HAProxy |
| **Firewall** | OPNsense |
| **DHCP Server** | OPNsense |
| **TFTP Server** | OPNsense |
| **HTTP Server** | OPNsense |
| **DNS Server** | OPNsense |
| **Node Exporter** | Prometheus node_exporter on OPNsense |
| **Switch Client** | Brocade SNMP |
See `examples/okd_installation/` for a reference topology implementation.
## The Provisioning Pipeline
Harmony orchestrates OKD installation in ordered stages:
### Stage 1: Inventory Discovery (`OKDSetup01InventoryScore`)
Harmony boots all nodes via PXE into a CentOS Stream live environment, runs an inventory agent on each, and collects:
- MAC addresses and NIC details
- IP addresses assigned by DHCP
- Hardware profile (CPU, RAM, storage)
This is the "discovery-first" approach: no pre-configuration required on nodes.
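The collected data above can be pictured as a per-node record roughly like the following. Field and type names are a hedged sketch for illustration, not Harmony's real inventory model:

```rust
use std::net::Ipv4Addr;

// Illustrative shape of what Stage 1 assembles per discovered node.
#[derive(Debug)]
struct NicInfo { name: String, mac: String }

#[derive(Debug)]
struct DiscoveredHost {
    nics: Vec<NicInfo>,       // MAC addresses and NIC details
    dhcp_ip: Ipv4Addr,        // address assigned by DHCP during discovery
    cpu_cores: u32,           // hardware profile: CPU
    ram_gib: u32,             // hardware profile: RAM
    disks_gib: Vec<u64>,      // hardware profile: storage
}
```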
### Stage 2: Bootstrap Node (`OKDSetup02BootstrapScore`)
The user selects one discovered node to serve as the bootstrap node. Harmony:
- Renders per-MAC iPXE boot configuration with OKD 4.19 SCOS live assets + ignition
- Reboots the bootstrap node via SSH
- Waits for the bootstrap process to complete (API server becomes available)
### Stage 3: Control Plane (`OKDSetup03ControlPlaneScore`)
With bootstrap complete, Harmony provisions the control plane nodes:
- Renders per-MAC iPXE for each control plane node
- Reboots each node via SSH and waits for it to join the cluster
- Applies network bond configuration via NMState MachineConfig where relevant
### Stage 4: Network Bonding (`OKDSetupPersistNetworkBondScore`)
Configures LACP bonds on nodes and corresponding port-channels on the switch stack for high-availability.
### Stage 5: Worker Nodes (`OKDSetup04WorkersScore`)
Provisions worker nodes similarly to control plane, joining them to the cluster.
### Stage 6: Sanity Check (`OKDSetup05SanityCheckScore`)
Validates:
- API server is reachable
- Ingress controller is operational
- Cluster operators are healthy
- SDN (software-defined networking) is functional
### Stage 7: Installation Report (`OKDSetup06InstallationReportScore`)
Produces a machine-readable JSON report and human-readable summary of the installation.
## Network notes
**During discovery:** Ports must be in access mode (no LACP). DHCP succeeds; iPXE loads CentOS Stream live with Kickstart and starts the inventory endpoint.
**During provisioning:** After SCOS is on disk and Ignition/MachineConfig can be applied, bonds are set persistently. This avoids the PXE/DHCP recovery race condition that occurs if bonding is configured too early.
**PXE limitation:** The generic discovery path cannot use bonded networks for PXE boot because the DHCP recovery process conflicts with bond formation.
## Configuration knobs
When using `OKDInstallationPipeline`, configure these domains:
| Parameter | Example | Description |
|-----------|---------|-------------|
| `public_domain` | `apps.example.com` | Wildcard domain for application ingress |
| `internal_domain` | `cluster.local` | Internal cluster DNS domain |
## Running the example
See `examples/okd_installation/` for a complete reference. The topology must be configured with your infrastructure details:
```bash
# Configure the example with your hardware/network specifics
# See examples/okd_installation/src/topology.rs
cargo run -p example-okd_installation
```
This example requires:
- Physical hardware configured as described above
- OPNsense firewall with SSH access
- Brocade switch with SNMP access
- All nodes connected to the same Layer 2 network
## Post-install
After the cluster is bootstrapped, `~/.kube/config` is updated with the cluster credentials. Verify:
```bash
kubectl get nodes
kubectl get pods -n openshift-monitoring
oc get routes -n openshift-console
```
## Next steps
- Enable monitoring with `PrometheusAlertScore` or `OpenshiftClusterAlertScore`
- Configure TLS certificates with `CertManagerHelmScore`
- Add storage with Rook Ceph
- Scale workers with `OKDSetup04WorkersScore`
## Further reading
- [OKD Installation Module](../../harmony/src/modules/okd/installation.rs) — source of truth for pipeline stages
- [HAClusterTopology](../../harmony/src/domain/topology/ha_cluster.rs) — infrastructure capability model
- [Scores Catalog](../catalogs/scores.md) — all available Scores including OKD-specific ones

# Use Case: OPNsense VM Integration
Boot a real OPNsense firewall in a local KVM virtual machine and configure it entirely through Harmony — load balancer, DHCP, TFTP, VLANs, firewall rules, NAT, VIPs, and link aggregation. Fully automated, zero manual steps, CI-friendly.
This is the best way to discover Harmony: you'll see 11 different Scores configure a production firewall through type-safe Rust code and the OPNsense REST API.
## What you'll have at the end
A local OPNsense VM fully configured by Harmony with:
- HAProxy load balancer with health-checked backends
- DHCP server with static host bindings and PXE boot options
- TFTP server serving boot files
- Prometheus node exporter enabled
- 2 VLANs on the LAN interface
- Firewall filter rules, outbound NAT, and bidirectional NAT
- Virtual IPs (IP aliases)
- Port forwarding (DNAT) rules
- LAGG interface (link aggregation)
All applied idempotently through the OPNsense REST API — the same Scores used in production bare-metal deployments.
## Prerequisites
- **Linux** with KVM support (Intel VT-x/AMD-V enabled in BIOS)
- **libvirt + QEMU** installed and running (`libvirtd` service active)
- **~10 GB** free disk space
- **~15 minutes** for the first run (image download + OPNsense firmware update)
- Docker running (if installed — the setup handles compatibility)
Supported distributions: Arch, Manjaro, Fedora, Ubuntu, Debian.
## Quick start (single command)
```bash
# One-time: install libvirt and configure permissions
./examples/opnsense_vm_integration/setup-libvirt.sh
newgrp libvirt
# Verify
cargo run -p opnsense-vm-integration -- --check
# Boot + bootstrap + run all 11 Scores (fully unattended)
cargo run -p opnsense-vm-integration -- --full
```
That's it. No browser clicks, no manual SSH configuration, no wizard interaction.
## What happens step by step
### Phase 1: Boot the VM
Downloads the OPNsense 26.1 nano image (~350 MB, cached after first run), injects a `config.xml` with virtio NIC assignments, creates a 4 GiB qcow2 disk, and boots the VM with 4 NICs:
```
vtnet0 = LAN (192.168.1.1/24)  -- management
vtnet1 = WAN (DHCP)            -- internet access
vtnet2 = LAGG member 1         -- for aggregation test
vtnet3 = LAGG member 2         -- for aggregation test
```
### Phase 2: Automated bootstrap
Once the web UI responds (~20 seconds after boot), `OPNsenseBootstrap` takes over:
1. **Logs in** to the web UI (root/opnsense) with automatic CSRF token handling
2. **Aborts the initial setup wizard** via the OPNsense API
3. **Enables SSH** with root login and password authentication
4. **Changes the web GUI port** to 9443 (prevents HAProxy conflicts on standard ports)
5. **Restarts lighttpd** via SSH to apply the port change
No browser, no Playwright, no expect scripts — just HTTP requests with session cookies and SSH commands.
### Phase 3: Run 11 Scores
Creates an API key via SSH, then configures the entire firewall:
| # | Score | What it configures |
|---|-------|--------------------|
| 1 | `LoadBalancerScore` | HAProxy with 2 frontends (ports 16443 and 18443), backends with health checks |
| 2 | `DhcpScore` | DHCP range, 2 static host bindings (MAC-to-IP), PXE boot options |
| 3 | `TftpScore` | TFTP server serving PXE boot files |
| 4 | `NodeExporterScore` | Prometheus node exporter on OPNsense |
| 5 | `VlanScore` | 2 test VLANs (tags 100 and 200) on vtnet0 |
| 6 | `FirewallRuleScore` | Firewall filter rules (allow/block with logging) |
| 7 | `OutboundNatScore` | Source NAT rule for outbound traffic |
| 8 | `BinatScore` | Bidirectional 1:1 NAT |
| 9 | `VipScore` | Virtual IPs (IP aliases for CARP/HA) |
| 10 | `DnatScore` | Port forwarding rules |
| 11 | `LaggScore` | Link aggregation group (failover on vtnet2+vtnet3) |
Each Score reports its status:
```
[LoadBalancerScore] SUCCESS in 2.2s -- Load balancer configured 2 services
[DhcpScore] SUCCESS in 1.4s -- Dhcp Interpret execution successful
[VlanScore] SUCCESS in 0.2s -- Configured 2 VLANs
...
PASSED -- All OPNsense integration tests successful
```
### Phase 4: Verify
After all Scores run, the integration test verifies each configuration via the REST API:
- HAProxy has 2+ frontends
- Dnsmasq has 2+ static hosts and a DHCP range
- TFTP is enabled
- Node exporter is enabled
- 2+ VLANs exist
- Firewall filter rules are present
- VIPs, DNAT, BINAT, SNAT rules are configured
- LAGG interface exists
## Explore in the web UI
After the test completes, open https://192.168.1.1:9443 (login: root/opnsense) and explore:
- **Services > HAProxy > Settings** -- frontends, backends, servers with health checks
- **Services > Dnsmasq DNS > Settings** -- host overrides (static DHCP entries)
- **Services > TFTP** -- enabled with uploaded files
- **Interfaces > Other Types > VLAN** -- two tagged VLANs
- **Firewall > Automation > Filter** -- filter rules created by Harmony
- **Firewall > NAT > Port Forward** -- DNAT rules
- **Firewall > NAT > Outbound** -- SNAT rules
- **Firewall > NAT > One-to-One** -- BINAT rules
- **Interfaces > Virtual IPs > Settings** -- IP aliases
- **Interfaces > Other Types > LAGG** -- link aggregation group
## Clean up
```bash
cargo run -p opnsense-vm-integration -- --clean
```
Destroys the VM and virtual networks. The cached OPNsense image is kept for next time.
## How it works
### Architecture
```
Your workstation                         OPNsense VM (KVM)
+--------------------+                   +---------------------+
| Harmony            |                   | OPNsense 26.1       |
|  +---------------+ |     REST API      |  +---------------+  |
|  | OPNsense      | |---(HTTPS:9443)--->|  | API + Plugins |  |
|  | Scores        | |                   |  +---------------+  |
|  +---------------+ |       SSH         |  +---------------+  |
|  +---------------+ |----(port 22)----->|  | FreeBSD Shell |  |
|  | OPNsense-     | |                   |  +---------------+  |
|  | Bootstrap     | |   HTTP session    |                     |
|  +---------------+ |---(HTTPS:443)---->|  (first-boot only)  |
|  +---------------+ |                   |                     |
|  | opnsense-     | |                   |  LAN: 192.168.1.1   |
|  | config        | |                   |  WAN: DHCP          |
|  +---------------+ |                   +---------------------+
+--------------------+
```
The stack has four layers:
1. **`opnsense-api`** -- auto-generated typed Rust client from OPNsense XML model files
2. **`opnsense-config`** -- high-level configuration modules (DHCP, firewall, load balancer, etc.)
3. **`OPNsenseBootstrap`** -- first-boot automation via HTTP session auth (login, wizard, SSH, webgui port)
4. **Harmony Scores** -- declarative desired-state descriptions that make the firewall match
### The Score pattern
```rust
// 1. Declare desired state
let score = VlanScore {
    vlans: vec![
        VlanDef { parent: "vtnet0", tag: 100, description: "management" },
        VlanDef { parent: "vtnet0", tag: 200, description: "storage" },
    ],
};

// 2. Execute against topology -- queries current state, applies diff
score.interpret(&inventory, &topology).await?;
// Output: [VlanScore] SUCCESS in 0.9s -- Created 2 VLANs
```
Scores are idempotent: running the same Score twice produces the same result.
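The property can be pictured with a toy diff-based apply (illustrative only, not the real interpreter): the desired state goes in, only the missing pieces are created, so re-running converges to a no-op.

```rust
use std::collections::HashSet;

// Toy model of a diff-based, idempotent apply.
struct VlanState { current: HashSet<u16> }

impl VlanState {
    /// Returns the VLAN tags actually created by this run.
    fn apply(&mut self, desired: &[u16]) -> Vec<u16> {
        // HashSet::insert returns false for tags that already exist,
        // so already-applied state is skipped.
        desired.iter().copied().filter(|t| self.current.insert(*t)).collect()
    }
}
```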
## Network architecture
```
Host (192.168.1.10) --- virbr-opn bridge --- OPNsense LAN (192.168.1.1)
                        192.168.1.0/24       vtnet0
                        NAT to internet
                    --- virbr0 (default) --- OPNsense WAN (DHCP)
                        192.168.122.0/24     vtnet1
                        NAT to internet
```
## Available commands
| Command | Description |
|---------|-------------|
| `--check` | Verify prerequisites (libvirtd, virsh, qemu-img) |
| `--download` | Download the OPNsense image (cached) |
| `--boot` | Create VM + automated bootstrap |
| (default) | Run integration test (assumes VM is bootstrapped) |
| `--full` | Boot + bootstrap + integration test (CI mode) |
| `--status` | Show VM state, ports, and connectivity |
| `--clean` | Destroy VM and networks |
## Environment variables
| Variable | Default | Description |
|----------|---------|-------------|
| `RUST_LOG` | (unset) | Log level: `info`, `debug`, `trace` |
| `HARMONY_KVM_URI` | `qemu:///system` | Libvirt connection URI |
| `HARMONY_KVM_IMAGE_DIR` | `~/.local/share/harmony/kvm/images` | Cached disk images |
## Troubleshooting
**VM won't start / permission denied**
Ensure your user is in the `libvirt` group and that the image directory is traversable by the qemu user. Run `setup-libvirt.sh` to fix.
**192.168.1.0/24 conflict**
If your host network already uses this subnet, the VM will be unreachable. Edit the constants in `src/main.rs` to use a different subnet.
**Web GUI didn't come up after bootstrap**
The bootstrap runs `diagnose_via_ssh()` automatically when the web UI doesn't respond. Check the diagnostic output for lighttpd status and listening ports. You can also access the serial console: `virsh -c qemu:///system console opn-integration`
**HAProxy install fails**
OPNsense may need a firmware update. The integration test handles this automatically but it may take a few minutes for the update + reboot cycle.
## What's next
- **[OPNsense Firewall Pair](../../examples/opnsense_pair_integration/README.md)** -- boot two VMs, configure CARP HA failover with `FirewallPairTopology` and `CarpVipScore`. Uses NIC link control to bootstrap both VMs sequentially despite sharing the same default IP.
- [OKD on Bare Metal](./okd-on-bare-metal.md) -- the full 7-stage OKD installation pipeline using OPNsense as the infrastructure backbone
- [PostgreSQL on Local K3D](./postgresql-on-local-k3d.md) -- a simpler starting point using Kubernetes

# Use Case: PostgreSQL on Local K3D
Deploy a production-grade PostgreSQL cluster on a local Kubernetes cluster (K3D) using Harmony. This is the fastest way to get started with Harmony and requires no external infrastructure.
## What you'll have at the end
A fully operational PostgreSQL cluster with:
- 1 primary instance with 1 GiB of storage
- CloudNativePG operator managing the cluster lifecycle
- Automatic failover support (foundation for high-availability)
- Exposed as a Kubernetes Service for easy connection
## Prerequisites
- Rust 2024 edition
- Docker running locally
- ~5 minutes
## The Score
The entire deployment is expressed in ~20 lines of Rust:
```rust
use harmony::{
    inventory::Inventory,
    modules::postgresql::{PostgreSQLScore, capability::PostgreSQLConfig},
    topology::K8sAnywhereTopology,
};

#[tokio::main]
async fn main() {
    let postgres = PostgreSQLScore {
        config: PostgreSQLConfig {
            cluster_name: "harmony-postgres-example".to_string(),
            namespace: "harmony-postgres-example".to_string(),
            ..Default::default()
        },
    };

    harmony_cli::run(
        Inventory::autoload(),
        K8sAnywhereTopology::from_env(),
        vec![Box::new(postgres)],
        None,
    )
    .await
    .unwrap();
}
```
## What Harmony does
When you run this, Harmony:
1. **Connects to K8sAnywhereTopology** — this auto-provisions a K3D cluster if none exists
2. **Installs the CloudNativePG operator** — one-time setup that enables PostgreSQL cluster management in Kubernetes
3. **Creates a PostgreSQL cluster** — Harmony translates the Score into a `Cluster` CRD and applies it
4. **Exposes the database** — creates a Kubernetes Service for the PostgreSQL primary
## Running it
```bash
cargo run -p example-postgresql
```
## Verifying the deployment
```bash
# Check pods
kubectl get pods -n harmony-postgres-example

# Get the password
PASSWORD=$(kubectl get secret -n harmony-postgres-example \
    harmony-postgres-example-db-user \
    -o jsonpath='{.data.password}' | base64 -d)

# Connect via port-forward (leave this running; connect from another terminal)
kubectl port-forward -n harmony-postgres-example svc/harmony-postgres-example-rw 5432:5432
PGPASSWORD="$PASSWORD" psql -h localhost -p 5432 -U postgres
```
## Customizing the deployment
The `PostgreSQLConfig` struct supports:
| Field | Default | Description |
|-------|---------|-------------|
| `cluster_name` | — | Name of the PostgreSQL cluster |
| `namespace` | — | Kubernetes namespace to deploy to |
| `instances` | `1` | Number of instances |
| `storage_size` | `1Gi` | Persistent storage size per instance |
Example with custom settings:
```rust
let postgres = PostgreSQLScore {
    config: PostgreSQLConfig {
        cluster_name: "my-prod-db".to_string(),
        namespace: "database".to_string(),
        instances: 3,
        storage_size: "10Gi".to_string().into(),
        ..Default::default()
    },
};
```
## Extending the pattern
This pattern extends to any Kubernetes-native workload:
- Add **monitoring** by including a `Monitoring` feature alongside your Score
- Add **TLS certificates** by including a `CertificateScore`
- Add **tenant isolation** by wrapping in a `TenantScore`
See [Scores Catalog](../catalogs/scores.md) for the full list.

examples/README.md

# Examples
This directory contains runnable examples demonstrating Harmony's capabilities. Each example is a self-contained program that can be run with `cargo run -p example-<name>`.
## Quick Reference
| Example | Description | Local K3D | Existing Cluster | Hardware Needed |
|---------|-------------|:---------:|:----------------:|:---------------:|
| `postgresql` | Deploy a PostgreSQL cluster | ✅ | ✅ | — |
| `ntfy` | Deploy ntfy notification server | ✅ | ✅ | — |
| `tenant` | Create a multi-tenant namespace | ✅ | ✅ | — |
| `cert_manager` | Provision TLS certificates | ✅ | ✅ | — |
| `node_health` | Check Kubernetes node health | ✅ | ✅ | — |
| `monitoring` | Deploy Prometheus alerting | ✅ | ✅ | — |
| `monitoring_with_tenant` | Monitoring + tenant isolation | ✅ | ✅ | — |
| `operatorhub_catalog` | Install OperatorHub catalog | ✅ | ✅ | — |
| `validate_ceph_cluster_health` | Verify Ceph cluster health | — | ✅ | Rook/Ceph |
| `remove_rook_osd` | Remove a Rook OSD | — | ✅ | Rook/Ceph |
| `brocade_snmp_server` | Configure Brocade switch SNMP | — | ✅ | Brocade switch |
| `opnsense_node_exporter` | Node exporter on OPNsense | — | ✅ | OPNsense firewall |
| `opnsense_vm_integration` | Full OPNsense firewall automation (11 Scores) | ✅ | — | KVM/libvirt |
| `opnsense_pair_integration` | OPNsense HA pair with CARP failover | ✅ | — | KVM/libvirt |
| `okd_pxe` | PXE boot configuration for OKD | — | — | ✅ |
| `okd_installation` | Full OKD bare-metal install | — | — | ✅ |
| `okd_cluster_alerts` | OKD cluster monitoring alerts | — | ✅ | OKD cluster |
| `multisite_postgres` | Multi-site PostgreSQL failover | — | ✅ | Multi-cluster |
| `nats` | Deploy NATS messaging | — | ✅ | Multi-cluster |
| `nats-supercluster` | NATS supercluster across sites | — | ✅ | Multi-cluster |
| `lamp` | LAMP stack deployment | ✅ | ✅ | — |
| `openbao` | Deploy OpenBao vault | ✅ | ✅ | — |
| `zitadel` | Deploy Zitadel identity provider | ✅ | ✅ | — |
| `try_rust_webapp` | Rust webapp with packaging | ✅ | ✅ | Submodule |
| `rust` | Rust webapp with full monitoring | ✅ | ✅ | — |
| `rhob_application_monitoring` | RHOB monitoring setup | ✅ | ✅ | — |
| `sttest` | Full OKD stack test | — | — | ✅ |
| `application_monitoring_with_tenant` | App monitoring + tenant | — | ✅ | OKD cluster |
| `kube-rs` | Direct kube-rs client usage | ✅ | ✅ | — |
| `k8s_drain_node` | Drain a Kubernetes node | ✅ | ✅ | — |
| `k8s_write_file_on_node` | Write files to K8s nodes | ✅ | ✅ | — |
| `harmony_inventory_builder` | Discover hosts via subnet scan | ✅ | — | — |
| `cli` | CLI tool with inventory discovery | ✅ | — | — |
| `tui` | Terminal UI demonstration | ✅ | — | — |
## Status Legend
| Symbol | Meaning |
|--------|---------|
| ✅ | Works out-of-the-box |
| — | Not applicable or requires specific setup |
## By Category
### Data Services
- **`postgresql`** — Deploy a PostgreSQL cluster via CloudNativePG
- **`multisite_postgres`** — Multi-site PostgreSQL with failover
- **`public_postgres`** — Public-facing PostgreSQL (⚠️ uses NationTech DNS)
### Kubernetes Utilities
- **`node_health`** — Check node health in a cluster
- **`k8s_drain_node`** — Drain and reboot a node
- **`k8s_write_file_on_node`** — Write files to nodes
- **`validate_ceph_cluster_health`** — Verify Ceph/Rook cluster health
- **`remove_rook_osd`** — Remove an OSD from Rook/Ceph
- **`kube-rs`** — Direct Kubernetes client usage demo
### Monitoring & Alerting
- **`monitoring`** — Deploy Prometheus alerting with Discord webhooks
- **`monitoring_with_tenant`** — Monitoring with tenant isolation
- **`ntfy`** — Deploy ntfy notification server
- **`okd_cluster_alerts`** — OKD-specific cluster alerts
### Application Deployment
- **`try_rust_webapp`** — Deploy a Rust webapp with packaging (⚠️ requires `tryrust.org` submodule)
- **`rust`** — Rust webapp with full monitoring features
- **`rhob_application_monitoring`** — Red Hat Observability Stack monitoring
- **`lamp`** — LAMP stack deployment (⚠️ uses NationTech DNS)
- **`application_monitoring_with_tenant`** — App monitoring with tenant isolation
### Infrastructure & Bare Metal
- **`opnsense_vm_integration`** — **Recommended demo.** Boot an OPNsense VM and configure it with 11 Scores (load balancer, DHCP, TFTP, VLANs, firewall rules, NAT, VIPs, LAGG). Fully automated, requires only KVM. See the [detailed guide](../docs/use-cases/opnsense-vm-integration.md).
- **`opnsense_pair_integration`** — Boot two OPNsense VMs and configure a CARP HA firewall pair with `FirewallPairTopology` and `CarpVipScore`. Demonstrates NIC link control for sequential bootstrap.
- **`okd_installation`** — Full OKD cluster from scratch
- **`okd_pxe`** — PXE boot configuration for OKD
- **`sttest`** — Full OKD stack test with specific hardware
- **`brocade_snmp_server`** — Configure Brocade switch via SNMP
- **`opnsense_node_exporter`** — Node exporter on OPNsense firewall
### Multi-Cluster
- **`nats`** — NATS deployment on a cluster
- **`nats-supercluster`** — NATS supercluster across multiple sites
- **`multisite_postgres`** — PostgreSQL with multi-site failover
### Identity & Secrets
- **`openbao`** — Deploy OpenBao vault (⚠️ uses NationTech DNS)
- **`zitadel`** — Deploy Zitadel identity provider (⚠️ uses NationTech DNS)
### Cluster Services
- **`cert_manager`** — Provision TLS certificates
- **`tenant`** — Create a multi-tenant namespace
- **`operatorhub_catalog`** — Install OperatorHub catalog sources
### Development & Testing
- **`cli`** — CLI tool with inventory discovery
- **`tui`** — Terminal UI demonstration
- **`harmony_inventory_builder`** — Host discovery via subnet scan
## Running Examples
```bash
# Build first
cargo build --release
# Run any example
cargo run -p example-postgresql
cargo run -p example-ntfy
cargo run -p example-tenant
```
For examples that need an existing Kubernetes cluster:
```bash
export KUBECONFIG=/path/to/your/kubeconfig
export HARMONY_USE_LOCAL_K3D=false
export HARMONY_AUTOINSTALL=false
cargo run -p example-monitoring
```
## Notes on Private Infrastructure
Some examples use NationTech-hosted infrastructure by default (DNS domains like `*.nationtech.io`, `*.harmony.mcd`). These are not suitable for public use without modification. See the [Getting Started Guide](../docs/guides/getting-started.md) for the recommended public examples.


```diff
@@ -7,7 +7,7 @@ use harmony::{
         monitoring::alert_channel::webhook_receiver::WebhookReceiver,
         tenant::TenantScore,
     },
-    topology::{K8sAnywhereTopology, monitoring::AlertRoute, tenant::TenantConfig},
+    topology::{K8sAnywhereTopology, tenant::TenantConfig},
 };
 use harmony_types::id::Id;
 use harmony_types::net::Url;
@@ -33,14 +33,9 @@ async fn main() {
         service_port: 3000,
     });
-    let receiver_name = "sample-webhook-receiver".to_string();
     let webhook_receiver = WebhookReceiver {
-        name: receiver_name.clone(),
+        name: "sample-webhook-receiver".to_string(),
         url: Url::Url(url::Url::parse("https://webhook-doesnt-exist.com").unwrap()),
-        route: AlertRoute {
-            ..AlertRoute::default(receiver_name)
-        },
     };
     let app = ApplicationScore {
```


```diff
@@ -0,0 +1,15 @@
+[package]
+name = "example_linux_vm"
+version.workspace = true
+edition = "2024"
+license.workspace = true
+
+[[bin]]
+name = "example_linux_vm"
+path = "src/main.rs"
+
+[dependencies]
+harmony = { path = "../../harmony" }
+tokio.workspace = true
+log.workspace = true
+env_logger.workspace = true
```


@@ -0,0 +1,43 @@
# Example: Linux VM from ISO
This example deploys a simple Linux virtual machine from an ISO URL.
## What it creates
- One isolated virtual network (`linuxvm-net`, 192.168.101.0/24)
- One Ubuntu Server VM with the ISO attached as a CD-ROM
- The VM is configured to boot from the CD-ROM first, allowing installation
- After installation, the VM can be rebooted to boot from disk
## Prerequisites
- A running KVM hypervisor (local or remote)
- `HARMONY_KVM_URI` environment variable pointing to the hypervisor (defaults to `qemu:///system`)
- `HARMONY_KVM_IMAGE_DIR` environment variable for storing VM images (defaults to harmony data dir)
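For a remote hypervisor, the two variables above might be set like this before running the example (the host name and image path below are placeholders for illustration, not values from this repository):

```shell
# Hypothetical values; adjust to your environment.
export HARMONY_KVM_URI="qemu+ssh://root@kvm-host.example/system"  # remote libvirt over SSH
export HARMONY_KVM_IMAGE_DIR="/var/lib/harmony/images"            # where VM disk images are stored
```

If neither variable is set, the defaults listed above (`qemu:///system` and the harmony data dir) apply.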
## Usage
```bash
cargo run -p example_linux_vm
```
## After deployment
Once the VM is running, you can connect to its console:
```bash
virsh -c qemu:///system console linux-vm
```
To access the VM via SSH after installation, you'll need to configure a bridged network or port forwarding.
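One way to do port forwarding is a DNAT rule on the hypervisor host. This is a sketch, not part of the example: it assumes the VM received `192.168.101.10` (check with `virsh domifaddr`) and that IP forwarding is enabled on the host.

```shell
# Assumed VM address; verify with:
#   virsh -c qemu:///system domifaddr linux-vm
VM_IP=192.168.101.10

# Forward host port 2222 to the VM's SSH port (22) via DNAT,
# and allow the forwarded traffic through the host firewall.
sudo iptables -t nat -A PREROUTING -p tcp --dport 2222 -j DNAT --to-destination "${VM_IP}:22"
sudo iptables -I FORWARD -p tcp -d "${VM_IP}" --dport 22 -j ACCEPT

# Then, from another machine:
#   ssh -p 2222 <user>@<hypervisor-host>
```

Note that from the hypervisor host itself you can usually SSH to the VM's 192.168.101.x address directly, since the host has an interface on the isolated network.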
## Clean up
To remove the VM and network:
```bash
virsh -c qemu:///system destroy linux-vm
virsh -c qemu:///system undefine linux-vm
virsh -c qemu:///system net-destroy linuxvm-net
virsh -c qemu:///system net-undefine linuxvm-net
```

Some files were not shown because too many files have changed in this diff.