Files
harmony/harmony_node_readiness/README.md
2026-03-02 14:55:28 -05:00

215 lines
6.4 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# harmony-node-readiness-endpoint
**A lightweight, standalone Rust service for Kubernetes node health checking.**
Designed for **bare-metal Kubernetes clusters** with external load balancers (HAProxy, OPNsense, F5, etc.).
It exposes a simple, reliable HTTP endpoint (`/health`) on each node that returns:
- **200 OK** — node is healthy and ready to receive traffic
- **503 Service Unavailable** — node should be removed from the load balancer pool
This project is **not dependent on Harmony**, but is commonly used as part of Harmony bare-metal Kubernetes deployments.
## Why this project exists
In bare-metal environments, external load balancers often rely on pod-level or router-level checks that can lag behind the authoritative Kubernetes `Node.status.conditions[Ready]`.
This service provides the true source-of-truth with fast reaction time.
## Features & Roadmap
| Check | Description | Status | Check Name |
|------------------------------------|--------------------------------------------------|---------------------|--------------------|
| **Node readiness (API)** | Queries `Node.status.conditions[Ready]` via Kubernetes API | **Implemented** | `node_ready` |
| **OKD Router health** | Probes OpenShift router healthz on port 1936 | **Implemented** | `okd_router_1936` |
| Filesystem readonly | Detects read-only mounts via `/proc/mounts` | To be implemented | `filesystem_ro` |
| Kubelet running | Local probe to kubelet `/healthz` (port 10248) | To be implemented | `kubelet` |
| CRI-O / container runtime health | Socket check + runtime status | To be implemented | `container_runtime`|
| Disk / inode pressure | Threshold checks on key filesystems | To be implemented | `disk_pressure` |
| Network reachability | DNS resolution + gateway connectivity | To be implemented | `network` |
| Custom NodeConditions | Reacts to extra conditions (NPD, etc.) | To be implemented | `custom_conditions`|
All checks are combined with logical **AND** — any failure results in 503.
## How it works
### Node Name Discovery
The service automatically discovers its own node name using the **Kubernetes Downward API**:
```yaml
env:
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
```
### Kubernetes API Authentication
- Uses standard **in-cluster configuration** (no external credentials needed).
- The ServiceAccount token and CA certificate are automatically mounted by Kubernetes at `/var/run/secrets/kubernetes.io/serviceaccount/`.
- The application (via `kube-rs` or your Harmony higher-level client) calls the equivalent of `Config::incluster_config()`.
- Requires only minimal RBAC: `get` permission on the `nodes` resource (see `deploy/rbac.yaml`).
## Quick Start
### 1. Build and push
```bash
cargo build --release --bin harmony-node-readiness-endpoint
docker build -t your-registry/harmony-node-readiness-endpoint:v1.0.0 .
docker push your-registry/harmony-node-readiness-endpoint:v1.0.0
```
### 2. Deploy
```bash
kubectl apply -f deploy/namespace.yaml
kubectl apply -f deploy/rbac.yaml
kubectl apply -f deploy/daemonset.yaml
```
(The DaemonSet uses `hostPort: 25001` by default so the endpoint is reachable directly on the node's IP.)
### 3. Configure your external load balancer
**Example for HAProxy / OPNsense:**
- Check type: **HTTP**
- URI: `/health`
- Port: `25001` (configurable via `LISTEN_PORT`)
- Interval: 510 s
- Rise: 2
- Fall: 3
- Expect: `2xx`
## Health Endpoint Examples
### Query Parameter
Use the `check` query parameter to specify which checks to run. Multiple checks can be comma-separated.
| Request | Behavior |
|--------------------------------------|---------------------------------------------|
| `GET /health` | Runs `node_ready` (default) |
| `GET /health?check=okd_router_1936` | Runs only OKD router check |
| `GET /health?check=node_ready,okd_router_1936` | Runs both checks |
**Note:** When the `check` parameter is provided, only the specified checks run. You must explicitly include `node_ready` if you want it along with other checks.
### Response Format
Each check result includes:
- `name`: The check identifier
- `passed`: Boolean indicating success or failure
- `reason`: (Optional) Failure reason if the check failed
- `duration_ms`: Time taken to execute the check in milliseconds
**Healthy node (default check)**
```http
HTTP/1.1 200 OK
Content-Type: application/json
```
**Healthy node (multiple checks)**
```http
GET /health?check=node_ready,okd_router_1936
HTTP/1.1 200 OK
Content-Type: application/json
```
**Unhealthy node (one check failed)**
```http
GET /health?check=node_ready,okd_router_1936
HTTP/1.1 503 Service Unavailable
Content-Type: application/json
```
**Unhealthy node (default check)**
```http
HTTP/1.1 503 Service Unavailable
Content-Type: application/json
```
## Configuration (via DaemonSet env vars)
```yaml
env:
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: LISTEN_PORT
value: "25001"
```
Checks are selected via the `check` query parameter on the `/health` endpoint. See the usage examples above.
## Development
```bash
# Run locally (set NODE_NAME env var)
NODE_NAME=my-test-node cargo run
```
---
*Minimal, auditable, and built for production bare-metal Kubernetes environments.*