Architecture Decision Record: Multi-Tenancy Strategy for Harmony Managed Clusters

Initial Author: Jean-Gabriel Gill-Couture

Initial Date: 2025-05-26

Status

Proposed

Context

Harmony manages production OKD/Kubernetes clusters that serve multiple clients with varying trust levels and operational requirements. We need a multi-tenancy strategy that provides:

  1. Strong isolation between client workloads while maintaining operational simplicity
  2. Controlled API access allowing clients self-service capabilities within defined boundaries
  3. Security-first approach protecting both the cluster infrastructure and tenant data
  4. Harmony-native implementation using our Score/Interpret pattern for automated tenant provisioning
  5. Scalable management supporting both small trusted clients and larger enterprise customers

The official Kubernetes multi-tenancy documentation identifies two primary models: namespace-based isolation and virtual control planes per tenant. Given Harmony's focus on operational simplicity, provider-agnostic abstractions (ADR-003), and hexagonal architecture (ADR-002), we must choose an approach that balances security, usability, and maintainability.

Our clients represent a hybrid tenancy model:

  • Customer multi-tenancy: Each client operates independently with no cross-tenant trust
  • Team multi-tenancy: Individual clients may have multiple team members requiring coordinated access
  • API access requirement: Unlike pure SaaS scenarios, clients need controlled Kubernetes API access for self-service operations

The official Kubernetes documentation on multi-tenancy heavily informed this ADR: https://kubernetes.io/docs/concepts/security/multi-tenancy/

Decision

Implement namespace-based multi-tenancy with the following architecture:

1. Network Security Model

  • Private cluster access: Kubernetes API and OpenShift console accessible only via WireGuard VPN
  • No public exposure: Control plane endpoints remain internal to prevent unauthorized access attempts
  • VPN-based authentication: Initial access control through per-client WireGuard key pairs (sketched below)
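
To make the access path concrete, here is a minimal sketch of the per-client WireGuard configuration Harmony could hand to a tenant user, rendered from Rust. The helper function and every key, address, and endpoint value are illustrative placeholders, not part of Harmony today.

// Hypothetical helper: renders the WireGuard client config for one tenant
// user. All parameter values are placeholders chosen at provisioning time.
fn render_wireguard_client_config(
    client_private_key: &str, // generated per user, never shared across tenants
    client_vpn_ip: &str,      // address assigned from the VPN subnet
    server_public_key: &str,
    server_endpoint: &str,    // public host:port of the WireGuard server
    cluster_api_cidr: &str,   // only the API/console range is routed via VPN
) -> String {
    format!(
        "[Interface]\n\
         PrivateKey = {client_private_key}\n\
         Address = {client_vpn_ip}/32\n\
         \n\
         [Peer]\n\
         PublicKey = {server_public_key}\n\
         Endpoint = {server_endpoint}\n\
         AllowedIPs = {cluster_api_cidr}\n\
         PersistentKeepalive = 25\n"
    )
}

Keeping AllowedIPs restricted to the cluster range means only API and console traffic transits the tunnel; a tenant's general internet traffic never touches our infrastructure.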

2. Tenant Isolation Strategy

  • Dedicated namespace per tenant: Each client receives an isolated namespace scoped to only the resources and operations they require
  • Complete network isolation: NetworkPolicies block cross-namespace traffic while leaving egress to the public internet open (see the sketch after this list)
  • Resource governance: ResourceQuotas and LimitRanges enforce CPU, memory, and storage consumption limits
  • Storage access control: Clients can create PersistentVolumeClaims but cannot directly manipulate PersistentVolumes or access other tenants' storage
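
The sketch below shows, under assumed naming conventions, the isolation manifests a TenantScore interpreter could emit for one tenant namespace. The YAML is standard Kubernetes; the helper functions, object names, and quota keys chosen here are illustrative, not settled Harmony conventions.

// Hypothetical helpers: render the per-tenant isolation manifests as YAML.
fn render_tenant_network_policy(namespace: &str) -> String {
    format!(
        r#"apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tenant-isolation
  namespace: {namespace}
spec:
  podSelector: {{}}
  policyTypes: [Ingress, Egress]
  ingress:
    # Only pods within this namespace may connect in; because every tenant
    # namespace carries this policy, cross-tenant traffic is dropped.
    - from:
        - podSelector: {{}}
  egress:
    # Egress stays fully open, including to the public internet.
    - {{}}
"#
    )
}

fn render_tenant_quota(namespace: &str, cpu: &str, memory: &str, storage: &str) -> String {
    format!(
        r#"apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: {namespace}
spec:
  hard:
    requests.cpu: "{cpu}"
    requests.memory: {memory}
    requests.storage: {storage}
"#
    )
}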

3. Access Control Framework

  • Principle of least privilege: RBAC grants only the permissions each tenant needs, scoped to its namespace
  • Namespace-scoped permissions: Clients can create, modify, and delete resources within their own namespace
  • Cluster-level restrictions: No access to cluster-wide resources, other namespaces, or sensitive cluster operations
  • Whitelisted operations: Controlled self-service for ingress, secrets, configmaps, and workload management (see the sketch after this list)
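
A hedged sketch of the namespace-scoped RBAC this framework implies. The resource whitelist mirrors the bullets above; the role name, group naming scheme, and exact resource list are assumptions to be settled in the follow-up ADR.

// Hypothetical helper: renders the tenant Role and RoleBinding as YAML.
fn render_tenant_rbac(namespace: &str, tenant_group: &str) -> String {
    format!(
        r#"apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: tenant-admin
  namespace: {namespace}
rules:
  # Workload management (the apiGroup/resource cross-product is permissive
  # here for brevity; a production role would pair them precisely)
  - apiGroups: ["", "apps", "batch"]
    resources: [pods, pods/log, deployments, statefulsets, jobs, cronjobs]
    verbs: [get, list, watch, create, update, patch, delete]
  # Whitelisted self-service: services, secrets, configmaps, ingresses, PVCs
  - apiGroups: ["", "networking.k8s.io"]
    resources: [services, secrets, configmaps, ingresses, persistentvolumeclaims]
    verbs: [get, list, watch, create, update, patch, delete]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tenant-admin-binding
  namespace: {namespace}
subjects:
  - kind: Group
    name: {tenant_group}
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: tenant-admin
  apiGroup: rbac.authorization.k8s.io
"#
    )
}

Because the Role is namespaced, nothing it grants can reach outside the tenant's namespace, and no ClusterRoleBinding ever names a tenant subject.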

4. Identity Management Evolution

  • Phase 1: Manual provisioning of VPN access and Kubernetes ServiceAccounts/Users
  • Phase 2: Migration to Keycloak-based identity management (aligning with ADR-006) for centralized authentication and lifecycle management

5. Harmony Integration

  • TenantScore implementation: Declarative tenant provisioning using Harmony's Score/Interpret pattern
  • Topology abstraction: Tenant configuration abstracted from underlying Kubernetes implementation details
  • Automated deployment: Complete tenant setup automated through Harmony's orchestration capabilities

Rationale

Network Security Through VPN Access

  • Defense in depth: VPN requirement adds critical security layer preventing unauthorized cluster access
  • Simplified firewall rules: No need for complex public endpoint protections or rate limiting
  • Audit capability: VPN access provides a clear audit trail of cluster connections
  • Aligns with enterprise practices: Most enterprise customers already use VPN infrastructure

Namespace Isolation vs Virtual Control Planes

Following Kubernetes official guidance, namespace isolation provides:

  • Lower resource overhead: Virtual control planes require a dedicated etcd, API server, and controller manager for every tenant
  • Operational simplicity: Single control plane to maintain, upgrade, and monitor
  • Cross-tenant service integration: Enables future controlled cross-tenant communication if required
  • Proven stability: Namespace-based isolation is well-tested and widely deployed
  • Cost efficiency: Significantly lower infrastructure costs compared to dedicated control planes

Hybrid Tenancy Model Suitability

Our approach addresses both customer and team multi-tenancy requirements:

  • Customer isolation: Strong network and RBAC boundaries prevent cross-tenant interference
  • Team collaboration: Multiple team members can share namespace access through group-based RBAC
  • Self-service balance: Controlled API access enables client autonomy without compromising security

Harmony Architecture Alignment

  • Provider agnostic: TenantScore abstracts multi-tenancy concepts, enabling future support for other Kubernetes distributions
  • Hexagonal architecture: Tenant management becomes an infrastructure capability accessed through well-defined ports
  • Declarative automation: Tenant lifecycle fully managed through Harmony's Score execution model

Consequences

Positive Consequences

  • Strong security posture: VPN + namespace isolation provides robust tenant separation
  • Operational efficiency: Single cluster management with automated tenant provisioning
  • Client autonomy: Self-service capabilities reduce operational support burden
  • Scalable architecture: Can support hundreds of tenants per cluster without architectural changes
  • Future flexibility: Foundation supports evolution to more sophisticated multi-tenancy models
  • Cost optimization: Shared infrastructure maximizes resource utilization

Negative Consequences

  • VPN operational overhead: Requires VPN infrastructure management
  • Manual provisioning complexity: Phase 1 manual user management creates administrative burden
  • Network policy dependency: Requires a CNI with NetworkPolicy support (OVN-Kubernetes provides this and is the OKD/OpenShift default)
  • Cluster-wide resource limitations: Some advanced Kubernetes features require cluster-wide access
  • Single point of failure: Cluster outage affects all tenants simultaneously

Migration Challenges

  • Legacy client integration: Existing clients may need VPN client setup and credential migration
  • Monitoring complexity: Per-tenant observability requires careful metric and log segmentation
  • Backup considerations: Tenant data backup must respect isolation boundaries

Alternatives Considered

Alternative 1: Virtual Control Plane Per Tenant

Pros: Complete control plane isolation, full Kubernetes API access per tenant

Cons: 3-5x higher resource usage, complex cross-tenant networking, operational complexity scales linearly with tenants

Rejected: Resource overhead incompatible with cost-effective multi-tenancy goals

Alternative 2: Dedicated Clusters Per Tenant

Pros: Maximum isolation, independent upgrade cycles, simplified security model

Cons: Operational complexity that multiplies with every additional cluster, prohibitive costs, resource waste

Rejected: Operational overhead makes this approach unsustainable for multiple clients

Alternative 3: Public API with Advanced Authentication

Pros: No VPN requirement, potentially simpler client access

Cons: Larger attack surface, complex rate limiting and DDoS protection, increased security monitoring requirements

Rejected: Risk/benefit analysis favors VPN-based access control

Alternative 4: Service Mesh Based Isolation

Pros: Fine-grained traffic control, encryption, advanced observability

Cons: Significant operational complexity, performance overhead, steep learning curve

Rejected: Complexity overhead outweighs benefits for current requirements; remains option for future enhancement

Additional Notes

Implementation Roadmap

  1. Phase 1: Implement VPN access and manual tenant provisioning
  2. Phase 2: Deploy TenantScore automation for namespace, RBAC, and NetworkPolicy management
  3. Phase 3: Harden against privilege escalation from pods, audit for weaknesses, and enforce security policies on pod runtimes (see the sketch after this list)
  4. Phase 4: Integrate Keycloak for centralized identity management
  5. Phase 5: Add advanced monitoring and per-tenant observability
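
For Phase 3, the built-in Pod Security Standards offer one way to block privilege escalation at the runtime level: labeling each tenant namespace for the restricted profile (OKD additionally layers Security Context Constraints on top). The labels below are standard Kubernetes; wiring them into tenant provisioning through this helper is a sketch, not a settled design.

// Hypothetical helper: renders the tenant namespace with Pod Security
// Standards labels enforcing the "restricted" profile.
fn render_tenant_namespace(namespace: &str) -> String {
    format!(
        r#"apiVersion: v1
kind: Namespace
metadata:
  name: {namespace}
  labels:
    # "restricted" rejects privileged pods, allowPrivilegeEscalation: true,
    # host namespaces, and pods that do not run as non-root with a seccomp
    # profile, closing the usual escalation paths from inside a container.
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
"#
    )
}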

TenantScore Structure Preview

pub struct TenantScore {
    /// Tenant identity and namespace naming
    pub tenant_config: TenantConfig,
    /// CPU, memory, and storage consumption limits
    pub resource_quotas: ResourceQuotaConfig,
    /// NetworkPolicy rules isolating the tenant's namespace
    pub network_isolation: NetworkIsolationPolicy,
    /// PersistentVolumeClaim permissions and storage class access
    pub storage_access: StorageAccessConfig,
    /// Namespace-scoped roles and bindings for tenant users
    pub rbac_config: RBACConfig,
}
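
A hedged sketch of how interpretation could sequence these pieces, reusing the hypothetical render helpers from the Decision and roadmap sections above. The placeholder type bodies and the interpret signature are illustrative only; the follow-up ADR will define the real shapes.

// Placeholder bodies so the sketch compiles; the real definitions will
// carry more detail and are out of scope for this ADR.
pub struct TenantConfig { pub name: String }
pub struct ResourceQuotaConfig { pub cpu: String, pub memory: String, pub storage: String }
pub struct NetworkIsolationPolicy;
pub struct StorageAccessConfig;
pub struct RBACConfig { pub tenant_group: String }

impl TenantScore {
    /// Renders the tenant's manifests in apply order: the namespace must
    /// exist before quotas, isolation, and RBAC bind to it, so a partial
    /// failure never leaves a reachable but unconstrained namespace.
    pub fn interpret(&self) -> Vec<String> {
        let ns = &self.tenant_config.name;
        vec![
            render_tenant_namespace(ns),
            render_tenant_quota(
                ns,
                &self.resource_quotas.cpu,
                &self.resource_quotas.memory,
                &self.resource_quotas.storage,
            ),
            render_tenant_network_policy(ns),
            // Storage access enforcement is folded into RBAC and quotas
            // in this sketch rather than emitted as a separate manifest.
            render_tenant_rbac(ns, &self.rbac_config.tenant_group),
        ]
    }
}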

Future Enhancements

  • Cross-tenant service mesh: For approved inter-tenant communication
  • Advanced monitoring: Per-tenant Prometheus/Grafana instances
  • Backup automation: Tenant-scoped backup policies
  • Cost allocation: Detailed per-tenant resource usage tracking

This ADR establishes the foundation for secure, scalable multi-tenancy in Harmony-managed clusters while maintaining operational simplicity and cost effectiveness. A follow-up ADR will detail the Tenant abstraction and user management mechanisms within the Harmony framework.