Architecture Decision Record: Multi-Tenancy Strategy for Harmony-Managed Clusters
Initial Author: Jean-Gabriel Gill-Couture
Initial Date: 2025-05-26
Status
Proposed
Context
Harmony manages production OKD/Kubernetes clusters that serve multiple clients with varying trust levels and operational requirements. We need a multi-tenancy strategy that provides:
- Strong isolation between client workloads while maintaining operational simplicity
- Controlled API access allowing clients self-service capabilities within defined boundaries
- Security-first approach protecting both the cluster infrastructure and tenant data
- Harmony-native implementation using our Score/Interpret pattern for automated tenant provisioning
- Scalable management supporting both small trusted clients and larger enterprise customers
The official Kubernetes multi-tenancy documentation identifies two primary models: namespace-based isolation and virtual control planes per tenant. Given Harmony's focus on operational simplicity, provider-agnostic abstractions (ADR-003), and hexagonal architecture (ADR-002), we must choose an approach that balances security, usability, and maintainability.
Our clients represent a hybrid tenancy model:
- Customer multi-tenancy: Each client operates independently with no cross-tenant trust
- Team multi-tenancy: Individual clients may have multiple team members requiring coordinated access
- API access requirement: Unlike pure SaaS scenarios, clients need controlled Kubernetes API access for self-service operations
This ADR draws heavily on the official Kubernetes multi-tenancy documentation: https://kubernetes.io/docs/concepts/security/multi-tenancy/
Decision
Implement namespace-based multi-tenancy with the following architecture:
1. Network Security Model
- Private cluster access: Kubernetes API and OpenShift console accessible only via WireGuard VPN
- No public exposure: Control plane endpoints remain internal to prevent unauthorized access attempts
- VPN-based authentication: Initial access control through per-client WireGuard key pairs (an example client configuration follows this list)
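To make the access model concrete, the sketch below shows a minimal client-side WireGuard configuration embedded in a Rust template constant, as provisioning tooling might render it. The subnet, endpoint, and key placeholders are illustrative assumptions, not Harmony's actual values.

/// Illustrative client-side WireGuard configuration template. Every value
/// here is a placeholder; real keys and addresses are issued per client.
const WIREGUARD_CLIENT_TEMPLATE: &str = r#"
[Interface]
# Generated on the client; the private key never leaves the client machine.
PrivateKey = <client-private-key>
Address = 10.8.0.5/32

[Peer]
# The cluster-side WireGuard gateway.
PublicKey = <gateway-public-key>
# Route only the internal network that hosts the API server and console.
AllowedIPs = 10.0.0.0/16
Endpoint = vpn.example.com:51820
PersistentKeepalive = 25
"#;

fn main() {
    println!("{}", WIREGUARD_CLIENT_TEMPLATE);
}

Scoping AllowedIPs to the internal network keeps only cluster-bound traffic on the tunnel, so a client's general internet traffic is unaffected.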
2. Tenant Isolation Strategy
- Dedicated namespace per tenant: Each client receives an isolated namespace, with access limited to the resources and operations that client requires
- Complete network isolation: NetworkPolicies prevent cross-namespace communication while allowing full egress to the public internet
- Resource governance: ResourceQuotas and LimitRanges enforce CPU, memory, and storage consumption limits (a NetworkPolicy and ResourceQuota sketch follows this list)
- Storage access control: Clients can create PersistentVolumeClaims but cannot directly manipulate PersistentVolumes or access other tenants' storage
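The sketch below renders the two manifests behind these guarantees from Rust string templates, as a TenantScore interpreter might. The TENANT_NAMESPACE placeholder, manifest names, and quota values are illustrative assumptions, not final defaults.

/// Allow pod-to-pod traffic inside the tenant namespace, deny ingress from
/// every other namespace, and permit unrestricted egress.
const NETWORK_POLICY_TEMPLATE: &str = r#"
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tenant-isolation
  namespace: TENANT_NAMESPACE
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector: {}
  egress:
    - {}
"#;

/// Illustrative consumption caps; actual values come from the tenant's plan.
const RESOURCE_QUOTA_TEMPLATE: &str = r#"
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: TENANT_NAMESPACE
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    persistentvolumeclaims: "10"
"#;

/// Substitute the tenant's namespace into a manifest template.
fn render_for_tenant(template: &str, namespace: &str) -> String {
    template.replace("TENANT_NAMESPACE", namespace)
}

fn main() {
    for template in [NETWORK_POLICY_TEMPLATE, RESOURCE_QUOTA_TEMPLATE] {
        println!("---{}", render_for_tenant(template, "client-acme"));
    }
}

Allowing intra-namespace ingress by default keeps a tenant's own workloads composable while exposing nothing across tenant boundaries.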
3. Access Control Framework
- Principle of Least Privilege: RBAC grants only necessary permissions within tenant namespace scope
- Namespace-scoped permissions: Clients can create, modify, and delete resources within their namespace
- Cluster-level restrictions: No access to cluster-wide resources, other namespaces, or sensitive cluster operations
- Whitelisted operations: Controlled self-service capabilities for ingress, secrets, configmaps, and workload management (see the RBAC sketch after this list)
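The sketch below shows what the whitelisted, namespace-scoped permissions could look like as a Role plus a group-based RoleBinding, the latter supporting the team multi-tenancy described in the Context. The resource whitelist and group naming are illustrative assumptions to be settled during implementation.

/// Illustrative namespace-scoped Role granting the whitelisted self-service
/// operations. No rule touches cluster-scoped resources such as nodes,
/// namespaces, or PersistentVolumes.
const TENANT_ROLE_TEMPLATE: &str = r#"
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: tenant-operator
  namespace: TENANT_NAMESPACE
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "services", "configmaps", "secrets", "persistentvolumeclaims"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["networking.k8s.io"]
    resources: ["ingresses"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
"#;

/// Bind the Role to a per-tenant group so team members share access.
const TENANT_ROLE_BINDING_TEMPLATE: &str = r#"
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tenant-operators
  namespace: TENANT_NAMESPACE
subjects:
  - kind: Group
    name: TENANT_NAMESPACE-team
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: tenant-operator
  apiGroup: rbac.authorization.k8s.io
"#;

fn main() {
    let namespace = "client-acme";
    for template in [TENANT_ROLE_TEMPLATE, TENANT_ROLE_BINDING_TEMPLATE] {
        println!("---{}", template.replace("TENANT_NAMESPACE", namespace));
    }
}

Binding to a group rather than to individual users means onboarding a new team member changes group membership only, never the RBAC objects themselves.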
4. Identity Management Evolution
- Phase 1: Manual provisioning of VPN access and Kubernetes ServiceAccounts/Users
- Phase 2: Migration to Keycloak-based identity management (aligning with ADR-006) for centralized authentication and lifecycle management; an illustrative OKD OAuth configuration follows this list
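On OKD/OpenShift, Phase 2 can delegate authentication to Keycloak through the cluster OAuth resource. The sketch below is a minimal example assuming an illustrative Keycloak realm URL, client ID, and secret name; the exact claim mapping will be settled alongside ADR-006.

/// Illustrative OKD OAuth resource pointing the cluster at a Keycloak realm.
/// The issuer URL, client ID, and secret name are placeholders; the client
/// secret must exist as a Secret in the openshift-config namespace.
const KEYCLOAK_OAUTH_CONFIG: &str = r#"
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
    - name: keycloak
      type: OpenID
      mappingMethod: claim
      openID:
        issuer: https://keycloak.example.com/realms/harmony
        clientID: harmony-clusters
        clientSecret:
          name: keycloak-client-secret
        claims:
          preferredUsername:
            - preferred_username
          email:
            - email
"#;

fn main() {
    println!("{}", KEYCLOAK_OAUTH_CONFIG);
}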
5. Harmony Integration
- TenantScore implementation: Declarative tenant provisioning using Harmony's Score/Interpret pattern
- Topology abstraction: Tenant configuration abstracted from underlying Kubernetes implementation details
- Automated deployment: Complete tenant setup automated through Harmony's orchestration capabilities
Rationale
Network Security Through VPN Access
- Defense in depth: VPN requirement adds critical security layer preventing unauthorized cluster access
- Simplified firewall rules: No need for complex public endpoint protections or rate limiting
- Audit capability: VPN access provides clear audit trail of cluster connections
- Aligns with enterprise practices: Most enterprise customers already use VPN infrastructure
Namespace Isolation vs Virtual Control Planes
Following Kubernetes official guidance, namespace isolation provides:
- Lower resource overhead: Virtual control planes require dedicated etcd, API server, and controller manager per tenant
- Operational simplicity: Single control plane to maintain, upgrade, and monitor
- Cross-tenant service integration: Enables future controlled cross-tenant communication if required
- Proven stability: Namespace-based isolation is well-tested and widely deployed
- Cost efficiency: Significantly lower infrastructure costs compared to dedicated control planes
Hybrid Tenancy Model Suitability
Our approach addresses both customer and team multi-tenancy requirements:
- Customer isolation: Strong network and RBAC boundaries prevent cross-tenant interference
- Team collaboration: Multiple team members can share namespace access through group-based RBAC
- Self-service balance: Controlled API access enables client autonomy without compromising security
Harmony Architecture Alignment
- Provider agnostic: TenantScore abstracts multi-tenancy concepts, enabling future support for other Kubernetes distributions
- Hexagonal architecture: Tenant management becomes an infrastructure capability accessed through well-defined ports
- Declarative automation: Tenant lifecycle fully managed through Harmony's Score execution model
Consequences
Positive Consequences
- Strong security posture: VPN + namespace isolation provides robust tenant separation
- Operational efficiency: Single cluster management with automated tenant provisioning
- Client autonomy: Self-service capabilities reduce operational support burden
- Scalable architecture: Can support hundreds of tenants per cluster without architectural changes
- Future flexibility: Foundation supports evolution to more sophisticated multi-tenancy models
- Cost optimization: Shared infrastructure maximizes resource utilization
Negative Consequences
- VPN operational overhead: Requires VPN infrastructure management
- Manual provisioning complexity: Phase 1 manual user management creates administrative burden
- Network policy dependency: Requires a CNI with NetworkPolicy support (OVN-Kubernetes provides this and is the OKD/OpenShift default)
- Cluster-wide resource limitations: Some advanced Kubernetes features require cluster-wide access
- Single point of failure: Cluster outage affects all tenants simultaneously
Migration Challenges
- Legacy client integration: Existing clients may need VPN client setup and credential migration
- Monitoring complexity: Per-tenant observability requires careful metric and log segmentation
- Backup considerations: Tenant data backup must respect isolation boundaries
Alternatives Considered
Alternative 1: Virtual Control Plane Per Tenant
Pros: Complete control plane isolation, full Kubernetes API access per tenant
Cons: 3-5x higher resource usage, complex cross-tenant networking, operational complexity scales linearly with tenants
Rejected: Resource overhead incompatible with cost-effective multi-tenancy goals
Alternative 2: Dedicated Clusters Per Tenant
Pros: Maximum isolation, independent upgrade cycles, simplified security model
Cons: Exponential operational complexity, prohibitive costs, resource waste
Rejected: Operational overhead makes this approach unsustainable for multiple clients
Alternative 3: Public API with Advanced Authentication
Pros: No VPN requirement, potentially simpler client access
Cons: Larger attack surface, complex rate limiting and DDoS protection, increased security monitoring requirements
Rejected: Risk/benefit analysis favors VPN-based access control
Alternative 4: Service Mesh Based Isolation
Pros: Fine-grained traffic control, encryption, advanced observability
Cons: Significant operational complexity, performance overhead, steep learning curve
Rejected: Complexity overhead outweighs benefits for current requirements; remains option for future enhancement
Additional Notes
Implementation Roadmap
- Phase 1: Implement VPN access and manual tenant provisioning
- Phase 2: Deploy TenantScore automation for namespace, RBAC, and NetworkPolicy management
- Phase 3: Harden against privilege escalation from pods, audit for isolation weaknesses, and enforce security policies on pod runtimes (a Pod Security admission sketch follows this list)
- Phase 4: Integrate Keycloak for centralized identity management
- Phase 5: Add advanced monitoring and per-tenant observability
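One proven mechanism for the Phase 3 runtime policies is Kubernetes Pod Security admission, applied as labels on each tenant namespace at provisioning time. The sketch below uses the upstream restricted profile; whether Harmony adopts that profile or leans on OKD's SecurityContextConstraints instead is an open implementation question.

/// Illustrative tenant Namespace carrying Pod Security admission labels.
/// The restricted profile blocks privilege escalation, host access, and
/// root containers; audit and warn surface violations without breaking
/// workloads during rollout.
const TENANT_NAMESPACE_TEMPLATE: &str = r#"
apiVersion: v1
kind: Namespace
metadata:
  name: TENANT_NAMESPACE
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
"#;

fn main() {
    println!("{}", TENANT_NAMESPACE_TEMPLATE.replace("TENANT_NAMESPACE", "client-acme"));
}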
TenantScore Structure Preview
pub struct TenantScore {
    /// Tenant identity and namespace naming.
    pub tenant_config: TenantConfig,
    /// CPU, memory, and storage consumption limits.
    pub resource_quotas: ResourceQuotaConfig,
    /// Cross-namespace NetworkPolicy rules.
    pub network_isolation: NetworkIsolationPolicy,
    /// PVC permissions and storage-class restrictions.
    pub storage_access: StorageAccessConfig,
    /// Namespace-scoped roles and bindings.
    pub rbac_config: RBACConfig,
}
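To make the preview self-contained, one possible shape for the nested configuration types is sketched below; every field name here is hypothetical and will be finalized in the follow-up ADR on the Tenant abstraction.

// Hypothetical shapes for the nested configuration types above; all fields
// are illustrative, pending the follow-up ADR.
pub struct TenantConfig {
    pub name: String,
    pub namespace: String,
}

pub struct ResourceQuotaConfig {
    pub cpu_limit: String,    // e.g. "8"
    pub memory_limit: String, // e.g. "16Gi"
    pub max_pvc_count: u32,
}

pub struct NetworkIsolationPolicy {
    pub allow_intra_namespace: bool,
    pub allow_public_egress: bool,
}

pub struct StorageAccessConfig {
    pub allowed_storage_classes: Vec<String>,
}

pub struct RBACConfig {
    /// Group whose members receive the tenant-operator Role.
    pub team_group: String,
}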
Future Enhancements
- Cross-tenant service mesh: For approved inter-tenant communication
- Advanced monitoring: Per-tenant Prometheus/Grafana instances
- Backup automation: Tenant-scoped backup policies
- Cost allocation: Detailed per-tenant resource usage tracking
This ADR establishes the foundation for secure, scalable multi-tenancy in Harmony-managed clusters while maintaining operational simplicity and cost effectiveness. A follow-up ADR will detail the Tenant abstraction and user management mechanisms within the Harmony framework.