Production Scaling and High Availability
This guide explains how to configure an existing Upbound Space deployment for production operation at scale.
Use this guide when you're ready to deploy production scaling, high availability, and monitoring in your Space.
Prerequisites
Before you begin scaling your Spaces deployment, make sure you have:
- A working Space deployment
- Cluster administrator access
- An understanding of load patterns and growth in your organization
- A familiarity with node affinity, tainting, and Horizontal Pod Autoscaling (HPA)
Production scaling strategy
In this guide, you will:
- Create dedicated node pools for different component types
- Configure high-availability to ensure there are no single points of failure
- Set dynamic scaling for variable workloads
- Optimize your storage and component operations
- Monitor your deployment health and performance
Spaces architecture
The basic Spaces workflow follows the pattern below:
Node architecture
You can mitigate resource contention and improve reliability by separating system components into dedicated node pools.
etcd
dedicated nodes
etcd
performance directly impacts your entire Space, so isolate it for
consistent performance.
-
Create a dedicated
etcd
node poolRequirements:
- Minimum: 3 nodes for HA
- Instance type: General purpose with high network throughput/low latency
- Storage: High performance storage (
etcd
is I/O sensitive)
-
Taint
etcd
nodes to reserve themkubectl taint nodes <etcd-node> target=etcd:NoSchedule
-
Configure
etcd
storageetcd
is sensitive to storage I/O performance. Review theetcd
scaling documentation for specific storage guidance.
API server dedicated nodes
API servers handle all control plane requests and should run on dedicated infrastructure.
-
Create dedicated API server nodes
Requirements:
- Minimum: 2 nodes for HA
- Instance type: Compute-optimized, memory-optimized, or general-purpose
- Scaling: Scale vertically based on API server load patterns
-
Taint API server nodes
kubectl taint nodes <api-server-node> target=apiserver:NoSchedule
Configure cluster autoscaling
Enable cluster autoscaling for all node pools.
For AWS EKS clusters, Upbound recommends using Karpenter
for
improved bin-packing and instance type selection.
For GCP GKE clusters, follow the GKE autoscaling guide.
For Azure AKS clusters, follow the AKS autoscaling guide.
Configure high availability
Ensure control plane components can survive node and zone failures.
Enable high availability mode
-
Configure control planes for high availability
controlPlanes:
ha:
enabled: trueThis configures control plane pods to run with multiple replicas and associated pod disruption budgets.
Configure component distribution
-
Set up API server pod distribution
controlPlanes:
vcluster:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: target
operator: In
values:
- apiserver
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- vcluster
topologyKey: "kubernetes.io/hostname"
preferredDuringSchedulingIgnoredDuringExecution:
- podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- vcluster
topologyKey: topology.kubernetes.io/zone
weight: 100 -
Configure
etcd
pod distributioncontrolPlanes:
etcd:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: target
operator: In
values:
- etcd
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- vcluster-etcd
topologyKey: "kubernetes.io/hostname"
preferredDuringSchedulingIgnoredDuringExecution:
- podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- vcluster-etcd
topologyKey: topology.kubernetes.io/zone
weight: 100
Configure autoscaling for Spaces components
Set up the Spaces system components to handle variable load automatically.
Scale API and apollo
services
-
Configure minimum replicas for availability
api:
replicaCount: 2
features:
alpha:
apollo:
enabled: true
replicaCount: 2Both services support horizontal and vertical scaling based on load patterns.
Configure router autoscaling
The spaces-router
is the entry point for all traffic and needs intelligent
scaling.
-
Enable Horizontal Pod Autoscaler
router:
hpa:
enabled: true
minReplicas: 2
maxReplicas: 8
targetCPUUtilizationPercentage: 80
targetMemoryUtilizationPercentage: 80 -
Monitor scaling factors
Router scaling behavior:
- Vertical scaling: Scales based on number of control planes
- Horizontal scaling: Scales based on request volume
- Resource monitoring: Monitor CPU and memory usage
Configure controller scaling
The spaces-controller
manages Space-level resources and requires vertical
scaling.
-
Configure adequate resources with headroom
controller:
resources:
requests:
cpu: "500m"
memory: "1Gi"
limits:
cpu: "2000m"
memory: "4Gi"Important: The controller can spike when reconciling large numbers of control planes, so provide adequate headroom for resource spikes.
Set up production storage
Configure Query API database
-
Use a managed PostgreSQL database
Recommended services:
Requirements:
- Minimum 400 IOPS performance
Monitoring
Monitor key metrics to ensure healthy scaling and identify issues quickly.
Control plane health
Track these spaces-controller
metrics:
-
Total control planes
spaces_control_plane_exists
Tracks the total number of control planes in the system.
-
Degraded control planes
spaces_control_plane_degraded
Returns control planes that don't have a
Synced
,Ready
, andHealthy
state. -
Stuck control planes
spaces_control_plane_stuck
Control planes stuck in a provisioning state.
-
Deletion issues
spaces_control_plane_deletion_stuck
Control planes stuck during deletion.
Alerting
Configure alerts for critical scaling and health metrics:
- High error rates: Alert when 4xx/5xx response rates exceed thresholds
- Control plane health: Alert when degraded or stuck control planes exceed acceptable counts
Architecture overview
Spaces System Components:
spaces-router
: Entry point for all endpoints, dynamically builds routes to control plane API serversspaces-controller
: Reconciles Space-level resources, serves webhooks, works withmxp-controller
for provisioningspaces-api
: API for managing groups, control planes, shared secrets, and telemetry objects (accessed only through spaces-router)spaces-apollo
: Hosts the Query API, connects to PostgreSQL database populated byapollo-syncer
pods
Control Plane Components (per control plane):
mxp-controller
: Handles provisioning tasks, serves webhooks, installs UXP andXGQL
XGQL
: GraphQL API powering console viewskube-state-metrics
: Collects usage metrics for billing (updated bymxp-controller
when CRDs change)vector
: Works withkube-state-metrics
to send usage data to external storage for billingapollo syncer
: Syncsetcd
data into PostgreSQL for the Query API