Production Scaling and High Availability

This guide explains how to configure an existing Upbound Space deployment for production operation at scale.

Use this guide when you're ready to configure production scaling, high availability, and monitoring for your Space.

Prerequisites

Before you begin scaling your Spaces deployment, make sure you have:

- A working Space deployment
- Cluster administrator access
- An understanding of load patterns and growth in your organization
- Familiarity with node affinity, taints, and Horizontal Pod Autoscaling (HPA)

Production scaling strategy

In this guide, you will:

- Create dedicated node pools for different component types
- Configure high availability to eliminate single points of failure
- Set up dynamic scaling for variable workloads
- Optimize your storage and component operations
- Monitor your deployment health and performance

Spaces architecture

The basic Spaces workflow follows the pattern below:

[Figure: Spaces workflow]

Node architecture

You can mitigate resource contention and improve reliability by separating system components into dedicated node pools.

`etcd` dedicated nodes

etcd performance directly impacts your entire Space, so isolate it for consistent performance.

Create a dedicated etcd node pool

Requirements:
- Minimum: 3 nodes for HA
- Instance type: General purpose with high network throughput/low latency
- Storage: High performance storage (etcd is I/O sensitive)

Taint etcd nodes to reserve them

```
kubectl taint nodes <etcd-node> target=etcd:NoSchedule
```
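
A taint only keeps other pods off these nodes. The node affinity rules later in this guide select nodes by a label with the key `target`, which a taint does not create, so each reserved node likely also needs a matching label (for example, `kubectl label nodes <etcd-node> target=etcd`). As a sketch, a reserved etcd node would then report both the label and the taint. The label value `etcd` is an assumption chosen to match the affinity rules in this guide:

```yaml
# Illustrative excerpt of `kubectl get node <etcd-node> -o yaml`.
# Only the fields relevant to scheduling are shown.
apiVersion: v1
kind: Node
metadata:
  name: <etcd-node>
  labels:
    target: etcd          # matched by nodeAffinity (key: target, values: [etcd])
spec:
  taints:
    - key: target
      value: etcd
      effect: NoSchedule  # created by the kubectl taint command
```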

Configure etcd storage

etcd is sensitive to storage I/O performance. Review the etcd scaling documentation for specific storage guidance.

API server dedicated nodes

API servers handle all control plane requests and should run on dedicated infrastructure.

Create dedicated API server nodes

Requirements:
- Minimum: 2 nodes for HA
- Instance type: Compute-optimized, memory-optimized, or general-purpose
- Scaling: Scale vertically based on API server load patterns

Taint API server nodes

```
kubectl taint nodes <api-server-node> target=apiserver:NoSchedule
```
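
Pods can only be scheduled onto tainted nodes if they tolerate the taint. This guide does not show where the Spaces chart exposes tolerations, so the fragment below is the generic Kubernetes pod spec form, not a Spaces Helm value. Check the chart's values reference for the corresponding setting before relying on it:

```yaml
# Generic Kubernetes pod spec fragment (not a Spaces Helm value).
# It tolerates the taint target=apiserver:NoSchedule applied to the API server nodes.
tolerations:
  - key: target
    operator: Equal
    value: apiserver
    effect: NoSchedule
```

The etcd pods would need the same fragment with `value: etcd`.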

Configure cluster autoscaling

Enable cluster autoscaling for all node pools.

For AWS EKS clusters, Upbound recommends using Karpenter for improved bin-packing and instance type selection.

For GCP GKE clusters, follow the GKE autoscaling guide.

For Azure AKS clusters, follow the AKS autoscaling guide.

Configure high availability

Ensure control plane components can survive node and zone failures.

Enable high availability mode

Configure control planes for high availability
```
controlPlanes:
  ha:
    enabled: true
```
This configures control plane pods to run with multiple replicas and associated pod disruption budgets.
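
For orientation, a pod disruption budget limits how many replicas a voluntary disruption, such as a node drain, may evict at the same time. The manifest below is a generic illustration of that mechanism, not the object the chart renders: the name and the `maxUnavailable` value are assumptions, and the `app: vcluster` label is borrowed from the affinity rules in this guide.

```yaml
# Illustrative PodDisruptionBudget; the chart creates its own when HA is enabled.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-vcluster-pdb   # hypothetical name
spec:
  maxUnavailable: 1            # assumed value: evict at most one replica at a time
  selector:
    matchLabels:
      app: vcluster            # label used by the affinity rules in this guide
```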

Configure component distribution

Set up API server pod distribution

```
controlPlanes:
  vcluster:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: target
                  operator: In
                  values:
                    - apiserver
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - vcluster
          topologyKey: "kubernetes.io/hostname"
        preferredDuringSchedulingIgnoredDuringExecution:
        - podAffinityTerm:
            labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - vcluster
            topologyKey: topology.kubernetes.io/zone
          weight: 100
```

Configure etcd pod distribution

```
controlPlanes:
  etcd:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: target
                  operator: In
                  values:
                    - etcd
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - vcluster-etcd
          topologyKey: "kubernetes.io/hostname"
        preferredDuringSchedulingIgnoredDuringExecution:
        - podAffinityTerm:
            labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - vcluster-etcd
            topologyKey: topology.kubernetes.io/zone
          weight: 100
```

Configure autoscaling for Spaces components

Set up the Spaces system components to handle variable load automatically.

Scale API and `apollo` services

Configure minimum replicas for availability
```
api:
  replicaCount: 2

features:
  alpha:
    apollo:
      enabled: true
      replicaCount: 2
```
Both services support horizontal and vertical scaling based on load patterns.

Configure router autoscaling

The spaces-router is the entry point for all traffic, so it needs to scale with request volume.

Enable Horizontal Pod Autoscaler

```
router:
  hpa:
    enabled: true
    minReplicas: 2
    maxReplicas: 8
    targetCPUUtilizationPercentage: 80
    targetMemoryUtilizationPercentage: 80
```
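
As a rough guide to how these targets behave, the Kubernetes Horizontal Pod Autoscaler computes a desired replica count for each metric from the ratio of observed to target utilization, takes the largest result when several metrics are configured, and clamps it to the configured minimum and maximum. How the chart maps these values onto an HPA object is an assumption, and the observed utilization figures below are hypothetical:

$$
\text{desiredReplicas} = \left\lceil \text{currentReplicas} \times \frac{\text{currentUtilization}}{\text{targetUtilization}} \right\rceil
$$

With 4 router replicas averaging 100% CPU and 60% memory against the 80% targets:

$$
\text{CPU: } \left\lceil 4 \times \frac{100}{80} \right\rceil = 5, \qquad
\text{memory: } \left\lceil 4 \times \frac{60}{80} \right\rceil = 3, \qquad
\max(5, 3) = 5
$$

The result, 5 replicas, falls within the configured range of 2 to 8. Utilization targets are measured relative to each pod's resource requests, so the router pods need CPU and memory requests for these targets to take effect.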

Monitor scaling factors

Router scaling behavior:
- Vertical scaling: Resource needs grow with the number of control planes
- Horizontal scaling: Replica count grows with request volume
- Resource monitoring: Monitor CPU and memory usage

Configure controller scaling

The spaces-controller manages Space-level resources and requires vertical scaling.

Configure adequate resources with headroom
```
controller:
  resources:
    requests:
      cpu: "500m" 
      memory: "1Gi"
    limits:
      cpu: "2000m"
      memory: "4Gi"
```
Important: The controller's resource usage can spike when it reconciles large numbers of control planes, so leave headroom between requests and limits.

Set up production storage

Configure Query API database

Use a managed PostgreSQL database

Recommended services:
Requirements:
- Minimum 400 IOPS performance

Monitoring

Monitor key metrics to ensure healthy scaling and identify issues quickly.

Control plane health

Track these spaces-controller metrics:

Total control planes
```
spaces_control_plane_exists
```
Tracks the total number of control planes in the system.

Degraded control planes
```
spaces_control_plane_degraded
```
Reports control planes that are not in a Synced, Ready, and Healthy state.

Stuck control planes
```
spaces_control_plane_stuck
```
Reports control planes stuck in a provisioning state.

Deletion issues
```
spaces_control_plane_deletion_stuck
```
Reports control planes stuck during deletion.

Alerting

Configure alerts for critical scaling and health metrics:

- High error rates: Alert when 4xx/5xx response rates exceed thresholds
- Control plane health: Alert when degraded or stuck control planes exceed acceptable counts
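
The control plane health alerts could be sketched as a Prometheus Operator `PrometheusRule`. This assumes the Prometheus Operator is installed, that it scrapes the spaces-controller metrics, and that each metric is a gauge whose sum gives the number of affected control planes. The resource name, namespace, thresholds, durations, and severities are hypothetical and should be tuned to your environment:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: spaces-control-plane-health   # hypothetical name
  namespace: monitoring               # hypothetical namespace
spec:
  groups:
    - name: spaces-control-planes
      rules:
        # Fires when any control plane stays degraded for 15 minutes.
        - alert: SpacesControlPlanesDegraded
          expr: sum(spaces_control_plane_degraded) > 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "One or more control planes are not Synced, Ready, and Healthy."
        # Fires when any control plane stays stuck in provisioning for 30 minutes.
        - alert: SpacesControlPlanesStuck
          expr: sum(spaces_control_plane_stuck) > 0
          for: 30m
          labels:
            severity: critical
          annotations:
            summary: "One or more control planes are stuck in provisioning."
        # Fires when any control plane stays stuck in deletion for 30 minutes.
        - alert: SpacesControlPlaneDeletionStuck
          expr: sum(spaces_control_plane_deletion_stuck) > 0
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "One or more control planes are stuck during deletion."
```

Raising the `> 0` thresholds is one way to express "acceptable counts" for larger fleets where a small number of transiently degraded control planes is normal.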

Architecture overview

Spaces System Components:

- spaces-router: Entry point for all endpoints, dynamically builds routes to control plane API servers
- spaces-controller: Reconciles Space-level resources, serves webhooks, works with mxp-controller for provisioning
- spaces-api: API for managing groups, control planes, shared secrets, and telemetry objects (accessed only through spaces-router)
- spaces-apollo: Hosts the Query API, connects to PostgreSQL database populated by apollo-syncer pods

Control Plane Components (per control plane):

- mxp-controller: Handles provisioning tasks, serves webhooks, installs UXP and XGQL
- XGQL: GraphQL API powering console views
- kube-state-metrics: Collects usage metrics for billing (updated by mxp-controller when CRDs change)
- vector: Works with kube-state-metrics to send usage data to external storage for billing
- apollo-syncer: Syncs etcd data into PostgreSQL for the Query API
