Environment metrics
Upbound recognizes that individual operators have preferences about how to monitor their infrastructure. Metrics for Crossplane, Environments, and Spaces are exposed using standard Prometheus practices.
The following components expose Prometheus metrics within the environment scope:
- kube-state-metrics: information about managed resources
- mxp-gateway: metrics about the ingress into a control plane
- otel-collector: single endpoint for observing inside the Space
Per Space metrics
The otel-collector exposes metrics from within a specific vCluster, which runs within the Space itself. Several of the otel-collector's receivers and processors assume it's running as a Kubernetes pod inside the vCluster. The otel-collector gathers:
- Kubernetes metrics
- Crossplane, Provider, and DNS metrics
- Metrics from any pods managed by vCluster that are annotated for automatic scraping
The default configuration has an OpenTelemetry trace collector enabled. vCluster samples and sends 10% of traces to prevent over-collection and reduce noise.
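Purely as an illustration of the concept, a 10% sampling rate is commonly expressed in an OpenTelemetry Collector configuration with the contrib probabilistic_sampler processor. The Space ships and manages its own configuration, so the snippet below is only a sketch; the receiver and exporter names are placeholders:
processors:
  probabilistic_sampler:
    # keep roughly 10% of traces, mirroring the default sampling rate described above
    sampling_percentage: 10
service:
  pipelines:
    traces:
      receivers: [otlp]                    # placeholder receiver name
      processors: [probabilistic_sampler]
      exporters: [otlp]                    # placeholder exporter name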
Configuration
Out of the box, Prometheus and Prometheus-compatible scrapers such as Datadog work without additional configuration.
Example Prometheus installation
Most operators probably have a monitoring solution in place, and if not, Amazon, Azure, and Google all offer managed monitoring solutions.
The standard for monitoring Kubernetes and Kubernetes workloads is Prometheus. For operators who want to explore the Environment and Spaces metrics, Upbound recommends using Prometheus as a starting point. Prometheus may not meet an operator's business or technology requirements. Upbound advises operators to consider which metrics and observability stack meet their needs before running in production.
The recommended way to install Prometheus is through the official helm charts.
Add the Prometheus chart repository:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
Install Prometheus, Alertmanager, Prometheus node exporter, kube-state-metrics, and Grafana:
helm install \
-n monitoring --create-namespace \
prometheus prometheus-community/kube-prometheus-stack
See the official docs.
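After the install completes, you can verify that the stack is running; the monitoring namespace below matches the install command above:
# list the Prometheus, Alertmanager, Grafana, and exporter pods installed by the chart
kubectl get pods -n monitoring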
Add monitors
While you can use static configuration for your Prometheus installation, it's recommended to use either Service or Pod monitors. All observable services use the standard Prometheus annotations of:
- prometheus.io/port: Name of the pod/service port to scrape
- prometheus.io/scrape: When "true", scrape this port
For those who want to scrape only the OpenTelemetry metrics:
- internal.spaces.upbound.io/metrics-path: Name of the path to scrape
- internal.spaces.upbound.io/metrics-port: Port number to scrape
An example pod carrying these annotations appears below.
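For illustration only, an annotated pod might look like the following; the values shown are hypothetical and only demonstrate the annotation keys listed above:
metadata:
  annotations:
    prometheus.io/scrape: "true"                          # scrape this pod
    prometheus.io/port: "metrics"                         # name of the port to scrape (hypothetical)
    internal.spaces.upbound.io/metrics-path: "/metrics"   # path to scrape (hypothetical)
    internal.spaces.upbound.io/metrics-port: "9090"       # port number to scrape (hypothetical)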
Assuming your Prometheus helm release name is prometheus, the following configures monitoring:
kubectl apply -f - <<EOM
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    # this is the default label used by the prometheus operator to determine which services
    # to scrape. You may need to adjust labels depending on your installation.
    release: prometheus
  name: space-metrics
  namespace: monitoring
spec:
  endpoints:
  - honorLabels: true
    path: /metrics
    port: metrics
    scheme: http
    scrapeTimeout: 10s
    relabelings:
    - action: labeldrop
      regex: pod
    - action: labeldrop
      regex: service
    - action: labeldrop
      regex: instance
  namespaceSelector:
    any: true
  selector:
    matchLabels:
      app.kubernetes.io/name: otlp-collector
      vcluster.loft.sh/managed-by: vcluster
      vcluster.loft.sh/namespace: upbound-system
EOM
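If you prefer Pod monitors, a roughly equivalent PodMonitor sketch follows. It assumes the collector pods carry the app.kubernetes.io/name label used above, so adjust the selector to match your installation; applying both this and the ServiceMonitor would scrape the collectors twice, so choose one:
kubectl apply -f - <<EOM
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  labels:
    release: prometheus
  name: space-pod-metrics
  namespace: monitoring
spec:
  podMetricsEndpoints:
  - honorLabels: true
    path: /metrics
    port: metrics
  namespaceSelector:
    any: true
  selector:
    matchLabels:
      app.kubernetes.io/name: otlp-collector
EOM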
Connect to metrics
If you deployed the example Prometheus stack and monitors, metrics are available through three views:
- Grafana Dashboard. To access these metrics, run
  kubectl port-forward -n monitoring svc/prometheus-grafana 8080:80
  and then navigate in your browser to http://localhost:8080. The default user/password is admin/prom-operator.
  - After logging in, select the "+" in the upper right corner of the top panel.
  - Select "Import Dashboard".
  - Enter 19358 and select your Prometheus instance. You can now see an example dashboard for monitoring control plane health.
  - Several other metrics show the health of the Kubernetes cluster and are useful for surfacing problems outside of Spaces, such as resource use and the health of the host cluster.
- Prometheus, which is accessible by running
  kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 8081:9090
  and then going to http://localhost:8081 in your browser. Prometheus is useful for ad-hoc queries by advanced users.
- Individual OpenTelemetry Collectors, which can show per-control plane metrics. Run
  NAME=ctp1 kubectl port-forward \
    -n mxp-$(kubectl get controlplanes/${NAME} -o jsonpath='{.status.controlPlaneID}{"\n"}')-system \
    svc/otlp-collector-x-upbound-system-x-vcluster 9090:9090
  and then navigate in your browser to http://localhost:9090/metrics.
The dashboard shows metrics that Upbound has determined to be useful for an overview of the control planes:
- Service Availability: gaps can reveal a problem with a service. Causes of gaps can be pods restarting or failing their service checks.
- Pods Running: Shows the runtime of the pods
- Container Restarts: restarting pods are evidence of a problem, such as OutOfMemory kills or application errors. Check the pod logs for more information.
- Gateway Errors: reveal a problem with clients connecting to the gateway. Check the container logs for potential causes.
- Provider CRDs: shows the number of CRDs installed. Having a lot of CRDs degrades performance of the control plane.
Metrics of import
Both the Environment and each Space and their components produce thousands of metrics. The following metrics, and PromQL queries, can help operators focus on the most important ones.
Since each space runs in a namespace, use the variable $namespace as a filter to scope metrics to a single space. Space namespaces use the naming convention of mxp-<UUID>-system.
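For reference, one way to resolve this namespace for a named control plane, reusing the controlPlaneID lookup from the port-forward example above:
# resolve the mxp-<UUID>-system namespace for a control plane named "ctp1"
NAME=ctp1
NAMESPACE="mxp-$(kubectl get controlplanes/${NAME} -o jsonpath='{.status.controlPlaneID}')-system"
echo "${NAMESPACE}"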
MXP-Gateway
The MXP-Gateway is a reverse proxy from the Environment to an individual Space. The MXP-Gateway handles a Space's authentication and restricts the Group/Version/Kind API calls to Crossplane components.
Success duration
Get the duration of successful queries to the MXP Gateway.
sum without(pod, service, instance)(http_request_duration_milliseconds_sum{kubernetes_pod_name=~"mxp-gateway-.*", http_status_code="200", namespace="$namespace"})
The http_method label values GET, PUT, and DELETE are useful for filtering by request type. In general, GET and PUT are the most useful.
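For example, to scope the query above to successful GET requests:
sum without(pod, service, instance)(http_request_duration_milliseconds_sum{kubernetes_pod_name=~"mxp-gateway-.*", http_method="GET", http_status_code="200", namespace="$namespace"})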
Failure duration
The duration of failed requests (non-200). High durations could be indicative of problems with the vCluster API.
sum without(instance, job, pod, kubernetes_pod_name)(http_request_duration_milliseconds_sum{kubernetes_pod_name=~"mxp-gateway-.*", http_method="GET", http_status_code!="200", namespace="$namespace"})
The http_method label values GET, PUT, and DELETE are useful for filtering by request type. In general, GET and PUT are the most useful.
Resources
Operators can use both CRDs and Managed Resources to understand how heavily utilized a space is.
CRDs
count(apiextensions_openapi_v3_regeneration_count{namespace="$namespace", crd=~".*.(upbound.io|crossplane.io)"})
If you are using custom provider types (for example, org.example.com), you can add them to the regular expression. As the number of CRDs increases, the load on the API server increases by about 4 megabytes per CRD. Upbound has observed instability when a Space exceeds 500 CRDs.
Having no CRDs installed indicates that a control plane doesn’t have a provider installed.
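For example, to also count CRDs from the org.example.com group mentioned above:
count(apiextensions_openapi_v3_regeneration_count{namespace="$namespace", crd=~".*.(upbound.io|crossplane.io|org.example.com)"})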
Managed Resources
Show the number of Managed Resources by group, kind and version.
sum(kube_managedresource_uid{namespace="$namespace"}) by (customresource_group, customresource_kind, customresource_version)
The number of managed resources can be a proxy for potential load. The CPU and memory requirements for Crossplane increase as the number of managed resources increases, which is why it's useful to observe this number.
Having no managed resources indicates that the control plane is inactive.
Health
Pod health
sum without(service, uid)(kube_pod_container_status_running{namespace="$namespace"}) >= 1
When viewed as a heat map, this metric gives you a view of the total health and indicates potential pod crashes.
To see the number of pod restarts, use the following query:
increase(kube_pod_container_status_restarts_total{namespace="$namespace"}[1m]) > 0
Service availability
Get the service/pod availability. An unavailable pod/service indicates that the workload isn’t ready.
sum without (pod) (label_replace(up{namespace="$namespace", job="pods"}, "pod", "$1", "job", "(.*)"))
Crossplane
Controller reconciliation times
Show the increase in reconciliation times, which should be ever-increasing. Missing or flat metrics indicate that Crossplane itself or a provider is in a stuck state (networking, cluster issue, and so on).
sum without(pod, instance, endpoint, job, exported_instance)(increase(controller_runtime_reconcile_time_seconds_sum{namespace="$namespace"}[2m]))
Work queue
The depth of the work queue indicates how much work is "waiting" by CRD Group/Version/Kind. A deep work queue indicates that the system is under load.
workqueue_depth{namespace="$namespace", name=~".*.(crossplane.io|upbound.io)"}
Update the regular expression for name to include custom CRD/provider types.
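For example, extending the regular expression to also match the hypothetical org.example.com group:
workqueue_depth{namespace="$namespace", name=~".*.(crossplane.io|upbound.io|org.example.com)"}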
Rate for time to reconcile
Show the two-minute rate for the average reconciliation time. This number is subjective for cloud resources as it's affected by API throttling. You should investigate high reconciliation times (5 minutes or more).
sum without (pod, kubernetes_pod_name, service)(increase(controller_runtime_reconcile_total{namespace="$namespace", result="success"}[2m]))
Adjust [2m] to change the increase over the range of time. Upbound recommends a lower time value of one or two minutes to show spikes in behavior. Using times longer than five minutes reduces fidelity.
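For example, the same query with a one-minute window to highlight short spikes:
sum without (pod, kubernetes_pod_name, service)(increase(controller_runtime_reconcile_total{namespace="$namespace", result="success"}[1m]))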
Rate for errors
Show the two-minute rate for the average time to error. This metric should be blank. Errors at the start-up of a control plane, or at configuration time of a provider, are normal and should go away. You should investigate consistent errors, as they may reveal:
- Credential issues
- Connectivity problems
- Improper configurations of a Claim, XRD or Composition
sum without (pod, kubernetes_pod_name, service)(increase(controller_runtime_reconcile_total{namespace="$namespace", result="error", controller=~"(claim.*|offered.*|composite.*|defined.*|packages.*|revisions.*)"}[2m])) > 0
Adjust [2m] to change the increase over a range of time. Upbound recommends using a lower time value of one or two minutes to show spikes in behavior. Using times longer than five minutes reduces fidelity.
API histogram
Where $operation is CREATE, UPDATE, DELETE, or PUT, show the five-minute 95th percentile of how long API admission took in the Kubernetes API server. Low values mean an idle server, while high values could be a symptom that the API server is under pressure. High values could be due to CPU or memory pressure, load, or potential Environment issues.
histogram_quantile(0.95, sum(rate(apiserver_admission_controller_admission_duration_seconds_bucket{operation="$operation", namespace="$namespace", rejected="false"}[5m])) by (le))
You can edit this query, as shown in the example below, by:
- Changing 0.95 to another top percentile, such as 0.99 for the TP99
- Setting rejected to true to show rejected requests, which reveal the creation of improperly configured Claims
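For example, a TP99 view of rejected admission requests, applying both changes at once:
histogram_quantile(0.99, sum(rate(apiserver_admission_controller_admission_duration_seconds_bucket{operation="$operation", namespace="$namespace", rejected="true"}[5m])) by (le))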
API aggregation unavailable
Get the number of APIs that aren't available. If an XRD is unavailable, such as during an upgrade or if it wasn't configured correctly, the value is non-zero.
aggregator_unavailable_apiservice{name=~"(.*crossplane.io|.*upbound.io)", namespace="$namespace"} > 0
Update the name regular expression to include custom provider types.
Resource usage
The metrics below come from the Kubelet. If your monitoring stack doesn't collect Kubelet metrics (such as through the Prometheus Node Exporter installed by the kube-prometheus-stack), these metrics may not be available.
Memory
Show the memory usage by pod, with the requests and limits.
Memory usage should be below the limits. The pod may be OOM killed
when it exceeds the limit, resulting in component brownouts.
sum(container_memory_working_set_bytes{job="kubelet", metrics_path="/metrics/cadvisor", cluster="$cluster", namespace="$namespace", pod="$pod", container!="", image!=""}) by (container)
CPU throttling
CPU throttling points to an active pod starved for CPU. CPU starvation results in unacceptable performance.
Set $pod to the name of the pod and $__rate_interval to your desired time range.
Note: this metric assumes that Prometheus Node Exporter is running.
sum(increase(container_cpu_cfs_throttled_periods_total{job="kubelet", metrics_path="/metrics/cadvisor", namespace="$namespace", pod="$pod", container!=""}[$__rate_interval])) by (container) / sum(increase(container_cpu_cfs_periods_total{job="kubelet", metrics_path="/metrics/cadvisor", namespace="$namespace", pod="$pod", container!=""}[$__rate_interval])) by (container)