Sharding
When there are too much data to ingest and process, scaling Prometheus vertically may come to an end and it might become necessary to distribute scraped targets across multiple Prometheus shards.
Design
The Prometheus operator will create .spec.shards StatefulSets multiplied by .spec.replicas pods.
By default, shards use the Prometheus modulus configuration which takes the hash of the source label values in order to split scraped
targets based on the number of shards. By default, Prometheus hashes the value of the
__address__label forServiceMonitorandPodMonitorresources__param_target__label forProberesources
To query globally, deploy the Thanos querier connecting to all Thanos sidecars (in the same way, use the Thanos ruler to evaluate rules across shards). Another option is to remote write the samples to a central location.
Limitations:
- Scaling down the number of shards doesn’t reshard existing data onto remaining instances. It must be manually moved (see also Scaling below).
- Scaling up the number of shards will not reshard existing data either. It will continue to be available from the same instances.
Configuration
Implementing a custom target distribution
To implement a custom distribution, set the __tmp_hash label during target discovery using relabeling configuration. The operator uses this label’s value instead of the default labels when computing the shard assignment.
For example, to shard targets by pod namespace and name rather than by address:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: example-app
labels:
team: frontend
spec:
selector:
matchLabels:
app: example-app
endpoints:
- port: web
relabelings:
- sourceLabels: [__meta_kubernetes_pod_namespace, __meta_kubernetes_pod_name]
separator: /
targetLabel: __tmp_hash
The relabeling can also be applied at the scrape class level to affect multiple monitoring resources at once.
Scraping a target from all the shards
By default, each target is assigned to exactly one shard. To have all shards scrape the same target — useful for singleton services such as kube-state-metrics where every shard needs the full set of metrics — set the __tmp_disable_sharding label to a non-empty value using relabeling configuration.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: kube-state-metrics
spec:
selector:
matchLabels:
app.kubernetes.io/name: kube-state-metrics
endpoints:
- port: http
relabelings:
- targetLabel: __tmp_disable_sharding
replacement: "true"
Topology-aware sharding
Beta: Topology-aware sharding requires the
PrometheusTopologyShardingfeature gate to be enabled on the operator.
In multi-zone clusters, the default address-based sharding distributes targets without regard for their zone, which can generate costly cross-zone traffic. Topology-aware sharding pins each Prometheus shard to a specific zone so that it only scrapes targets local to that zone.
When mode: Topology is set, the operator:
- Generates relabeling rules so each shard keeps only targets whose zone label matches its assigned zone.
- Automatically adds a
nodeSelectorto schedule each shard’s pods in the correct zone. - Adds a
zoneexternal label to each shard’s Prometheus configuration (configurable viaexternalLabelName).
The number of shards must be greater than or equal to the number of topology values. When spec.shards is a multiple of the number of zones, the shards are evenly distributed across zones. Otherwise, some zones receive more shards than others.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
name: prometheus
spec:
shards: 4
replicas: 2
shardingStrategy:
mode: Topology
topology:
values:
- europe-west4-a
- europe-west4-b
With this configuration and 4 shards across 2 zones, shards 0 and 2 are scheduled in europe-west4-a and shards 1 and 3 in europe-west4-b. Each shard only scrapes targets in its zone.
Retaining shards
Beta: Shard retention requires the
PrometheusShardRetentionPolicyfeature gate to be enabled on the operator.
When scaling down the number of shards, the pods from the removed shards are deleted by default along with access to their historical data. To preserve scaled-down shards so their data remains queryable until the retention duration expires, set .spec.shardRetentionPolicy.whenScaled to Retain:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
name: prometheus
spec:
shards: 2
shardRetentionPolicy:
whenScaled: Retain
Retained shards continue running and can be queried via the Thanos sidecar and querier alongside the active shards. By default, the operator deletes them once the Prometheus retention time has been reached. This can be overridden with the retain.retentionPeriod field:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
name: prometheus
spec:
shards: 2
retentionSize: 100Gi
shardRetentionPolicy:
whenScaled: Retain
retain:
retentionPeriod: 7d
Note: If the Prometheus resource uses size-based retention only (no retention time configured), retained shards are kept forever by default.
Example
The following manifest creates a Prometheus server with two replicas:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
labels:
prometheus: prometheus
name: prometheus
namespace: default
spec:
serviceAccountName: prometheus
replicas: 2
serviceMonitorSelector:
matchLabels:
team: frontend
This can be verified with the following command:
kubectl get pods -n default
The output is similar to this:
prometheus-prometheus-0 2/2 Running 1 10s
prometheus-prometheus-1 2/2 Running 1 10s
Deploy the example application and monitor it:
apiVersion: apps/v1
kind: Deployment
metadata:
name: example-app
namespace: default
spec:
replicas: 3
selector:
matchLabels:
app: example-app
template:
metadata:
labels:
app: example-app
spec:
containers:
- name: example-app
image: quay.io/brancz/prometheus-example-app:v0.5.0
ports:
- name: web
containerPort: 8080
kind: Service
apiVersion: v1
metadata:
name: example-app
namespace: default
labels:
app: example-app
spec:
selector:
app: example-app
ports:
- name: web
port: 8080
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: example-app
namespace: default
labels:
team: frontend
spec:
selector:
matchLabels:
app: example-app
endpoints:
- port: web
Explore one of the monitoring Prometheus instances:
kubectl port-forward pod/prometheus-prometheus-0 9090:9090
We find the prometheus server scrapes three targets.
Now let’s expand the Prometheus resource to two shards as shown below:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
labels:
prometheus: shards
name: prometheus
namespace: default
spec:
serviceAccountName: prometheus
replicas: 2
shards: 2
serviceMonitorSelector:
matchLabels:
team: frontend
This can be verified with the following command:
kubectl get pods -n <namespace>
The output is similar to this:
prometheus-prometheus-0 2/2 Running 1 11m
prometheus-prometheus-1 2/2 Running 1 11m
prometheus-prometheus-shard-1-0 2/2 Running 1 12s
prometheus-prometheus-shard-1-1 2/2 Running 1 12s
Explore one of the monitoring Prometheus instances added for sharding:
kubectl port-forward prometheus-prometheus-shard-1-0 9091:9090
We should find that one or two targets are being scraped by this instance while the original Prometheus shard scrapes the remaining target(s).