The Node Readiness Controller exposes Prometheus-compatible metrics. This page documents the metrics currently registered by the controller and how they can be used for monitoring rule evaluation, taint operations, failures, bootstrap progress, and rule health.

## Metrics Endpoint

The controller serves metrics on `/metrics` only when metrics are explicitly enabled.

Depending on the installation, the endpoint is exposed as:

- HTTP on port `8080` when the standard Prometheus component is enabled.
- HTTPS on port `8443` when the Prometheus TLS component is enabled.

See [Installation](../user-guide/installation.md) for deployment details.

## Metric Lifecycle Management

When a `NodeReadinessRule` is deleted, the controller automatically cleans up the associated rule-labeled Prometheus series. This prevents stale metrics from remaining visible in dashboards and alerts.
```
100 * sum by (rule) (node_readiness_nodes_by_state{state="ready"}) / sum by (rule) (node_readiness_nodes_by_state)
```

## Monitoring and Scale Testing

For an end-to-end monitoring setup with Prometheus and Grafana during scale tests, see the [scale testing guide](../../../../hack/test-workloads/scale/README.md).

## Alerting Recommendations

Typical alerts to consider:
- **High latency:** P95 reconciliation latency above 10s for 5 minutes.
- **Stale reconciliations:** Any rule with no reconciliation for more than 5 minutes.
- **High failure rate:** Sustained increase in `node_readiness_failures_total`.
- **Low availability:** Ready-node percentage below your target threshold for a sustained period.
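
As a rough sketch, the last two alerts could be expressed with PromQL along the following lines. Only `node_readiness_failures_total` and `node_readiness_nodes_by_state` (with its `rule` and `state` labels) come from this page; the `15m` window, the `> 0` condition, and the `90` percent threshold are illustrative assumptions to tune for your environment. The latency and staleness alerts are left out because the metrics they depend on are not named in this section.

```promql
# High failure rate: failures were recorded at any point in the window.
# The 15m window and the "> 0" condition are assumptions, not recommendations.
sum(rate(node_readiness_failures_total[15m])) > 0

# Low availability: fewer than 90% of a rule's nodes are in the "ready" state.
# Both sides are aggregated by rule so that their label sets match.
# Replace 90 with your own target threshold.
100 * sum by (rule) (node_readiness_nodes_by_state{state="ready"})
  / sum by (rule) (node_readiness_nodes_by_state) < 90
```

In a Prometheus alerting rule, the "sustained period" would normally be set with the rule's `for:` field rather than inside the expression.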