11# Monitoring
22
3- Node Readiness Controller exposes Prometheus-compatible metrics. This page describes the Prometheus metrics exposed by Node Readiness Controller for monitoring rule evaluation, taint operations, failures, and bootstrap progress.
4-
5- The controller now includes ** enhanced SLI metrics** for tracking uptime, freshness, and lag - essential for production monitoring and SLO compliance.
3+ The Node Readiness Controller exposes Prometheus-compatible metrics. This page documents the metrics currently registered by the controller and how they can be used for monitoring rule evaluation, taint operations, failures, bootstrap progress, and rule health.
64
75## Metrics Endpoint
86
9- The controller serves metrics on ` /metrics ` only when metrics are explicitly enabled. Depending on the installation, the endpoint is served either over HTTP or over HTTPS. See [ Installation] ( ../user-guide/installation.md ) for deployment details.
7+ The controller serves metrics on ` /metrics ` only when metrics are explicitly enabled.
8+
9+ Depending on the installation, the endpoint is exposed as:
10+
11+ - HTTP on port ` 8080 ` when the standard Prometheus component is enabled.
12+ - HTTPS on port ` 8443 ` when the Prometheus TLS component is enabled.
13+
14+ See [Installation](../user-guide/installation.md) for deployment details.
15+
16+ ## Metric Lifecycle Management
17+
18+ When a ` NodeReadinessRule ` is deleted, the controller automatically cleans up the associated rule-labeled Prometheus series. This prevents stale metrics from remaining visible in dashboards and alerts.
19+
20+ ** Metrics cleaned up on rule deletion:**
21+
22+ - ` node_readiness_taint_operations_total{rule="..."} `
23+ - ` node_readiness_evaluation_duration_seconds{rule="..."} `
24+ - ` node_readiness_failures_total{rule="..."} `
25+ - ` node_readiness_bootstrap_completed_total{rule="..."} `
26+ - ` node_readiness_reconciliation_latency_seconds{rule="..."} `
27+ - ` node_readiness_bootstrap_duration_seconds{rule="..."} `
28+ - ` node_readiness_nodes_by_state{rule="..."} `
29+ - ` node_readiness_rule_last_reconciliation_timestamp_seconds{rule="..."} `
30+
31+ This ensures that:
32+
33+ - Deleted rules do not continue to appear in dashboards with stale values.
34+ - Memory usage does not grow unbounded from removed rules.
35+ - Metric cardinality reflects only the rules that currently exist.
36+
37+ ** Note:** The global ` node_readiness_rules_total ` gauge is updated separately. Rule-labeled metrics are explicitly deleted during rule cleanup.
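
A minimal sketch of how the cleanup could be checked from the Prometheus side, assuming every active rule exports a `node_readiness_rule_last_reconciliation_timestamp_seconds` series (an assumption, not something this page states). After a rule is deleted, its name should no longer appear in the result:

```promql
# Rules that still export rule-labeled series
count by (rule) (node_readiness_rule_last_reconciliation_timestamp_seconds)
```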
38+
39+ -----
1040
1141## Core Metrics
1242
1343### ` node_readiness_rules_total `
1444
15- Number of ` NodeReadinessRule ` objects tracked by the controller.
45+ Number of ` NodeReadinessRule ` objects currently tracked by the controller.
1646
1747| Property | Value |
1848| --- | --- |
@@ -27,24 +57,17 @@ Total number of taint operations performed by the controller.
2757| Property | Value |
2858| --- | --- |
2959| Type | ` counter ` |
30- | Labels | ` rule ` , ` operation ` |
60+ | Labels | ` rule ` , ` operation ` ( ` add ` , ` remove ` ) |
3161| Recorded when | The controller successfully adds or removes a taint |
3262
33- #### Labels
34-
35- | Label | Description | Values |
36- | --- | --- | --- |
37- | ` rule ` | ` NodeReadinessRule ` name | Any rule name |
38- | ` operation ` | Taint operation performed by the controller | ` add ` , ` remove ` |
39-
4063### ` node_readiness_evaluation_duration_seconds `
4164
42- Duration of rule evaluations.
65+ Duration of the controller's internal rule evaluations.
4366
4467| Property | Value |
4568| --- | --- |
4669| Type | ` histogram ` |
47- | Labels | none |
70+ | Labels | ` rule ` |
4871| Buckets | Prometheus default histogram buckets |
4972| Recorded when | The controller evaluates a rule against a node |
5073
@@ -55,214 +78,126 @@ Total number of failure events recorded by the controller.
5578| Property | Value |
5679| --- | --- |
5780| Type | ` counter ` |
58- | Labels | ` rule ` , ` reason ` |
59- | Recorded when | The controller records an evaluation failure or taint add/remove failure |
81+ | Labels | ` rule ` , ` reason ` ( ` EvaluationError ` , ` AddTaintError ` , ` RemoveTaintError ` ) |
82+ | Recorded when | The controller encounters an error evaluating or patching a node |
6083
61- #### Labels
84+ ### ` node_readiness_bootstrap_completed_total `
6285
63- | Label | Description | Values |
64- | --- | --- | --- |
65- | ` rule ` | ` NodeReadinessRule ` name | Any rule name |
86+ Total number of nodes that have completed bootstrap.
6687
67- ## Enhanced SLI Metrics
88+ | Property | Value |
89+ | --- | --- |
90+ | Type | ` counter ` |
91+ | Labels | ` rule ` |
92+ | Recorded when | The controller marks bootstrap as completed for a node under a bootstrap-only rule |
6893
69- The controller exposes additional metrics for fine-grained monitoring across three key dimensions:
94+ -----
7095
71- ### Latency Metrics
96+ ## Extended Health and SLI Metrics
7297
73- #### ` node_readiness_reconciliation_latency_seconds `
98+ ### ` node_readiness_reconciliation_latency_seconds `
7499
75100End-to-end latency from node condition change to taint operation completion.
76101
77102| Property | Value |
78103| --- | --- |
79104| Type | ` histogram ` |
80- | Labels | ` rule ` , ` operation ` |
81- | Buckets | 0.1, 0.5, 1, 2, 5, 10, 30, 60, 120, 300 seconds |
82- | Recorded when | A taint operation (add/remove) completes |
105+ | Labels | ` rule ` , ` operation ` ( ` add_taint ` , ` remove_taint ` ) |
106+ | Buckets | ` 0.01 ` , ` 0.05 ` , ` 0.1 ` , ` 0.5 ` , ` 1 ` , ` 2 ` , ` 5 ` , ` 10 ` , ` 30 ` , ` 60 ` , ` 120 ` , ` 300 ` seconds |
107+ | Recorded when | A taint operation completes |
83108
84- ** Use Case :** Measure how quickly the controller responds to condition changes. Critical for understanding end-to-end reconciliation performance .
109+ ** Use case :** Measure how quickly the controller responds to node condition changes in the cluster .
85110
86- #### ` node_readiness_reconciliation_duration_seconds `
111+ ### ` node_readiness_bootstrap_duration_seconds `
87112
88- Duration of complete reconciliation cycle including all operations .
113+ Time from node creation to bootstrap completion for bootstrap-only rules .
89114
90115| Property | Value |
91116| --- | --- |
92117| Type | ` histogram ` |
93- | Labels | ` rule ` , ` node ` |
94- | Buckets | 0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10, 30 seconds |
95- | Recorded when | A node reconciliation completes |
96-
97- ** Use Case:** Monitor full reconciliation performance per node. Different from evaluation duration as it includes all operations.
98-
99- ### Freshness Metrics
100-
101- #### ` node_readiness_last_reconciliation_timestamp_seconds `
102-
103- Unix timestamp of the last successful reconciliation for a node and rule.
104-
105- | Property | Value |
106- | --- | --- |
107- | Type | ` gauge ` |
108- | Labels | ` rule ` , ` node ` |
109- | Recorded when | A node reconciliation completes successfully |
118+ | Labels | ` rule ` |
119+ | Buckets | ` 1 ` , ` 5 ` , ` 10 ` , ` 30 ` , ` 60 ` , ` 120 ` , ` 300 ` , ` 600 ` , ` 1200 ` seconds |
120+ | Recorded when | Bootstrap completion is observed for a node under a bootstrap-only rule |
110121
111- ** Use Case :** Identify stale reconciliations and detect when nodes stop being reconciled .
122+ ** Use case :** Measure the actual time nodes take to become fully provisioned and bootstrap-complete .
112123
113- #### ` node_readiness_reconciliation_lag_seconds `
124+ ### ` node_readiness_nodes_by_state `
114125
115- Time since last reconciliation for a node and rule (freshness indicator) .
126+ Number of nodes in each readiness state per rule.
116127
117128| Property | Value |
118129| --- | --- |
119130| Type | ` gauge ` |
120- | Labels | ` rule ` , ` node ` |
121- | Recorded when | A node reconciliation completes |
131+ | Labels | ` rule ` , ` state ` ( ` ready ` , ` not_ready ` , ` bootstrapping ` ) |
132+ | Recorded when | A rule reconciliation completes |
122133
123- ** Use Case :** Monitor reconciliation freshness and detect lag in processing. Essential for detecting performance degradation .
134+ ** Use case :** Track aggregate node health without introducing per-node metric cardinality, keeping controller memory footprint lean .
124135
125- #### ` node_readiness_condition_transition_timestamp_seconds `
136+ ### ` node_readiness_rule_last_reconciliation_timestamp_seconds `
126137
127- Unix timestamp of the last condition transition for a node .
138+ Unix timestamp of the last reconciliation for a rule .
128139
129140| Property | Value |
130141| --- | --- |
131142| Type | ` gauge ` |
132- | Labels | ` rule ` , ` node ` , ` condition_type ` |
133- | Recorded when | A node condition is evaluated |
134-
135- ** Use Case:** Measure freshness of condition data and detect stale conditions.
136-
137- ### Uptime / Availability Metrics
138-
139- #### ` node_readiness_evaluations_total `
140-
141- Total number of rule evaluations performed per node.
142-
143- | Property | Value |
144- | --- | --- |
145- | Type | ` counter ` |
146- | Labels | ` rule ` , ` node ` |
147- | Recorded when | A rule evaluation starts |
148-
149- ** Use Case:** Understand reconciliation frequency and controller uptime. Useful for detecting when reconciliations stop.
150-
151- #### ` node_readiness_taint_state `
152-
153- Current taint state for a node and rule.
154-
155- | Property | Value |
156- | --- | --- |
157- | Type | ` gauge ` |
158- | Labels | ` rule ` , ` node ` , ` taint_key ` |
159- | Values | 1 (taint present), 0 (taint absent) |
160- | Recorded when | A node reconciliation completes |
161-
162- ** Use Case:** Track taint presence for uptime/availability monitoring. Enables real-time visibility into node readiness state.
143+ | Labels | ` rule ` |
144+ | Recorded when | A rule reconciliation loop successfully completes |
163145
164- #### ` node_readiness_conditions_satisfied `
146+ ** Use case: ** Detect rules that may be stuck or not reconciling frequently enough.
165147
166- Whether all conditions are satisfied for a node and rule.
148+ -----
167149
168- | Property | Value |
169- | --- | --- |
170- | Type | ` gauge ` |
171- | Labels | ` rule ` , ` node ` |
172- | Values | 1 (all satisfied), 0 (not satisfied) |
173- | Recorded when | A node reconciliation completes |
150+ ## Example Queries & SLOs
174151
175- ** Use Case: ** Track node readiness state for SLO monitoring. Essential for availability tracking.
152+ ### Latency Monitoring & SLOs
176153
177- ## Example Queries
154+ ** Objective: ** 95% of internal evaluations complete within 50 milliseconds (0.05s).
178155
179- ### Latency Monitoring
156+ ``` promql
157+ # Percentage of evaluations completing within 50ms
158+ sum(rate(node_readiness_evaluation_duration_seconds_bucket{le="0.05"}[5m])) /
159+ sum(rate(node_readiness_evaluation_duration_seconds_count[5m])) * 100
160+ ```
180161
181162``` promql
182- # P95 reconciliation latency
163+ # P95 End-to-End Reconciliation Latency across all rules
183164histogram_quantile(0.95,
184- rate(node_readiness_reconciliation_latency_seconds_bucket[5m])
165+ sum by (le) (
166+ rate(node_readiness_reconciliation_latency_seconds_bucket[5m])
167+ )
185168)
186-
187- # Slow reconciliations (>5s)
188- sum(rate(node_readiness_reconciliation_duration_seconds_bucket{le="5"}[5m]))
189169```
190170
191- ### Freshness Monitoring
171+ ### Freshness Monitoring & SLOs
192172
193- ``` promql
194- # Average reconciliation lag by rule
195- avg(node_readiness_reconciliation_lag_seconds) by (rule)
196-
197- # Nodes not reconciled in last 5 minutes
198- (time() - node_readiness_last_reconciliation_timestamp_seconds) > 300
199- ```
200-
201- ### Availability Monitoring
173+ ** Objective:** Every rule has reconciled within the last 2 minutes.
202174
203175``` promql
204- # Percentage of nodes with satisfied conditions
205- avg(node_readiness_conditions_satisfied) by (rule) * 100
206-
207- # Count of tainted (not ready) nodes
208- sum(node_readiness_taint_state) by (rule)
176+ # Alert if any rule has not reconciled in the last 120 seconds
177+ (time() - node_readiness_rule_last_reconciliation_timestamp_seconds) > 120
209178```
210179
211- ## SLI/SLO Examples
180+ ### Availability Monitoring & SLOs
212181
213- ### Latency SLO
214- ** Objective:** 95% of reconciliations complete within 5 seconds
182+ ** Objective:** 99.9% of targeted nodes are ready.
215183
216184``` promql
217- sum(rate(node_readiness_reconciliation_duration_seconds_bucket{le="5"}[5m])) /
218- sum(rate(node_readiness_reconciliation_duration_seconds_count[5m])) * 100
219- ```
220-
221- ### Freshness SLO
222- ** Objective:** 99% of nodes reconciled within last 2 minutes
223-
224- ``` promql
225- sum(node_readiness_reconciliation_lag_seconds < 120) /
226- count(node_readiness_reconciliation_lag_seconds) * 100
227- ```
185+ # Percentage of ready nodes globally
186+ 100 * sum(node_readiness_nodes_by_state{state="ready"}) / sum(node_readiness_nodes_by_state)
228187
229- ### Availability SLO
230- ** Objective:** 99.9% of nodes have conditions satisfied
231-
232- ``` promql
233- avg(node_readiness_conditions_satisfied) * 100
188+ # Percentage of ready nodes per rule
189+ 100 * sum by (rule) (node_readiness_nodes_by_state{state="ready"}) / sum by (rule) (node_readiness_nodes_by_state)
234190```
235191
236- ## Grafana Dashboard
192+ ## Monitoring and Scale Testing
237193
238- A comprehensive Grafana dashboard with panels for all metrics is available. See:
239- - [ Detailed Metrics Documentation] ( ../../../metrics.md )
240- - [ Grafana Dashboard Panels] ( ../../../grafana-dashboard-panels.md )
241- - [ Scale Testing Setup] ( ../../../../hack/test-workloads/scale/README.md )
194+ For an end-to-end monitoring setup with Prometheus and Grafana during scale tests, see the [ scale testing guide] ( ../../../../hack/test-workloads/scale/README.md ) .
242195
243196## Alerting Recommendations
244197
245- Set up alerts for:
246- - ** High Latency:** P95 latency > 10s for 5 minutes
247- - ** Stale Reconciliations:** Lag > 5 minutes
248- - ** High Failure Rate:** Failures > 0.1/sec for 5 minutes
249- - ** Low Availability:** < 95% nodes ready for 10 minutes
250-
251- See the [ detailed metrics documentation] ( ../../../metrics.md ) for complete alerting rule examples.
252- | ` reason ` | Failure label recorded by the controller | ` EvaluationError ` , ` AddTaintError ` , ` RemoveTaintError ` |
253-
254- ### ` node_readiness_bootstrap_completed_total `
255-
256- Total number of nodes that have completed bootstrap.
257-
258- | Property | Value |
259- | --- | --- |
260- | Type | ` counter ` |
261- | Labels | ` rule ` |
262- | Recorded when | The controller marks bootstrap as completed for a node under a bootstrap-only rule |
263-
264- #### Labels
198+ Typical alerts to consider:
265199
266- | Label | Description | Values |
267- | --- | --- | --- |
268- | ` rule ` | ` NodeReadinessRule ` name | Any rule name |
200+ - ** High latency:** P95 reconciliation latency above 10s for 5 minutes.
201+ - ** Stale reconciliations:** Any rule with no reconciliation for more than 5 minutes.
202+ - ** High failure rate:** Sustained increase in ` node_readiness_failures_total ` .
203+ - ** Low availability:** Ready-node percentage below your target threshold for a sustained period.
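
The alerts above could be expressed roughly as follows. This is a sketch, not a tested rule set: the `> 0` failure threshold and the `95` availability target are placeholder assumptions, and the "for 5 minutes" or "sustained" part would be set with the `for:` field of a Prometheus alerting rule rather than in the expression itself.

```promql
# High latency: P95 reconciliation latency above 10s
histogram_quantile(0.95,
  sum by (le) (rate(node_readiness_reconciliation_latency_seconds_bucket[5m]))
) > 10

# Stale reconciliations: no reconciliation for more than 5 minutes
(time() - node_readiness_rule_last_reconciliation_timestamp_seconds) > 300

# High failure rate: any failures over the last 5 minutes (placeholder threshold)
sum by (rule, reason) (rate(node_readiness_failures_total[5m])) > 0

# Low availability: ready-node percentage below an assumed 95% target
100 * sum(node_readiness_nodes_by_state{state="ready"})
  / sum(node_readiness_nodes_by_state) < 95
```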