Commit 4a0dcda

committed
Enhance metrics
1 parent 02f4d7c commit 4a0dcda

18 files changed

+974
-3328
lines changed

Makefile

Lines changed: 1 addition & 1 deletion
@@ -196,7 +196,7 @@ docker-push: ## Push docker image with the manager.
196196
# architectures. (i.e. make docker-buildx IMG_PREFIX=myregistry/mypoperator IMG_TAG=0.0.1). To use this option you need to:
197197
# - be able to use docker buildx. More info: https://docs.docker.com/build/buildx/
198198
# - have enabled BuildKit. More info: https://docs.docker.com/develop/develop-images/build_enhancements/
199-
# - be able to push the image to your registry (i.e. if you do not set a valid value via IMG_PREFIX:${IMG_TAG} then the export will fail)
199+
# - be able to push the image to your registry (i.e. if you do not set a valid value via ${IMG_PREFIX}:${IMG_TAG} then the export will fail)
200200
# To adequately provide solutions that are compatible with multiple platforms, you should consider using this option.
201201
PLATFORMS ?= linux/arm64,linux/amd64
202202
.PHONY: docker-buildx

cmd/readiness-condition-reporter/main_test.go

Lines changed: 1 addition & 1 deletion
@@ -154,8 +154,8 @@ func TestUpdateNodeCondition(t *testing.T) {
154154

155155
if foundCondition == nil {
156156
t.Fatal("Condition not found")
157+
return
157158
}
158-
159159
if foundCondition.Status != tt.wantStatus {
160160
t.Errorf("Condition status = %v, want %v", foundCondition.Status, tt.wantStatus)
161161
}

docs/TEST_README.md

Lines changed: 1 addition & 0 deletions
@@ -154,3 +154,4 @@ This section tests how the controller handles new nodes being added to the clust
154154

155155
```bash
156156
kind delete cluster --name nrr-test
157+
```
Lines changed: 100 additions & 165 deletions
@@ -1,18 +1,48 @@
11
# Monitoring
22

3-
Node Readiness Controller exposes Prometheus-compatible metrics. This page describes the Prometheus metrics exposed by Node Readiness Controller for monitoring rule evaluation, taint operations, failures, and bootstrap progress.
4-
5-
The controller now includes **enhanced SLI metrics** for tracking uptime, freshness, and lag - essential for production monitoring and SLO compliance.
3+
The Node Readiness Controller exposes Prometheus-compatible metrics. This page documents the metrics currently registered by the controller and how they can be used for monitoring rule evaluation, taint operations, failures, bootstrap progress, and rule health.
64

75
## Metrics Endpoint
86

9-
The controller serves metrics on `/metrics` only when metrics are explicitly enabled. Depending on the installation, the endpoint is served either over HTTP or over HTTPS. See [Installation](../user-guide/installation.md) for deployment details.
7+
The controller serves metrics on `/metrics` only when metrics are explicitly enabled.
8+
9+
Depending on the installation, the endpoint is exposed as:
10+
11+
- HTTP on port `8080` when the standard Prometheus component is enabled.
12+
- HTTPS on port `8443` when the Prometheus TLS component is enabled.
13+
14+
See [Installation](../user-guide/installation.md) for deployment details.
15+
16+
## Metric Lifecycle Management
17+
18+
When a `NodeReadinessRule` is deleted, the controller automatically cleans up the associated rule-labeled Prometheus series. This prevents stale metrics from remaining visible in dashboards and alerts.
19+
20+
**Metrics cleaned up on rule deletion:**
21+
22+
- `node_readiness_taint_operations_total{rule="..."}`
23+
- `node_readiness_evaluation_duration_seconds{rule="..."}`
24+
- `node_readiness_failures_total{rule="..."}`
25+
- `node_readiness_bootstrap_completed_total{rule="..."}`
26+
- `node_readiness_reconciliation_latency_seconds{rule="..."}`
27+
- `node_readiness_bootstrap_duration_seconds{rule="..."}`
28+
- `node_readiness_nodes_by_state{rule="..."}`
29+
- `node_readiness_rule_last_reconciliation_timestamp_seconds{rule="..."}`
30+
31+
This ensures that:
32+
33+
- Deleted rules do not continue to appear in dashboards with stale values.
34+
- Memory usage does not grow unbounded from removed rules.
35+
- Metric cardinality reflects only the rules that currently exist.
36+
37+
**Note:** The global `node_readiness_rules_total` gauge is updated separately. Rule-labeled metrics are explicitly deleted during rule cleanup.
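
The cleanup described above amounts to dropping every series whose `rule` label matches the deleted rule. The following is a minimal sketch of that idea using only the Go standard library. `seriesKey`, `seriesStore`, `deleteRule`, and the rule names are hypothetical and chosen for illustration; this is not the controller's code and not the Prometheus client API.

```go
package main

import "fmt"

// seriesKey identifies one time series by metric name and label values.
// Hypothetical type used only for this illustration.
type seriesKey struct {
	metric string
	rule   string
	extra  string // e.g. operation, reason, or state; empty when unused
}

// seriesStore is a toy stand-in for a metrics registry.
type seriesStore map[seriesKey]float64

// deleteRule removes every series labeled with the given rule and
// returns how many series were dropped.
func (s seriesStore) deleteRule(rule string) int {
	removed := 0
	for k := range s {
		if k.rule == rule {
			delete(s, k) // deleting during range is safe in Go
			removed++
		}
	}
	return removed
}

func main() {
	// Rule names below are made up for the example.
	store := seriesStore{
		{"node_readiness_taint_operations_total", "gpu-ready", "add"}:     3,
		{"node_readiness_taint_operations_total", "gpu-ready", "remove"}:  2,
		{"node_readiness_failures_total", "cni-ready", "EvaluationError"}: 1,
	}
	removed := store.deleteRule("gpu-ready")
	fmt.Printf("removed %d series, %d remaining\n", removed, len(store))
}
```

Series for other rules are left untouched, which is the property that keeps dashboards for the remaining rules stable.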
38+
39+
-----
1040

1141
## Core Metrics
1242

1343
### `node_readiness_rules_total`
1444

15-
Number of `NodeReadinessRule` objects tracked by the controller.
45+
Number of `NodeReadinessRule` objects currently tracked by the controller.
1646

1747
| Property | Value |
1848
| --- | --- |
@@ -27,24 +57,17 @@ Total number of taint operations performed by the controller.
2757
| Property | Value |
2858
| --- | --- |
2959
| Type | `counter` |
30-
| Labels | `rule`, `operation` |
60+
| Labels | `rule`, `operation` (`add`, `remove`) |
3161
| Recorded when | The controller successfully adds or removes a taint |
3262

33-
#### Labels
34-
35-
| Label | Description | Values |
36-
| --- | --- | --- |
37-
| `rule` | `NodeReadinessRule` name | Any rule name |
38-
| `operation` | Taint operation performed by the controller | `add`, `remove` |
39-
4063
### `node_readiness_evaluation_duration_seconds`
4164

42-
Duration of rule evaluations.
65+
Duration of the controller's internal rule evaluations.
4366

4467
| Property | Value |
4568
| --- | --- |
4669
| Type | `histogram` |
47-
| Labels | none |
70+
| Labels | `rule` |
4871
| Buckets | Prometheus default histogram buckets |
4972
| Recorded when | The controller evaluates a rule against a node |
5073

@@ -55,214 +78,126 @@ Total number of failure events recorded by the controller.
5578
| Property | Value |
5679
| --- | --- |
5780
| Type | `counter` |
58-
| Labels | `rule`, `reason` |
59-
| Recorded when | The controller records an evaluation failure or taint add/remove failure |
81+
| Labels | `rule`, `reason` (`EvaluationError`, `AddTaintError`, `RemoveTaintError`) |
82+
| Recorded when | The controller encounters an error evaluating or patching a node |
6083

61-
#### Labels
84+
### `node_readiness_bootstrap_completed_total`
6285

63-
| Label | Description | Values |
64-
| --- | --- | --- |
65-
| `rule` | `NodeReadinessRule` name | Any rule name |
86+
Total number of nodes that have completed bootstrap.
6687

67-
## Enhanced SLI Metrics
88+
| Property | Value |
89+
| --- | --- |
90+
| Type | `counter` |
91+
| Labels | `rule` |
92+
| Recorded when | The controller marks bootstrap as completed for a node under a bootstrap-only rule |
6893

69-
The controller exposes additional metrics for fine-grained monitoring across three key dimensions:
94+
-----
7095

71-
### Latency Metrics
96+
## Extended Health and SLI Metrics
7297

73-
#### `node_readiness_reconciliation_latency_seconds`
98+
### `node_readiness_reconciliation_latency_seconds`
7499

75100
End-to-end latency from node condition change to taint operation completion.
76101

77102
| Property | Value |
78103
| --- | --- |
79104
| Type | `histogram` |
80-
| Labels | `rule`, `operation` |
81-
| Buckets | 0.1, 0.5, 1, 2, 5, 10, 30, 60, 120, 300 seconds |
82-
| Recorded when | A taint operation (add/remove) completes |
105+
| Labels | `rule`, `operation` (`add_taint`, `remove_taint`) |
106+
| Buckets | `0.01`, `0.05`, `0.1`, `0.5`, `1`, `2`, `5`, `10`, `30`, `60`, `120`, `300` seconds |
107+
| Recorded when | A taint operation completes |
83108

84-
**Use Case:** Measure how quickly the controller responds to condition changes. Critical for understanding end-to-end reconciliation performance.
109+
**Use case:** Measure how quickly the controller responds to node condition changes in the cluster.
85110

86-
#### `node_readiness_reconciliation_duration_seconds`
111+
### `node_readiness_bootstrap_duration_seconds`
87112

88-
Duration of complete reconciliation cycle including all operations.
113+
Time from node creation to bootstrap completion for bootstrap-only rules.
89114

90115
| Property | Value |
91116
| --- | --- |
92117
| Type | `histogram` |
93-
| Labels | `rule`, `node` |
94-
| Buckets | 0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10, 30 seconds |
95-
| Recorded when | A node reconciliation completes |
96-
97-
**Use Case:** Monitor full reconciliation performance per node. Different from evaluation duration as it includes all operations.
98-
99-
### Freshness Metrics
100-
101-
#### `node_readiness_last_reconciliation_timestamp_seconds`
102-
103-
Unix timestamp of the last successful reconciliation for a node and rule.
104-
105-
| Property | Value |
106-
| --- | --- |
107-
| Type | `gauge` |
108-
| Labels | `rule`, `node` |
109-
| Recorded when | A node reconciliation completes successfully |
118+
| Labels | `rule` |
119+
| Buckets | `1`, `5`, `10`, `30`, `60`, `120`, `300`, `600`, `1200` seconds |
120+
| Recorded when | Bootstrap completion is observed for a node under a bootstrap-only rule |
110121

111-
**Use Case:** Identify stale reconciliations and detect when nodes stop being reconciled.
122+
**Use case:** Measure the actual time nodes take to become fully provisioned and bootstrap-complete.
112123

113-
#### `node_readiness_reconciliation_lag_seconds`
124+
### `node_readiness_nodes_by_state`
114125

115-
Time since last reconciliation for a node and rule (freshness indicator).
126+
Number of nodes in each readiness state per rule.
116127

117128
| Property | Value |
118129
| --- | --- |
119130
| Type | `gauge` |
120-
| Labels | `rule`, `node` |
121-
| Recorded when | A node reconciliation completes |
131+
| Labels | `rule`, `state` (`ready`, `not_ready`, `bootstrapping`) |
132+
| Recorded when | A rule reconciliation completes |
122133

123-
**Use Case:** Monitor reconciliation freshness and detect lag in processing. Essential for detecting performance degradation.
134+
**Use case:** Track aggregate node health without introducing per-node metric cardinality, which keeps the controller's memory footprint lean.
124135

125-
#### `node_readiness_condition_transition_timestamp_seconds`
136+
### `node_readiness_rule_last_reconciliation_timestamp_seconds`
126137

127-
Unix timestamp of the last condition transition for a node.
138+
Unix timestamp of the last reconciliation for a rule.
128139

129140
| Property | Value |
130141
| --- | --- |
131142
| Type | `gauge` |
132-
| Labels | `rule`, `node`, `condition_type` |
133-
| Recorded when | A node condition is evaluated |
134-
135-
**Use Case:** Measure freshness of condition data and detect stale conditions.
136-
137-
### Uptime / Availability Metrics
138-
139-
#### `node_readiness_evaluations_total`
140-
141-
Total number of rule evaluations performed per node.
142-
143-
| Property | Value |
144-
| --- | --- |
145-
| Type | `counter` |
146-
| Labels | `rule`, `node` |
147-
| Recorded when | A rule evaluation starts |
148-
149-
**Use Case:** Understand reconciliation frequency and controller uptime. Useful for detecting when reconciliations stop.
150-
151-
#### `node_readiness_taint_state`
152-
153-
Current taint state for a node and rule.
154-
155-
| Property | Value |
156-
| --- | --- |
157-
| Type | `gauge` |
158-
| Labels | `rule`, `node`, `taint_key` |
159-
| Values | 1 (taint present), 0 (taint absent) |
160-
| Recorded when | A node reconciliation completes |
161-
162-
**Use Case:** Track taint presence for uptime/availability monitoring. Enables real-time visibility into node readiness state.
143+
| Labels | `rule` |
144+
| Recorded when | A rule reconciliation loop successfully completes |
163145

164-
#### `node_readiness_conditions_satisfied`
146+
**Use case:** Detect rules that may be stuck or not reconciling frequently enough.
165147

166-
Whether all conditions are satisfied for a node and rule.
148+
-----
167149

168-
| Property | Value |
169-
| --- | --- |
170-
| Type | `gauge` |
171-
| Labels | `rule`, `node` |
172-
| Values | 1 (all satisfied), 0 (not satisfied) |
173-
| Recorded when | A node reconciliation completes |
150+
## Example Queries & SLOs
174151

175-
**Use Case:** Track node readiness state for SLO monitoring. Essential for availability tracking.
152+
### Latency Monitoring & SLOs
176153

177-
## Example Queries
154+
**Objective:** 95% of internal evaluations complete within 50 milliseconds (0.05s).
178155

179-
### Latency Monitoring
156+
```promql
157+
# Percentage of evaluations completing within 50ms
158+
sum(rate(node_readiness_evaluation_duration_seconds_bucket{le="0.05"}[5m])) /
159+
sum(rate(node_readiness_evaluation_duration_seconds_count[5m])) * 100
160+
```
180161

181162
```promql
182-
# P95 reconciliation latency
163+
# P95 End-to-End Reconciliation Latency across all rules
183164
histogram_quantile(0.95,
184-
rate(node_readiness_reconciliation_latency_seconds_bucket[5m])
165+
sum by (le) (
166+
rate(node_readiness_reconciliation_latency_seconds_bucket[5m])
167+
)
185168
)
186-
187-
# Slow reconciliations (>5s)
188-
sum(rate(node_readiness_reconciliation_duration_seconds_bucket{le="5"}[5m]))
189169
```
190170

191-
### Freshness Monitoring
171+
### Freshness Monitoring & SLOs
192172

193-
```promql
194-
# Average reconciliation lag by rule
195-
avg(node_readiness_reconciliation_lag_seconds) by (rule)
196-
197-
# Nodes not reconciled in last 5 minutes
198-
(time() - node_readiness_last_reconciliation_timestamp_seconds) > 300
199-
```
200-
201-
### Availability Monitoring
173+
**Objective:** Every rule has reconciled within the last 2 minutes.
202174

203175
```promql
204-
# Percentage of nodes with satisfied conditions
205-
avg(node_readiness_conditions_satisfied) by (rule) * 100
206-
207-
# Count of tainted (not ready) nodes
208-
sum(node_readiness_taint_state) by (rule)
176+
# Alert if any rule has not reconciled in the last 120 seconds
177+
(time() - node_readiness_rule_last_reconciliation_timestamp_seconds) > 120
209178
```
210179

211-
## SLI/SLO Examples
180+
### Availability Monitoring & SLOs
212181

213-
### Latency SLO
214-
**Objective:** 95% of reconciliations complete within 5 seconds
182+
**Objective:** 99.9% of targeted nodes are ready.
215183

216184
```promql
217-
sum(rate(node_readiness_reconciliation_duration_seconds_bucket{le="5"}[5m])) /
218-
sum(rate(node_readiness_reconciliation_duration_seconds_count[5m])) * 100
219-
```
220-
221-
### Freshness SLO
222-
**Objective:** 99% of nodes reconciled within last 2 minutes
223-
224-
```promql
225-
sum(node_readiness_reconciliation_lag_seconds < 120) /
226-
count(node_readiness_reconciliation_lag_seconds) * 100
227-
```
185+
# Percentage of ready nodes globally
186+
100 * sum(node_readiness_nodes_by_state{state="ready"}) / sum(node_readiness_nodes_by_state)
228187
229-
### Availability SLO
230-
**Objective:** 99.9% of nodes have conditions satisfied
231-
232-
```promql
233-
avg(node_readiness_conditions_satisfied) * 100
188+
# Percentage of ready nodes per rule
189+
100 * sum by (rule) (node_readiness_nodes_by_state{state="ready"}) / sum by (rule) (node_readiness_nodes_by_state)
234190
```
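
The per-rule availability figure is the `ready` count divided by the total across all states for the same rule. Here is a small Go sketch of that calculation, assuming the state values documented for `node_readiness_nodes_by_state`; `readyPercent`, the rule names, and the counts are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// readyPercent takes node counts keyed by rule and then by state
// (ready, not_ready, bootstrapping) and returns 100 * ready / total per rule.
// Rules with no nodes are skipped to avoid dividing by zero.
func readyPercent(byRule map[string]map[string]float64) map[string]float64 {
	out := make(map[string]float64)
	for rule, states := range byRule {
		total := 0.0
		for _, n := range states {
			total += n
		}
		if total == 0 {
			continue
		}
		out[rule] = 100 * states["ready"] / total
	}
	return out
}

func main() {
	// Invented sample counts.
	pct := readyPercent(map[string]map[string]float64{
		"gpu-ready": {"ready": 18, "not_ready": 1, "bootstrapping": 1},
		"cni-ready": {"ready": 20},
	})
	rules := make([]string, 0, len(pct))
	for r := range pct {
		rules = append(rules, r)
	}
	sort.Strings(rules)
	for _, r := range rules {
		fmt.Printf("%s: %.1f%% ready\n", r, pct[r])
	}
}
```

Both sides of the ratio are aggregated to the same key (the rule), which is also what the PromQL division needs for its label sets to match.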
235191

236-
## Grafana Dashboard
192+
## Monitoring and Scale Testing
237193

238-
A comprehensive Grafana dashboard with panels for all metrics is available. See:
239-
- [Detailed Metrics Documentation](../../../metrics.md)
240-
- [Grafana Dashboard Panels](../../../grafana-dashboard-panels.md)
241-
- [Scale Testing Setup](../../../../hack/test-workloads/scale/README.md)
194+
For an end-to-end monitoring setup with Prometheus and Grafana during scale tests, see the [scale testing guide](../../../../hack/test-workloads/scale/README.md).
242195

243196
## Alerting Recommendations
244197

245-
Set up alerts for:
246-
- **High Latency:** P95 latency > 10s for 5 minutes
247-
- **Stale Reconciliations:** Lag > 5 minutes
248-
- **High Failure Rate:** Failures > 0.1/sec for 5 minutes
249-
- **Low Availability:** < 95% nodes ready for 10 minutes
250-
251-
See the [detailed metrics documentation](../../../metrics.md) for complete alerting rule examples.
252-
| `reason` | Failure label recorded by the controller | `EvaluationError`, `AddTaintError`, `RemoveTaintError` |
253-
254-
### `node_readiness_bootstrap_completed_total`
255-
256-
Total number of nodes that have completed bootstrap.
257-
258-
| Property | Value |
259-
| --- | --- |
260-
| Type | `counter` |
261-
| Labels | `rule` |
262-
| Recorded when | The controller marks bootstrap as completed for a node under a bootstrap-only rule |
263-
264-
#### Labels
198+
Typical alerts to consider:
265199

266-
| Label | Description | Values |
267-
| --- | --- | --- |
268-
| `rule` | `NodeReadinessRule` name | Any rule name |
200+
- **High latency:** P95 reconciliation latency above 10s for 5 minutes.
201+
- **Stale reconciliations:** Any rule with no reconciliation for more than 5 minutes.
202+
- **High failure rate:** Sustained increase in `node_readiness_failures_total`.
203+
- **Low availability:** Ready-node percentage below your target threshold for a sustained period.
