Commit e84c8c2

committed
Add scale tests with monitoring
1 parent ee2c4b5 commit e84c8c2

14 files changed

+2098
-244
lines changed

cmd/readiness-condition-reporter/main_test.go

Lines changed: 0 additions & 1 deletion
@@ -178,7 +178,6 @@ func TestUpdateNodeCondition(t *testing.T) {
178178
if foundCondition == nil {
179179
t.Fatal("Condition not found")
180180
}
181-
182181
if foundCondition.Status != tt.wantStatus {
183182
t.Errorf("Condition status = %v, want %v", foundCondition.Status, tt.wantStatus)
184183
}
Lines changed: 150 additions & 27 deletions
@@ -1,16 +1,48 @@
11
# Monitoring
22

3-
Node Readiness Controller exposes Prometheus-compatible metrics. This page describes the Prometheus metrics exposed by Node Readiness Controller for monitoring rule evaluation, taint operations, failures, and bootstrap progress.
3+
The Node Readiness Controller exposes Prometheus-compatible metrics. This page documents the metrics currently registered by the controller and how they can be used for monitoring rule evaluation, taint operations, failures, bootstrap progress, and rule health.
44

55
## Metrics Endpoint
66

7-
The controller serves metrics on `/metrics` only when metrics are explicitly enabled. Depending on the installation, the endpoint is served either over HTTP or over HTTPS. See [Installation](../user-guide/installation.md) for deployment details.
7+
The controller serves metrics on `/metrics` only when metrics are explicitly enabled.
88

9-
## Supported Metrics
9+
Depending on the installation, the endpoint is exposed as:
10+
11+
- HTTP on port `8080` when the standard Prometheus component is enabled.
12+
- HTTPS on port `8443` when the Prometheus TLS component is enabled.
13+
14+
See [Installation](../user-guide/installation.md) for deployment details.
15+
16+
## Metric Lifecycle Management
17+
18+
When a `NodeReadinessRule` is deleted, the controller automatically cleans up the associated rule-labeled Prometheus series. This prevents stale metrics from remaining visible in dashboards and alerts.
19+
20+
**Metrics cleaned up on rule deletion:**
21+
22+
- `node_readiness_taint_operations_total{rule="..."}`
23+
- `node_readiness_evaluation_duration_seconds{rule="..."}`
24+
- `node_readiness_failures_total{rule="..."}`
25+
- `node_readiness_bootstrap_completed_total{rule="..."}`
26+
- `node_readiness_reconciliation_latency_seconds{rule="..."}`
27+
- `node_readiness_bootstrap_duration_seconds{rule="..."}`
28+
- `node_readiness_nodes_by_state{rule="..."}`
29+
- `node_readiness_rule_last_reconciliation_timestamp_seconds{rule="..."}`
30+
31+
This ensures that:
32+
33+
- Deleted rules do not continue to appear in dashboards with stale values.
34+
- Memory usage does not grow unbounded from removed rules.
35+
- Metric cardinality stays bounded by the set of rules that currently exist.
36+
37+
**Note:** The global `node_readiness_rules_total` gauge is updated separately. Rule-labeled metrics are explicitly deleted during rule cleanup.
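
The cleanup described above can be illustrated with a small sketch. It uses a toy in-memory store rather than a real metrics registry, and the `seriesKey`, `store`, and `deleteRule` names are hypothetical, not taken from the controller. A controller built on `client_golang` could plausibly get the same effect by calling `DeletePartialMatch` with the `rule` label on each metric vector. Whether this controller does so is an assumption.

```go
package main

import "fmt"

// seriesKey identifies one time series: a metric name plus its label values.
// Hypothetical toy type, not the controller's implementation.
type seriesKey struct {
	metric string
	rule   string
	extra  string // operation, reason, or state; empty when the metric has none
}

// store holds the current value of every rule-labeled series.
type store map[seriesKey]float64

// deleteRule removes every series whose rule label equals name and
// returns how many series were dropped.
func (s store) deleteRule(name string) int {
	removed := 0
	for k := range s {
		if k.rule == name {
			delete(s, k) // deleting during range is safe in Go
			removed++
		}
	}
	return removed
}

func main() {
	s := store{
		{"node_readiness_taint_operations_total", "gpu-rule", "add"}:     3,
		{"node_readiness_taint_operations_total", "gpu-rule", "remove"}:  1,
		{"node_readiness_failures_total", "cni-rule", "EvaluationError"}: 2,
	}
	// Simulate deletion of the rule named "gpu-rule".
	fmt.Println(s.deleteRule("gpu-rule"), len(s)) // prints "2 1"
}
```

Series for other rules are untouched, which is why the global `node_readiness_rules_total` gauge has to be updated on a separate path.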
38+
39+
-----
40+
41+
## Core Metrics
1042

1143
### `node_readiness_rules_total`
1244

13-
Number of `NodeReadinessRule` objects tracked by the controller.
45+
Number of `NodeReadinessRule` objects currently tracked by the controller.
1446

1547
| Property | Value |
1648
| --- | --- |
@@ -25,24 +57,17 @@ Total number of taint operations performed by the controller.
2557
| Property | Value |
2658
| --- | --- |
2759
| Type | `counter` |
28-
| Labels | `rule`, `operation` |
60+
| Labels | `rule`, `operation` (`add`, `remove`) |
2961
| Recorded when | The controller successfully adds or removes a taint |
3062

31-
#### Labels
32-
33-
| Label | Description | Values |
34-
| --- | --- | --- |
35-
| `rule` | `NodeReadinessRule` name | Any rule name |
36-
| `operation` | Taint operation performed by the controller | `add`, `remove` |
37-
3863
### `node_readiness_evaluation_duration_seconds`
3964

40-
Duration of rule evaluations.
65+
Duration of the controller's internal rule evaluations.
4166

4267
| Property | Value |
4368
| --- | --- |
4469
| Type | `histogram` |
45-
| Labels | none |
70+
| Labels | `rule` |
4671
| Buckets | Prometheus default histogram buckets |
4772
| Recorded when | The controller evaluates a rule against a node |
4873

@@ -53,15 +78,8 @@ Total number of failure events recorded by the controller.
5378
| Property | Value |
5479
| --- | --- |
5580
| Type | `counter` |
56-
| Labels | `rule`, `reason` |
57-
| Recorded when | The controller records an evaluation failure or taint add/remove failure |
58-
59-
#### Labels
60-
61-
| Label | Description | Values |
62-
| --- | --- | --- |
63-
| `rule` | `NodeReadinessRule` name | Any rule name |
64-
| `reason` | Failure label recorded by the controller | `EvaluationError`, `AddTaintError`, `RemoveTaintError` |
81+
| Labels | `rule`, `reason` (`EvaluationError`, `AddTaintError`, `RemoveTaintError`) |
82+
| Recorded when | The controller encounters an error evaluating or patching a node |
6583

6684
### `node_readiness_bootstrap_completed_total`
6785

@@ -73,8 +91,113 @@ Total number of nodes that have completed bootstrap.
7391
| Labels | `rule` |
7492
| Recorded when | The controller marks bootstrap as completed for a node under a bootstrap-only rule |
7593

76-
#### Labels
94+
-----
95+
96+
## Extended Health and SLI Metrics
97+
98+
### `node_readiness_reconciliation_latency_seconds`
99+
100+
End-to-end latency from node condition change to taint operation completion.
101+
102+
| Property | Value |
103+
| --- | --- |
104+
| Type | `histogram` |
105+
| Labels | `rule`, `operation` (`add_taint`, `remove_taint`) |
106+
| Buckets | `0.01`, `0.05`, `0.1`, `0.5`, `1`, `2`, `5`, `10`, `30`, `60`, `120`, `300` seconds |
107+
| Recorded when | A taint operation completes |
108+
109+
**Use case:** Measure how quickly the controller responds to node condition changes in the cluster.
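
To make the bucket layout concrete, the following sketch shows how a handful of latency observations would be counted against the bounds listed above. Prometheus histogram buckets are cumulative: each `le` bucket counts every observation less than or equal to its bound, and an implicit `+Inf` bucket counts all of them. The `cumulativeCounts` helper and the sample values are illustrative assumptions, not controller code.

```go
package main

import "fmt"

// latencyBuckets are the upper bounds (seconds) documented for
// node_readiness_reconciliation_latency_seconds.
var latencyBuckets = []float64{0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10, 30, 60, 120, 300}

// cumulativeCounts returns, for each bound, how many observations are <= bound,
// followed by one final entry for the implicit +Inf bucket.
func cumulativeCounts(bounds, observations []float64) []uint64 {
	counts := make([]uint64, len(bounds)+1)
	for _, v := range observations {
		for i, b := range bounds {
			if v <= b {
				counts[i]++
			}
		}
		counts[len(bounds)]++ // the +Inf bucket counts every observation
	}
	return counts
}

func main() {
	// Hypothetical reconciliation latencies in seconds.
	obs := []float64{0.03, 0.4, 0.4, 12, 400}
	counts := cumulativeCounts(latencyBuckets, obs)
	for i, b := range latencyBuckets {
		fmt.Printf("le=%g\t%d\n", b, counts[i])
	}
	fmt.Printf("le=+Inf\t%d\n", counts[len(latencyBuckets)])
}
```

An observation above 300 seconds lands only in `+Inf`, so quantile estimates for very slow reconciliations are capped at the highest finite bound.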
110+
111+
### `node_readiness_bootstrap_duration_seconds`
112+
113+
Time from node creation to bootstrap completion for bootstrap-only rules.
114+
115+
| Property | Value |
116+
| --- | --- |
117+
| Type | `histogram` |
118+
| Labels | `rule` |
119+
| Buckets | `1`, `5`, `10`, `30`, `60`, `120`, `300`, `600`, `1200` seconds |
120+
| Recorded when | Bootstrap completion is observed for a node under a bootstrap-only rule |
121+
122+
**Use case:** Measure the actual time nodes take to become fully provisioned and bootstrap-complete.
123+
124+
### `node_readiness_nodes_by_state`
125+
126+
Number of nodes in each readiness state per rule.
127+
128+
| Property | Value |
129+
| --- | --- |
130+
| Type | `gauge` |
131+
| Labels | `rule`, `state` (`ready`, `not_ready`, `bootstrapping`) |
132+
| Recorded when | A rule reconciliation completes |
133+
134+
**Use case:** Track aggregate node health without introducing per-node metric cardinality, which keeps the controller's memory footprint small.
135+
136+
### `node_readiness_rule_last_reconciliation_timestamp_seconds`
137+
138+
Unix timestamp of the last reconciliation for a rule.
139+
140+
| Property | Value |
141+
| --- | --- |
142+
| Type | `gauge` |
143+
| Labels | `rule` |
144+
| Recorded when | A rule reconciliation loop successfully completes |
145+
146+
**Use case:** Detect rules that may be stuck or not reconciling frequently enough.
147+
148+
-----
149+
150+
## Example Queries & SLOs
151+
152+
### Latency Monitoring & SLOs
153+
154+
**Objective:** 95% of internal evaluations complete within 50 milliseconds (0.05s).
155+
156+
```promql
157+
# Percentage of evaluations completing within 50ms
158+
sum(rate(node_readiness_evaluation_duration_seconds_bucket{le="0.05"}[5m])) /
159+
sum(rate(node_readiness_evaluation_duration_seconds_count[5m])) * 100
160+
```
161+
162+
```promql
163+
# P95 End-to-End Reconciliation Latency across all rules
164+
histogram_quantile(0.95,
165+
sum by (le) (
166+
rate(node_readiness_reconciliation_latency_seconds_bucket[5m])
167+
)
168+
)
169+
```
170+
171+
### Freshness Monitoring & SLOs
172+
173+
**Objective:** Every rule has reconciled within the last 2 minutes.
174+
175+
```promql
176+
# Alert if any rule has not reconciled in the last 120 seconds
177+
(time() - node_readiness_rule_last_reconciliation_timestamp_seconds) > 120
178+
```
179+
180+
### Availability Monitoring & SLOs
181+
182+
**Objective:** 99.9% of targeted nodes are ready.
183+
184+
```promql
185+
# Percentage of ready nodes globally
186+
100 * sum(node_readiness_nodes_by_state{state="ready"}) / sum(node_readiness_nodes_by_state)
187+
188+
# Percentage of ready nodes per rule
189+
# Aggregate both sides by rule so the label sets match for the division
100 * sum by (rule) (node_readiness_nodes_by_state{state="ready"}) / sum by (rule) (node_readiness_nodes_by_state)
190+
```
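
The arithmetic behind the two availability queries can be sketched in Go as follows. The `stateCounts` type and `readyPercent` function are hypothetical. The sample counts stand in for values of `node_readiness_nodes_by_state`.

```go
package main

import "fmt"

// stateCounts maps rule -> state -> node count, mirroring the
// rule and state labels of node_readiness_nodes_by_state.
type stateCounts map[string]map[string]float64

// readyPercent returns the ready-node percentage per rule and across all rules.
// Rules with no nodes are skipped instead of producing a 0/0 result.
func readyPercent(c stateCounts) (perRule map[string]float64, global float64) {
	perRule = make(map[string]float64)
	var ready, total float64
	for rule, states := range c {
		var ruleTotal float64
		for _, n := range states {
			ruleTotal += n
		}
		if ruleTotal == 0 {
			continue
		}
		perRule[rule] = 100 * states["ready"] / ruleTotal
		ready += states["ready"]
		total += ruleTotal
	}
	if total > 0 {
		global = 100 * ready / total
	}
	return perRule, global
}

func main() {
	c := stateCounts{
		"gpu-rule": {"ready": 8, "not_ready": 1, "bootstrapping": 1},
		"cni-rule": {"ready": 10, "not_ready": 0, "bootstrapping": 0},
	}
	perRule, global := readyPercent(c)
	fmt.Println(perRule["gpu-rule"], perRule["cni-rule"], global) // prints "80 100 90"
}
```

If a node can be targeted by more than one rule, the global figure counts it once per rule, so the per-rule percentages are likely the more reliable basis for an SLO.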
191+
192+
## Monitoring and Scale Testing
193+
194+
For an end-to-end monitoring setup with Prometheus and Grafana during scale tests, see the [scale testing guide](../../../../hack/test-workloads/scale/README.md).
195+
196+
## Alerting Recommendations
197+
198+
Typical alerts to consider:
77199

78-
| Label | Description | Values |
79-
| --- | --- | --- |
80-
| `rule` | `NodeReadinessRule` name | Any rule name |
200+
- **High latency:** P95 reconciliation latency above 10s for 5 minutes.
201+
- **Stale reconciliations:** Any rule with no reconciliation for more than 5 minutes.
202+
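- **High failure rate:** `node_readiness_failures_total` increasing steadily, for example a non-zero `rate(node_readiness_failures_total[5m])` sustained across several evaluation intervals.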
203+
- **Low availability:** Ready-node percentage below your target threshold for a sustained period.
