kubernetes-sigs
diff --git a/Makefile b/Makefile
Lines changed: 1 addition & 1 deletion
diff --git a/cmd/readiness-condition-reporter/main_test.go b/cmd/readiness-condition-reporter/main_test.go
Lines changed: 1 addition & 1 deletion
diff --git a/docs/TEST_README.md b/docs/TEST_README.md
Lines changed: 1 addition & 0 deletions
diff --git a/docs/book/src/operations/monitoring.md b/docs/book/src/operations/monitoring.md
Lines changed: 100 additions & 165 deletions
@@ -196,7 +196,7 @@ docker-push: ## Push docker image with the manager.
 # architectures. (i.e. make docker-buildx IMG_PREFIX=myregistry/mypoperator IMG_TAG=0.0.1). To use this option you need to:
 # - be able to use docker buildx. More info: https://docs.docker.com/build/buildx/
 # - have enabled BuildKit. More info: https://docs.docker.com/develop/develop-images/build_enhancements/
-# - be able to push the image to your registry (i.e. if you do not set a valid value via IMG_PREFIX:${IMG_TAG} then the export will fail)
+# - be able to push the image to your registry (i.e. if you do not set a valid value via ${IMG_PREFIX}:${IMG_TAG} then the export will fail)
 # To adequately provide solutions that are compatible with multiple platforms, you should consider using this option.
 PLATFORMS ?= linux/arm64,linux/amd64
 .PHONY: docker-buildx
 
@@ -154,8 +154,8 @@ func TestUpdateNodeCondition(t *testing.T) {
 
 			if foundCondition == nil {
 				t.Fatal("Condition not found")
+				return
 			}
-
 			if foundCondition.Status != tt.wantStatus {
 				t.Errorf("Condition status = %v, want %v", foundCondition.Status, tt.wantStatus)
 			}
 
@@ -154,3 +154,4 @@ This section tests how the controller handles new nodes being added to the clust
 
 ```bash
 kind delete cluster --name nrr-test
+```
@@ -1,18 +1,48 @@
 # Monitoring
 
-Node Readiness Controller exposes Prometheus-compatible metrics. This page describes the Prometheus metrics exposed by Node Readiness Controller for monitoring rule evaluation, taint operations, failures, and bootstrap progress.
-
-The controller now includes **enhanced SLI metrics** for tracking uptime, freshness, and lag - essential for production monitoring and SLO compliance.
+The Node Readiness Controller exposes Prometheus-compatible metrics. This page documents the metrics currently registered by the controller and how they can be used for monitoring rule evaluation, taint operations, failures, bootstrap progress, and rule health.
 
 ## Metrics Endpoint
 
-The controller serves metrics on `/metrics` only when metrics are explicitly enabled. Depending on the installation, the endpoint is served either over HTTP or over HTTPS. See [Installation](../user-guide/installation.md) for deployment details.
+The controller serves metrics on `/metrics` only when metrics are explicitly enabled.
+
+Depending on the installation, the endpoint is exposed as:
+
+  - HTTP on port `8080` when the standard Prometheus component is enabled.
+  - HTTPS on port `8443` when the Prometheus TLS component is enabled.
+
+See [Installation](../user-guide/installation.md) for deployment details.
+
+## Metric Lifecycle Management
+
+When a `NodeReadinessRule` is deleted, the controller automatically cleans up the associated rule-labeled Prometheus series. This prevents stale metrics from remaining visible in dashboards and alerts.
+
+**Metrics cleaned up on rule deletion:**
+
+  - `node_readiness_taint_operations_total{rule="..."}`
+  - `node_readiness_evaluation_duration_seconds{rule="..."}`
+  - `node_readiness_failures_total{rule="..."}`
+  - `node_readiness_bootstrap_completed_total{rule="..."}`
+  - `node_readiness_reconciliation_latency_seconds{rule="..."}`
+  - `node_readiness_bootstrap_duration_seconds{rule="..."}`
+  - `node_readiness_nodes_by_state{rule="..."}`
+  - `node_readiness_rule_last_reconciliation_timestamp_seconds{rule="..."}`
+
+This ensures that:
+
+  - Deleted rules do not continue to appear in dashboards with stale values.
+  - Memory usage does not grow unbounded from removed rules.
+  - Metric cardinality reflects only the rules that currently exist.
+
+**Note:** The global `node_readiness_rules_total` gauge is updated separately. Rule-labeled metrics are explicitly deleted during rule cleanup.
+
+-----
 
 ## Core Metrics
 
 ### `node_readiness_rules_total`
 
-Number of `NodeReadinessRule` objects tracked by the controller.
+Number of `NodeReadinessRule` objects currently tracked by the controller.
 
 | Property | Value |
 | --- | --- |
@@ -27,24 +57,17 @@ Total number of taint operations performed by the controller.
 | Property | Value |
 | --- | --- |
 | Type | `counter` |
-| Labels | `rule`, `operation` |
+| Labels | `rule`, `operation` (`add`, `remove`) |
 | Recorded when | The controller successfully adds or removes a taint |
 
-#### Labels
-
-| Label | Description | Values |
-| --- | --- | --- |
-| `rule` | `NodeReadinessRule` name | Any rule name |
-| `operation` | Taint operation performed by the controller | `add`, `remove` |
-
 ### `node_readiness_evaluation_duration_seconds`
 
-Duration of rule evaluations.
+Duration of the controller's internal rule evaluations.
 
 | Property | Value |
 | --- | --- |
 | Type | `histogram` |
-| Labels | none |
+| Labels | `rule` |
 | Buckets | Prometheus default histogram buckets |
 | Recorded when | The controller evaluates a rule against a node |
 
@@ -55,214 +78,126 @@ Total number of failure events recorded by the controller.
 | Property | Value |
 | --- | --- |
 | Type | `counter` |
-| Labels | `rule`, `reason` |
-| Recorded when | The controller records an evaluation failure or taint add/remove failure |
+| Labels | `rule`, `reason` (`EvaluationError`, `AddTaintError`, `RemoveTaintError`) |
+| Recorded when | The controller encounters an error evaluating or patching a node |
 
-#### Labels
+### `node_readiness_bootstrap_completed_total`
 
-| Label | Description | Values |
-| --- | --- | --- |
-| `rule` | `NodeReadinessRule` name | Any rule name |
+Total number of nodes that have completed bootstrap.
 
-## Enhanced SLI Metrics
+| Property | Value |
+| --- | --- |
+| Type | `counter` |
+| Labels | `rule` |
+| Recorded when | The controller marks bootstrap as completed for a node under a bootstrap-only rule |
 
-The controller exposes additional metrics for fine-grained monitoring across three key dimensions:
+-----
 
-### Latency Metrics
+## Extended Health and SLI Metrics
 
-#### `node_readiness_reconciliation_latency_seconds`
+### `node_readiness_reconciliation_latency_seconds`
 
 End-to-end latency from node condition change to taint operation completion.
 
 | Property | Value |
 | --- | --- |
 | Type | `histogram` |
-| Labels | `rule`, `operation` |
-| Buckets | 0.1, 0.5, 1, 2, 5, 10, 30, 60, 120, 300 seconds |
-| Recorded when | A taint operation (add/remove) completes |
+| Labels | `rule`, `operation` (`add_taint`, `remove_taint`) |
+| Buckets | `0.01`, `0.05`, `0.1`, `0.5`, `1`, `2`, `5`, `10`, `30`, `60`, `120`, `300` seconds |
+| Recorded when | A taint operation completes |
 
-**Use Case:** Measure how quickly the controller responds to condition changes. Critical for understanding end-to-end reconciliation performance.
+**Use case:** Measure how quickly the controller responds to node condition changes in the cluster.
 
-#### `node_readiness_reconciliation_duration_seconds`
+### `node_readiness_bootstrap_duration_seconds`
 
-Duration of complete reconciliation cycle including all operations.
+Time from node creation to bootstrap completion for bootstrap-only rules.
 
 | Property | Value |
 | --- | --- |
 | Type | `histogram` |
-| Labels | `rule`, `node` |
-| Buckets | 0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10, 30 seconds |
-| Recorded when | A node reconciliation completes |
-
-**Use Case:** Monitor full reconciliation performance per node. Different from evaluation duration as it includes all operations.
-
-### Freshness Metrics
-
-#### `node_readiness_last_reconciliation_timestamp_seconds`
-
-Unix timestamp of the last successful reconciliation for a node and rule.
-
-| Property | Value |
-| --- | --- |
-| Type | `gauge` |
-| Labels | `rule`, `node` |
-| Recorded when | A node reconciliation completes successfully |
+| Labels | `rule` |
+| Buckets | `1`, `5`, `10`, `30`, `60`, `120`, `300`, `600`, `1200` seconds |
+| Recorded when | Bootstrap completion is observed for a node under a bootstrap-only rule |
 
-**Use Case:** Identify stale reconciliations and detect when nodes stop being reconciled.
+**Use case:** Measure the actual time nodes take to become fully provisioned and bootstrap-complete.
 
-#### `node_readiness_reconciliation_lag_seconds`
+### `node_readiness_nodes_by_state`
 
-Time since last reconciliation for a node and rule (freshness indicator).
+Number of nodes in each readiness state per rule.
 
 | Property | Value |
 | --- | --- |
 | Type | `gauge` |
-| Labels | `rule`, `node` |
-| Recorded when | A node reconciliation completes |
+| Labels | `rule`, `state` (`ready`, `not_ready`, `bootstrapping`) |
+| Recorded when | A rule reconciliation completes |
 
-**Use Case:** Monitor reconciliation freshness and detect lag in processing. Essential for detecting performance degradation.
+**Use case:** Track aggregate node health without introducing per-node metric cardinality, which keeps the controller's memory footprint small.
 
-#### `node_readiness_condition_transition_timestamp_seconds`
+### `node_readiness_rule_last_reconciliation_timestamp_seconds`
 
-Unix timestamp of the last condition transition for a node.
+Unix timestamp of the last reconciliation for a rule.
 
 | Property | Value |
 | --- | --- |
 | Type | `gauge` |
-| Labels | `rule`, `node`, `condition_type` |
-| Recorded when | A node condition is evaluated |
-
-**Use Case:** Measure freshness of condition data and detect stale conditions.
-
-### Uptime / Availability Metrics
-
-#### `node_readiness_evaluations_total`
-
-Total number of rule evaluations performed per node.
-
-| Property | Value |
-| --- | --- |
-| Type | `counter` |
-| Labels | `rule`, `node` |
-| Recorded when | A rule evaluation starts |
-
-**Use Case:** Understand reconciliation frequency and controller uptime. Useful for detecting when reconciliations stop.
-
-#### `node_readiness_taint_state`
-
-Current taint state for a node and rule.
-
-| Property | Value |
-| --- | --- |
-| Type | `gauge` |
-| Labels | `rule`, `node`, `taint_key` |
-| Values | 1 (taint present), 0 (taint absent) |
-| Recorded when | A node reconciliation completes |
-
-**Use Case:** Track taint presence for uptime/availability monitoring. Enables real-time visibility into node readiness state.
+| Labels | `rule` |
+| Recorded when | A rule reconciliation loop successfully completes |
 
-#### `node_readiness_conditions_satisfied`
+**Use case:** Detect rules that may be stuck or not reconciling frequently enough.
 
-Whether all conditions are satisfied for a node and rule.
+-----
 
-| Property | Value |
-| --- | --- |
-| Type | `gauge` |
-| Labels | `rule`, `node` |
-| Values | 1 (all satisfied), 0 (not satisfied) |
-| Recorded when | A node reconciliation completes |
+## Example Queries & SLOs
 
-**Use Case:** Track node readiness state for SLO monitoring. Essential for availability tracking.
+### Latency Monitoring & SLOs
 
-## Example Queries
+**Objective:** 95% of internal evaluations complete within 50 milliseconds (0.05s).
 
-### Latency Monitoring
+```promql
+# Percentage of evaluations completing within 50ms
+sum(rate(node_readiness_evaluation_duration_seconds_bucket{le="0.05"}[5m])) /
+sum(rate(node_readiness_evaluation_duration_seconds_count[5m])) * 100
+```
 
 ```promql
-# P95 reconciliation latency
+# P95 End-to-End Reconciliation Latency across all rules
 histogram_quantile(0.95,
-  rate(node_readiness_reconciliation_latency_seconds_bucket[5m])
+  sum by (le) (
+    rate(node_readiness_reconciliation_latency_seconds_bucket[5m])
+  )
 )
-
-# Slow reconciliations (>5s)
-sum(rate(node_readiness_reconciliation_duration_seconds_bucket{le="5"}[5m]))
 ```
 
-### Freshness Monitoring
+### Freshness Monitoring & SLOs
 
-```promql
-# Average reconciliation lag by rule
-avg(node_readiness_reconciliation_lag_seconds) by (rule)
-
-# Nodes not reconciled in last 5 minutes
-(time() - node_readiness_last_reconciliation_timestamp_seconds) > 300
-```
-
-### Availability Monitoring
+**Objective:** Every rule has reconciled within the last 2 minutes.
 
 ```promql
-# Percentage of nodes with satisfied conditions
-avg(node_readiness_conditions_satisfied) by (rule) * 100
-
-# Count of tainted (not ready) nodes
-sum(node_readiness_taint_state) by (rule)
+# Alert if any rule has not reconciled in the last 120 seconds
+(time() - node_readiness_rule_last_reconciliation_timestamp_seconds) > 120
 ```
 
-## SLI/SLO Examples
+### Availability Monitoring & SLOs
 
-### Latency SLO
-**Objective:** 95% of reconciliations complete within 5 seconds
+**Objective:** 99.9% of targeted nodes are ready.
 
 ```promql
-sum(rate(node_readiness_reconciliation_duration_seconds_bucket{le="5"}[5m])) /
-sum(rate(node_readiness_reconciliation_duration_seconds_count[5m])) * 100
-```
-
-### Freshness SLO
-**Objective:** 99% of nodes reconciled within last 2 minutes
-
-```promql
-sum(node_readiness_reconciliation_lag_seconds < 120) /
-count(node_readiness_reconciliation_lag_seconds) * 100
-```
+# Percentage of ready nodes globally
+100 * sum(node_readiness_nodes_by_state{state="ready"}) / sum(node_readiness_nodes_by_state)
 
-### Availability SLO
-**Objective:** 99.9% of nodes have conditions satisfied
-
-```promql
-avg(node_readiness_conditions_satisfied) * 100
+# Percentage of ready nodes per rule (both sides aggregated by rule so the label sets match)
+100 * sum by (rule) (node_readiness_nodes_by_state{state="ready"}) / sum by (rule) (node_readiness_nodes_by_state)
 ```
 
-## Grafana Dashboard
+## Monitoring and Scale Testing
 
-A comprehensive Grafana dashboard with panels for all metrics is available. See:
-- [Detailed Metrics Documentation](../../../metrics.md)
-- [Grafana Dashboard Panels](../../../grafana-dashboard-panels.md)
-- [Scale Testing Setup](../../../../hack/test-workloads/scale/README.md)
+For an end-to-end monitoring setup with Prometheus and Grafana during scale tests, see the [scale testing guide](../../../../hack/test-workloads/scale/README.md).
 
 ## Alerting Recommendations
 
-Set up alerts for:
-- **High Latency:** P95 latency > 10s for 5 minutes
-- **Stale Reconciliations:** Lag > 5 minutes
-- **High Failure Rate:** Failures > 0.1/sec for 5 minutes
-- **Low Availability:** < 95% nodes ready for 10 minutes
-
-See the [detailed metrics documentation](../../../metrics.md) for complete alerting rule examples.
-| `reason` | Failure label recorded by the controller | `EvaluationError`, `AddTaintError`, `RemoveTaintError` |
-
-### `node_readiness_bootstrap_completed_total`
-
-Total number of nodes that have completed bootstrap.
-
-| Property | Value |
-| --- | --- |
-| Type | `counter` |
-| Labels | `rule` |
-| Recorded when | The controller marks bootstrap as completed for a node under a bootstrap-only rule |
-
-#### Labels
+Typical alerts to consider:
 
-| Label | Description | Values |
-| --- | --- | --- |
-| `rule` | `NodeReadinessRule` name | Any rule name |
+  - **High latency:** P95 reconciliation latency above 10s for 5 minutes.
+  - **Stale reconciliations:** Any rule with no reconciliation for more than 5 minutes.
+  - **High failure rate:** Sustained increase in `node_readiness_failures_total`.
+  - **Low availability:** Ready-node percentage below your target threshold for a sustained period.