
Commit df24a64

cleanup: doc updates on getting-started guide. (#135)
merging doc only updates
1 parent 7215277 commit df24a64

File tree

2 files changed (+80, -123 lines)

README.md

Lines changed: 4 additions & 4 deletions
```diff
@@ -64,13 +64,13 @@ Find a more detailed walkthrough of setting up Node Readiness Controller in your
 
 ## High-level Roadmap
-- [ ] Add documentation capturing design details
-- [ ] Metrics and alerting integration
-- [ ] Validation Webhook for rules
+- [X] Release v0.1.0
+- [X] Add documentation capturing design details
+- [X] Metrics and alerting integration
+- [X] Validation Webhook for rules
 - [ ] Improve logging and add debugging pointers
 - [ ] Performance optimizations for large clusters
 - [ ] Scale testing 1000+ nodes
-- [ ] Release v0.1.0
 
 ## Getting Involved
```
Lines changed: 76 additions & 119 deletions
@@ -1,132 +1,90 @@

# Getting Started

This guide covers how to use the Node Readiness Controller to define and enforce node readiness checks using `NodeReadinessRule` resources.

> **Prerequisites**: Node Readiness Controller must be installed. See [Installation](./installation.md).

## Creating a Readiness Rule

The core resource is the `NodeReadinessRule` CRD. It defines a set of conditions that a node must meet to be considered "workload ready". If the conditions are not met, the controller applies a specific taint to the node.

### Basic Example: Storage Readiness

Here is a rule that ensures a storage plugin is registered before allowing workloads that need it.

```yaml
apiVersion: readiness.node.x-k8s.io/v1alpha1
kind: NodeReadinessRule
metadata:
  name: storage-readiness-rule
spec:
  # The label selector determines which nodes this rule applies to
  nodeSelector:
    matchLabels:
      storage-backend: "nfs"

  # The conditions that must be True for the node to be considered ready
  conditions:
    - type: "csi.example.com/NodePluginRegistered"
      requiredStatus: "True"
    - type: "csi.example.com/BackendReachable"
      requiredStatus: "True"

  # The taint to apply if conditions are NOT met
  taint:
    key: "readiness.k8s.io/vendor.com/nfs-unhealthy"
    effect: "NoSchedule"

  # When to enforce: 'bootstrap-only' (initial setup) or 'continuous' (ongoing health)
  enforcementMode: "continuous"
```
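Once saved, the rule can be applied and its effect verified with standard kubectl commands (the filename and node name below are illustrative, not from the project):

```shell
# Apply the rule, list all rules, and inspect taints on a matching node
kubectl apply -f storage-readiness-rule.yaml
kubectl get nodereadinessrules
kubectl get node worker-1 -o jsonpath='{.spec.taints}'
```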
## Configuring the Rule

### 1. Select Target Nodes

Use the `nodeSelector` to target specific nodes (e.g., GPU nodes).

> **Note**: These labels should be set at node registration (e.g., via the Kubelet `--node-labels` flag). Relying on labels added asynchronously by addons (such as Node Feature Discovery) can create a race condition where the node remains schedulable until the labels appear.
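For a node that is already registered, a label can also be added after the fact (the node name here is illustrative):

```shell
# Add the label that the example rule's nodeSelector matches
kubectl label node worker-1 storage-backend=nfs
```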

### 2. Define Readiness Conditions

The `conditions` list defines the criteria. The controller watches the Node's status for these conditions.

* `type`: The exact string matching the NodeCondition type.
* `requiredStatus`: The required status (`True`, `False`, or `Unknown`).
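For reference, a condition satisfying the first entry of the example rule would appear in the Node object roughly as follows (the `reason` and `message` values are illustrative assumptions, not produced by this project):

```yaml
# Fragment of a Node object's status (illustrative values)
status:
  conditions:
    - type: "csi.example.com/NodePluginRegistered"
      status: "True"
      reason: "PluginRegistered"                                # illustrative
      message: "CSI node plugin registered with the kubelet"    # illustrative
```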

### 3. Choose an Enforcement Mode

The `enforcementMode` determines how the controller manages the taint lifecycle.

* **`bootstrap-only`**: Use this for one-time initialization tasks (e.g., installing a kernel module or driver). Once the conditions have first been satisfied, the taint is removed and never reapplied.
* **`continuous`**: Use this for ongoing health checks (e.g., network connectivity). If a condition fails at any time, the taint is reapplied.

> For more details on these modes, see [Concepts](./concepts.md#enforcement-modes).

### 4. Configure the Taint

Define the taint that will block scheduling.

* **Key**: Must start with the `readiness.k8s.io/` prefix.
* **Effect**:
  * `NoSchedule`: Prevents new pods from scheduling (recommended).
  * `PreferNoSchedule`: Tells the scheduler to avoid the node when possible.
  * `NoExecute`: Evicts running pods that do not tolerate the taint.

> **Note**: To eliminate startup race conditions, register nodes with this taint already in place (e.g., via the Kubelet `--register-with-taints` flag). The controller will remove it once conditions are met.

> **Caution**: When using `NoExecute` with `continuous` mode, a momentary condition failure immediately evicts all workloads on the node that lack matching tolerations, which can cause service disruption.

The admission webhook warns when using a `NoExecute` taint:

```bash
$ kubectl apply -f rule.yaml
Warning: NOTE: NoExecute will evict existing pods without tolerations. Ensure critical system pods have appropriate tolerations
nodereadinessrule.readiness.node.x-k8s.io/my-rule created
```
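One way to register the taint at startup, as the note above suggests, is the kubelet's `registerWithTaints` configuration field; this sketch reuses the taint from the storage example (the equivalent `--register-with-taints` command-line flag also works):

```yaml
# KubeletConfiguration fragment: the node joins the cluster with the
# readiness taint already set, so nothing schedules on it until the
# controller observes the conditions and removes the taint.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
registerWithTaints:
  - key: "readiness.k8s.io/vendor.com/nfs-unhealthy"
    effect: "NoSchedule"
```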

### Rule Validations

#### Avoiding Taint Key Conflicts

@@ -159,13 +117,11 @@ spec:

#### Taint Key Naming

Taint keys must have the `readiness.k8s.io/` prefix to clearly identify readiness-related taints and avoid conflicts with other controllers. Use unique, descriptive taint keys for different readiness checks. Follow [Kubernetes naming conventions](https://kubernetes.io/docs/concepts/overview/working-with-objects/names/).

**Valid:**

@@ -182,31 +138,32 @@ taint:
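As a purely hypothetical illustration (the key name below is invented for this sketch, not taken from the project), a taint carrying the required prefix looks like:

```yaml
# Hypothetical conforming taint: the required "readiness.k8s.io/" prefix,
# followed by a vendor domain and a descriptive check name
taint:
  key: "readiness.k8s.io/example.com/gpu-driver-ready"
  effect: "NoSchedule"
```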
## Testing with Dry Run

You can preview the impact of a rule, without actually tainting any nodes, by setting `dryRun: true`.

```yaml
spec:
  dryRun: true  # Enable dry run mode
  conditions:
    - type: "csi.example.com/NodePluginRegistered"
      requiredStatus: "True"
```

Check the `status` of the rule to follow the results:

```sh
kubectl get nodereadinessrule my-rule -o yaml
```

Look for `dryRunResults` in the output to see which nodes would be tainted.
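To pull just the impact analysis, a `jsonpath` query against the rule's status works as a quick sketch (the rule name `my-rule` matches the example above):

```shell
# Print only the dry-run impact analysis for the rule
kubectl get nodereadinessrule my-rule -o jsonpath='{.status.dryRunResults}'
```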
## Reporting Node Conditions

The Node Readiness Controller only reacts to observed conditions on the Node object. These conditions can be set by various tools:

1. **Node Problem Detector (NPD)**: You can configure NPD with [custom plugins](https://github.com/kubernetes/node-problem-detector/blob/master/docs/custom_plugin_monitor.md) to monitor system state and report conditions.
2. **Custom Health Checkers or Sidecars**: You can run a DaemonSet or a small sidecar (e.g., [Readiness Condition Reporter](../examples/security-agent-readiness.md#1-deploy-the-readiness-condition-reporter)) that checks your application or driver and updates the Node status.
3. **External Controllers**: Any tool that can patch Node status can trigger these rules.

For a full example of setting up a custom condition for a security agent, see the [Security Agent Readiness Example](../examples/security-agent-readiness.md).
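For the external-controller case, any client with access to the Node `status` subresource can report a condition. A minimal sketch using plain kubectl (the node name, condition type, and timestamp values are illustrative):

```shell
# Append a custom condition to a node's status subresource
kubectl patch node worker-1 --subresource=status --type=json -p='[
  {"op": "add", "path": "/status/conditions/-", "value": {
    "type": "csi.example.com/BackendReachable",
    "status": "True",
    "reason": "ProbeSucceeded",
    "message": "NFS backend reachable",
    "lastTransitionTime": "2024-01-01T00:00:00Z",
    "lastHeartbeatTime": "2024-01-01T00:00:00Z"}}]'
```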
