
Commit 243aa89

update cni readiness example to use daemonset for reporter (#116)

1 parent 13a3729 · commit 243aa89

File tree

9 files changed: +109 −91 lines

docs/TEST_README.md

Lines changed: 2 additions & 1 deletion

````diff
@@ -88,7 +88,8 @@ hack/test-workloads/apply-calico.sh
 1. **Check for the new node condition on the application node:**
    ```bash
-   # Look for 'network.k8s.io/CalicoReady True'
+   kubectl get node $NODE -o json | jq '.status.conditions[] | select(.type=="projectcalico.org/CalicoReady")'
+   # Look for 'projectcalico.org/CalicoReady True'
    kubectl get node nrr-test-worker2 -o jsonpath='Conditions:{"\n"}{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}'
    ```
````

docs/book/src/examples/cni-readiness.md

Lines changed: 36 additions & 37 deletions

````diff
@@ -9,7 +9,7 @@ This guide demonstrates how to use the Node Readiness Controller to prevent pods
 
 The high-level steps are:
 1. Node is bootstrapped with a [startup taint](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/) `readiness.k8s.io/NetworkReady=pending:NoSchedule` immediately upon joining.
-2. A sidecar is patched to the cni-agent to monitor the CNI's health and report it to the API server as node-condition (`network.k8s.io/CalicoReady`).
+2. A reporter DaemonSet is deployed to monitor the CNI's health and report it to the API server as node-condition (`projectcalico.org/CalicoReady`).
 3. Node Readiness Controller will untaint the node only when the CNI reports it is ready.
 
 ## Step-by-Step Guide
@@ -20,43 +20,42 @@ This example uses **Calico**, but the pattern applies to any CNI.
 
 ### 1. Deploy the Readiness Condition Reporter
 
-We need to bridge Calico's internal health status to a Kubernetes Node Condition. We will add a **sidecar container** to the Calico DaemonSet.
+We need to bridge Calico's internal health status to a Kubernetes Node Condition. We will deploy a **reporter DaemonSet** that runs on every node.
 
-This sidecar checks Calico's local health endpoint (`http://localhost:9099/readiness`) and updates a node condition `network.k8s.io/CalicoReady`.
+This reporter checks Calico's local health endpoint (`http://localhost:9099/readiness`) and updates a node condition `projectcalico.org/CalicoReady`.
 
-**Patch your Calico DaemonSet:**
+Using a separate DaemonSet instead of a sidecar ensures that readiness reporting works even if the CNI pod is crashlooping or failing to start containers.
+
+**Deploy the Reporter DaemonSet:**
 
 ```yaml
-# cni-patcher-sidecar.yaml
-- name: cni-status-patcher
-  image: registry.k8s.io/node-readiness-controller/node-readiness-reporter:v0.1.1
-  imagePullPolicy: IfNotPresent
-  env:
-  - name: NODE_NAME
-    valueFrom:
-      fieldRef:
-        fieldPath: spec.nodeName
-  - name: CHECK_ENDPOINT
-    value: "http://localhost:9099/readiness" # update to your CNI health endpoint
-  - name: CONDITION_TYPE
-    value: "network.k8s.io/CalicoReady" # update this node condition
-  - name: CHECK_INTERVAL
-    value: "15s"
-  resources:
-    limits:
-      cpu: "10m"
-      memory: "32Mi"
-    requests:
-      cpu: "10m"
-      memory: "32Mi"
+# cni-reporter-ds.yaml
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: cni-reporter
+  namespace: kube-system
+spec:
+  # ...
+  template:
+    spec:
+      hostNetwork: true
+      serviceAccountName: cni-reporter
+      tolerations:
+      - operator: Exists
+      containers:
+      - name: cni-status-patcher
+        image: registry.k8s.io/node-readiness-controller/node-readiness-reporter:v0.1.1
+        env:
+        - name: CHECK_ENDPOINT
+          value: "http://localhost:9099/readiness"
+        - name: CONDITION_TYPE
+          value: "projectcalico.org/CalicoReady"
 ```
 
-> Note: In this example, the CNI pod health is monitored by a side-car, so watcher's lifecycle is same as the pod lifecycle.
-If the Calico pod is crashlooping, the sidecar will not run and cannot report readiness. For robust 'continuous' readiness reporting, the watcher should be 'external' to the pod.
-
 ### 2. Grant Permissions (RBAC)
 
-The sidecar needs permission to update the Node object's status.
+The reporter needs permission to update the Node object's status.
 
 ```yaml
 # calico-rbac-node-status-patch-role.yaml
@@ -78,15 +77,15 @@ roleRef:
   kind: ClusterRole
   name: node-status-patch-role
 subjects:
-# Bind to CNI's ServiceAccount
+# Bind to CNI Reporter's ServiceAccount
 - kind: ServiceAccount
-  name: calico-node
+  name: cni-reporter
   namespace: kube-system
 ```
 
 ### 3. Create the Node Readiness Rule
 
-Now define the rule that enforces the requirement. This tells the controller: *"Keep the `readiness.k8s.io/NetworkReady` taint on the node until `network.k8s.io/CalicoReady` is True."*
+Now define the rule that enforces the requirement. This tells the controller: *"Keep the `readiness.k8s.io/NetworkReady` taint on the node until `projectcalico.org/CalicoReady` is True."*
 
 ```yaml
 # network-readiness-rule.yaml
@@ -97,7 +96,7 @@ metadata:
 spec:
   # The condition(s) to monitor
   conditions:
-  - type: "network.k8s.io/CalicoReady"
+  - type: "projectcalico.org/CalicoReady"
     requiredStatus: "True"
 
   # The taint to manage
@@ -139,8 +138,8 @@ To test this, add a new node to the cluster.
    `readiness.k8s.io/NetworkReady=pending:NoSchedule`.
 
 2. **Check Node Conditions**:
-   Watch the node conditions. You will initially see `network.k8s.io/CalicoReady` as `False` or missing.
-   Once Calico starts, the sidecar will update it to `True`.
+   Watch the node conditions. You will initially see `projectcalico.org/CalicoReady` as `False` or missing.
+   Once Calico starts, the reporter will update it to `True`.
 
 3. **Check Taint Removal**:
-   As soon as the condition becomes `True`, the Node Readiness Controller will remove the taint, and workloads will be scheduled.
+   As soon as the condition becomes `True`, the Node Readiness Controller will remove the taint, and workloads will be scheduled.
````
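The commit's documented flow is: probe the CNI's local health endpoint, then patch the matching node condition. The `node-readiness-reporter` image is prebuilt and its source is not part of this commit, but conceptually each iteration does something like the following sketch (`build_condition_patch` and `report_once` are hypothetical names, not the real implementation):

```shell
# Hypothetical sketch of one reporter iteration: map CNI health-endpoint
# reachability to a node condition. Illustrative only.
CHECK_ENDPOINT="${CHECK_ENDPOINT:-http://localhost:9099/readiness}"
CONDITION_TYPE="${CONDITION_TYPE:-projectcalico.org/CalicoReady}"

# Build a strategic-merge patch body for a single node condition.
build_condition_patch() {
  status="$1"
  now=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  printf '{"status":{"conditions":[{"type":"%s","status":"%s","reason":"CNIHealthCheck","lastHeartbeatTime":"%s","lastTransitionTime":"%s"}]}}' \
    "$CONDITION_TYPE" "$status" "$now" "$now"
}

# One iteration: HTTP success maps to condition True, anything else to False.
report_once() {
  if curl -fsS --max-time 2 "$CHECK_ENDPOINT" >/dev/null 2>&1; then
    status="True"
  else
    status="False"
  fi
  kubectl patch node "$NODE_NAME" --subresource=status --type=strategic \
    -p "$(build_condition_patch "$status")"
}
```

Node status conditions merge by `type` under a strategic-merge patch, so repeated patches update the same condition in place rather than appending duplicates.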

examples/cni-readiness/README.md

Lines changed: 12 additions & 0 deletions

```diff
@@ -0,0 +1,12 @@
+# CNI Readiness Example (Calico)
+
+This example demonstrates how to use the Node Readiness Controller to ensure nodes are only marked ready for workloads after the CNI (Calico) has fully initialized.
+
+### How it works:
+1. Nodes join with a `readiness.k8s.io/NetworkReady=pending:NoSchedule` taint.
+2. A lightweight DaemonSet (`cni-reporter-ds.yaml`)
+monitors Calico's health endpoint (`localhost:9099/readiness`) and updates a
+node condition `projectcalico.org/CalicoReady`.
+3. The `NodeReadinessRule` (`network-readiness-rule.yaml`) instructs the controller to remove the startup taint once the `projectcalico.org/CalicoReady` condition becomes `True`.
+4. The reporter is deployed with `hostNetwork: true` to reach Calico's local health endpoint.
+5. The reporter needs a dedicated ServiceAccount (`cni-reporter`) with permissions to patch node status.
```

examples/cni-readiness/apply-calico.sh

Lines changed: 4 additions & 29 deletions

```diff
@@ -18,41 +18,16 @@ set -e
 
 KUBECTL_ARGS="$@"
 
-YQ_VERSION="v4.48.1"
-YQ_PATH="/tmp/yq"
-
-# Check if yq is installed, if not download it.
-if [ ! -f "$YQ_PATH" ]; then
-  echo "yq not found at $YQ_PATH, downloading..."
-  OS=$(uname -s | tr '[:upper:]' '[:lower:]')
-  ARCH=$(uname -m)
-  case $ARCH in
-    x86_64)
-      ARCH="amd64"
-      ;;
-    aarch64|arm64)
-      ARCH="arm64"
-      ;;
-    *)
-      echo "Unsupported architecture: $ARCH"
-      exit 1
-      ;;
-  esac
-  YQ_BINARY="yq_${OS}_${ARCH}"
-  curl -sL "https://github.com/mikefarah/yq/releases/download/${YQ_VERSION}/${YQ_BINARY}" -o "$YQ_PATH"
-  chmod +x "$YQ_PATH"
-fi
-
 # Download the Calico manifest
 curl -sL https://raw.githubusercontent.com/projectcalico/calico/v3.30.1/manifests/calico.yaml -o calico.yaml
 
-# Add the cni-status-patcher sidecar
-"$YQ_PATH" e -i 'select(.kind == "DaemonSet" and .metadata.name == "calico-node").spec.template.spec.containers += [load("hack/test-workloads/cni-patcher-sidecar.yaml")]' calico.yaml
-
 # Apply the manifest twice. The first time, it will create the CRDs and ServiceAccounts.
 # The second time, it will create the rest of the resources, which should now be able to find the ServiceAccount.
 kubectl apply $KUBECTL_ARGS -f calico.yaml || true
 kubectl apply $KUBECTL_ARGS -f calico.yaml
 
+# Apply the CNI readiness reporter DaemonSet
+kubectl apply $KUBECTL_ARGS -f ./cni-reporter-ds.yaml
+
 # Apply the RBAC rules
-kubectl apply $KUBECTL_ARGS -f ./calico-rbac-node-status-patch-role.yaml
+kubectl apply $KUBECTL_ARGS -f ./calico-rbac-node-status-patch-role.yaml
```
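The "apply twice" trick in the script works because the first `kubectl apply` creates the CRDs and ServiceAccounts (and is allowed to fail), so the second pass succeeds once they exist. The general form of this pattern is a small retry loop; a sketch (the `retry` helper is hypothetical, not part of the script, which hard-codes two passes):

```shell
# Generic retry helper: run a command up to N times until it succeeds.
# Sketch only; apply-calico.sh above uses two fixed passes instead.
retry() {
  attempts="$1"; shift
  i=1
  while ! "$@"; do
    if [ "$i" -ge "$attempts" ]; then
      return 1
    fi
    i=$((i + 1))
    sleep 1
  done
}

# Hypothetical usage:
# retry 2 kubectl apply $KUBECTL_ARGS -f calico.yaml
```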

examples/cni-readiness/calico-rbac-node-status-patch-role.yaml

Lines changed: 5 additions & 1 deletion

```diff
@@ -3,6 +3,9 @@ kind: ClusterRole
 metadata:
   name: node-status-patch-role
 rules:
+- apiGroups: [""]
+  resources: ["nodes"]
+  verbs: ["get"]
 - apiGroups: [""]
   resources: ["nodes/status"]
   verbs: ["patch", "update"]
@@ -16,6 +19,7 @@ roleRef:
   kind: ClusterRole
   name: node-status-patch-role
 subjects:
+# Bind to CNI Reporter's ServiceAccount
 - kind: ServiceAccount
-  name: calico-node
+  name: cni-reporter
   namespace: kube-system
```

examples/cni-readiness/cni-patcher-sidecar.yaml

Lines changed: 0 additions & 21 deletions
This file was deleted.

examples/cni-readiness/cni-reporter-ds.yaml

Lines changed: 48 additions & 0 deletions

```diff
@@ -0,0 +1,48 @@
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: cni-reporter
+  namespace: kube-system
+---
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: cni-reporter
+  namespace: kube-system
+  labels:
+    app: cni-reporter
+spec:
+  selector:
+    matchLabels:
+      app: cni-reporter
+  template:
+    metadata:
+      labels:
+        app: cni-reporter
+    spec:
+      hostNetwork: true
+      serviceAccountName: cni-reporter
+      tolerations:
+      - operator: Exists
+      containers:
+      - name: cni-status-patcher
+        image: registry.k8s.io/node-readiness-controller/node-readiness-reporter:v0.1.1
+        imagePullPolicy: IfNotPresent
+        env:
+        - name: NODE_NAME
+          valueFrom:
+            fieldRef:
+              fieldPath: spec.nodeName
+        - name: CHECK_ENDPOINT
+          value: "http://localhost:9099/readiness"
+        - name: CONDITION_TYPE
+          value: "projectcalico.org/CalicoReady"
+        - name: CHECK_INTERVAL
+          value: "5s"
+        resources:
+          limits:
+            cpu: "10m"
+            memory: "32Mi"
+          requests:
+            cpu: "10m"
+            memory: "32Mi"
```
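Note the blanket toleration in the DaemonSet's pod spec: without it, the reporter itself would be blocked by the very `readiness.k8s.io/NetworkReady=pending:NoSchedule` startup taint it exists to clear, and the node could never become ready. The relevant fragment from the manifest above:

```yaml
# Tolerate all taints, including readiness.k8s.io/NetworkReady=pending:NoSchedule,
# so the reporter can be scheduled onto a node that is still tainted as not-ready.
tolerations:
- operator: Exists
```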

examples/cni-readiness/network-readiness-dryrun-rule.yaml

Lines changed: 1 addition & 1 deletion

```diff
@@ -5,7 +5,7 @@ metadata:
 spec:
   dryRun: true
   conditions:
-  - type: "network.k8s.io/CalicoReady"
+  - type: "projectcalico.org/CalicoReady"
     requiredStatus: "True"
   taint:
     key: "readiness.k8s.io/NetworkReady"
```

examples/cni-readiness/network-readiness-rule.yaml

Lines changed: 1 addition & 1 deletion

```diff
@@ -4,7 +4,7 @@ metadata:
 name: network-readiness-rule
 spec:
   conditions:
-  - type: "network.k8s.io/CalicoReady"
+  - type: "projectcalico.org/CalicoReady"
     requiredStatus: "True"
   taint:
     key: "readiness.k8s.io/NetworkReady"
```
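The rule's semantics can be checked by hand: the controller keeps the taint until every listed condition reaches its `requiredStatus`. A rough jq-based equivalent of that predicate (illustrative only; `condition_met` is a hypothetical helper, not part of the controller):

```shell
# Evaluate whether a node's condition matches the required status, i.e. the
# same predicate the NodeReadinessRule expresses. Sketch only; requires jq.
condition_met() {
  node_json="$1"; ctype="$2"; required="$3"
  echo "$node_json" | jq -e --arg t "$ctype" --arg s "$required" \
    '[.status.conditions[]? | select(.type == $t) | .status == $s] | any' \
    >/dev/null
}

# Hypothetical usage against a live cluster:
# if condition_met "$(kubectl get node "$NODE" -o json)" \
#     "projectcalico.org/CalicoReady" "True"; then
#   echo "controller would remove the readiness.k8s.io/NetworkReady taint"
# fi
```

A missing condition counts as not met, matching the documented behavior that the condition may initially be `False` or absent.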
