Commit 16dac86

docs: add example for security agent readiness (#101)
1 parent dae1ba2 commit 16dac86

7 files changed, +302 -1 lines changed

docs/book/src/SUMMARY.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@

 - [CNI Installation](./examples/cni-readiness.md)
 <!-- - [Storage Drivers](./examples/storage-readiness.md) -->
-<!-- - [Security Agent](./examples/security-agent-readiness.md) -->
+- [Security Agent](./examples/security-agent-readiness.md)
 <!-- - [Device Drivers](./examples/dra-readiness.md) -->

 # Releases
Lines changed: 155 additions & 0 deletions
@@ -0,0 +1,155 @@
# Security Agent Readiness Guardrail

This guide demonstrates how to use the Node Readiness Controller to prevent workloads from being scheduled on a node until a security agent (for example, [Falco](https://github.com/falcosecurity/falco)) is fully initialized and actively monitoring the node.

## The Problem

In many Kubernetes clusters, security agents are deployed as DaemonSets. When a new node joins the cluster, there is a race condition:
1. A new node joins the cluster and is marked `Ready` by the kubelet.
2. The scheduler sees the node as `Ready` and considers the node eligible for workloads.
3. However, the security agent on that node may still be starting or initializing.

**Result**: Application workloads may start running before the node is security-compliant, creating a blind spot where runtime threats, policy violations, or anomalous behavior may go undetected.

## The Solution

We can use the Node Readiness Controller to enforce a security readiness guardrail:
1. **Taint** the node with a [startup taint](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/) `readiness.k8s.io/falco.org/security-agent-ready=pending:NoSchedule` as soon as it joins the cluster.
2. **Monitor** the security agent’s readiness using a sidecar and expose it as a Node Condition.
3. **Untaint** the node only after the security agent reports that it is ready.

## Step-by-Step Guide (Falco Example)

This example uses **Falco** as a representative security agent, but the same pattern applies to any node-level security or monitoring agent.

> **Note**: All manifests referenced in this guide are available in the [`examples/security-agent-readiness`](https://github.com/kubernetes-sigs/node-readiness-controller/tree/main/examples/security-agent-readiness) directory.



### 1. Deploy the Readiness Condition Reporter

To bridge the security agent’s internal health signal to Kubernetes, we deploy a readiness reporter that updates a Node Condition. In this example, the reporter is deployed as a sidecar container in the Falco DaemonSet. Components that natively update Node conditions would not require this additional container.

This sidecar periodically checks Falco's local health endpoint (`http://localhost:8765/healthz`) and updates a Node Condition `falco.org/FalcoReady`.

**Patch your Falco DaemonSet:**

```yaml
# security-agent-reporter-sidecar.yaml
- name: security-status-patcher
  image: registry.k8s.io/node-readiness-controller/node-readiness-reporter:v0.1.1
  imagePullPolicy: IfNotPresent
  env:
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
    - name: CHECK_ENDPOINT
      value: "http://localhost:8765/healthz" # Update to the right security agent endpoint
    - name: CONDITION_TYPE
      value: "falco.org/FalcoReady" # Update to the right condition
    - name: CHECK_INTERVAL
      value: "5s"
  resources:
    limits:
      cpu: "10m"
      memory: "32Mi"
    requests:
      cpu: "10m"
      memory: "32Mi"
```

> Note: In this example, the security agent’s health is monitored by a sidecar, so the reporter’s lifecycle is tied to the pod’s lifecycle. If the Falco pod is crashlooping, the sidecar will not run and cannot report readiness. For robust readiness reporting in `continuous` mode, deploy the reporter independently of the security agent pod. For example, a separate DaemonSet (similar to Node Problem Detector) can monitor the agent and update Node conditions even if the agent pod crashes.
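
The reporter's job (poll a health endpoint, then publish the result as a Node Condition) can be sketched in shell. This is only an illustration under stated assumptions, not the actual `node-readiness-reporter` implementation: the function names, the `reason` and `message` values, and the `CHECK_INTERVAL_SECONDS` variable are hypothetical. It assumes `curl` is available and that `kubectl` is recent enough to support `--subresource` (v1.24 or later).

```sh
#!/bin/sh
# Hypothetical sketch of a readiness reporter loop; not the real reporter's code.

# Map an HTTP status code from the health endpoint to a Node Condition status.
condition_status() {
  if [ "$1" = "200" ]; then
    echo "True"
  else
    echo "False"
  fi
}

# Build a strategic-merge patch that sets one entry in .status.conditions.
# $1 = condition type, $2 = condition status
build_condition_patch() {
  printf '{"status":{"conditions":[{"type":"%s","status":"%s","reason":"HealthCheck","message":"reported by sketch"}]}}' "$1" "$2"
}

# Poll the endpoint and publish the result. Nothing runs until this is called.
report_loop() {
  while true; do
    # curl prints 000 when the connection fails, which maps to "False".
    code=$(curl -s -o /dev/null -w '%{http_code}' "$CHECK_ENDPOINT" || true)
    patch=$(build_condition_patch "$CONDITION_TYPE" "$(condition_status "$code")")
    kubectl patch node "$NODE_NAME" --subresource=status --type=strategic -p "$patch"
    sleep "${CHECK_INTERVAL_SECONDS:-5}"
  done
}
```

With `NODE_NAME`, `CHECK_ENDPOINT`, and `CONDITION_TYPE` set as in the sidecar manifest, calling `report_loop` would start the polling. The patch targets `nodes/status`, which is why the reporter needs the permissions granted in the next step.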

### 2. Grant Permissions (RBAC)

The readiness reporter sidecar needs permission to update the Node object's status to publish readiness information.

```yaml
# security-agent-node-status-rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-status-patch-role
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get"]
  - apiGroups: [""]
    resources: ["nodes/status"]
    verbs: ["patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: security-agent-node-status-patch-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: node-status-patch-role
subjects:
  # Bind to security agent's ServiceAccount
  - kind: ServiceAccount
    name: falco
    namespace: kube-system
```
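
One way to confirm that the binding took effect is to ask the API server on behalf of the ServiceAccount. The sketch below uses `kubectl auth can-i` with impersonation; the helper names are hypothetical, and it assumes your own user is allowed to impersonate ServiceAccounts. The ServiceAccount name and namespace must match your Falco installation, which may differ from the manifest above.

```sh
#!/bin/sh
# Hypothetical helpers for checking the reporter's permissions.

# Build the impersonation subject for a ServiceAccount.
# $1 = namespace, $2 = ServiceAccount name
sa_subject() {
  echo "system:serviceaccount:$1:$2"
}

# Ask whether the ServiceAccount may perform a verb on nodes/status.
# kubectl prints "yes" or "no" and exits non-zero for "no".
# $1 = namespace, $2 = ServiceAccount name, $3 = verb (patch or update)
can_update_node_status() {
  kubectl auth can-i "$3" nodes --subresource=status --as="$(sa_subject "$1" "$2")"
}
```

For the manifest above, `can_update_node_status kube-system falco patch` should print `yes` once the ClusterRoleBinding is applied.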

### 3. Create the Node Readiness Rule

Next, define a NodeReadinessRule that enforces the security readiness requirement. This rule instructs the controller: *"Keep the `readiness.k8s.io/falco.org/security-agent-ready` taint on the node until the `falco.org/FalcoReady` condition becomes True."*

```yaml
# security-agent-readiness-rule.yaml
apiVersion: readiness.node.x-k8s.io/v1alpha1
kind: NodeReadinessRule
metadata:
  name: security-agent-readiness-rule
spec:
  # Conditions that must be satisfied before the taint is removed
  conditions:
    - type: "falco.org/FalcoReady"
      requiredStatus: "True"

  # Taint managed by this rule
  taint:
    key: "readiness.k8s.io/falco.org/security-agent-ready"
    effect: "NoSchedule"
    value: "pending"

  # "bootstrap-only" means: once the security agent is ready, we stop enforcing.
  # Use "continuous" mode if you want to re-taint the node when the security agent crashes later.
  enforcementMode: "bootstrap-only"

  # Update to target only the nodes that need to be protected by this guardrail
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
```
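
The difference between the two enforcement modes can be illustrated with a small decision helper. This is one reading of the comments in the rule above, not the controller's actual logic; the function name and its third argument are hypothetical.

```sh
#!/bin/sh
# Hypothetical decision helper illustrating the two enforcement modes.
# $1 = enforcementMode ("bootstrap-only" or "continuous")
# $2 = current status of the required condition ("True", "False", "Unknown", or "" if missing)
# $3 = "yes" if the condition has already been "True" once on this node, else "no"
# Prints "tainted" or "untainted".
desired_taint_state() {
  mode=$1
  status=$2
  bootstrapped=$3
  if [ "$status" = "True" ]; then
    echo "untainted"
  elif [ "$mode" = "bootstrap-only" ] && [ "$bootstrapped" = "yes" ]; then
    # Enforcement stopped after the first success, so a later failure does not re-taint.
    echo "untainted"
  else
    echo "tainted"
  fi
}
```

Under this reading, `desired_taint_state bootstrap-only False yes` yields `untainted`, while `desired_taint_state continuous False yes` yields `tainted`: only `continuous` mode reacts to an agent that fails after it was first ready.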

## How to Apply

1. **Create the Node Readiness Rule**:
   ```sh
   cd examples/security-agent-readiness
   kubectl apply -f security-agent-readiness-rule.yaml
   ```

2. **Install Falco and Apply the RBAC**:
   ```sh
   chmod +x apply-falco.sh
   ./apply-falco.sh
   ```

## Verification

To verify that the guardrail is working, add a new node to the cluster.

1. **Check the Node Taints**:
   Immediately after the node joins, it should have the taint:
   `readiness.k8s.io/falco.org/security-agent-ready=pending:NoSchedule`.

2. **Check Node Conditions**:
   Observe the node’s conditions. You will initially see `falco.org/FalcoReady` as `False` or missing. Once Falco initializes, the sidecar reporter updates the condition to `True`.

3. **Check Taint Removal**:
   As soon as the condition becomes `True`, the Node Readiness Controller removes the taint, allowing workloads to be scheduled on the node.
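
The checks above can be scripted with `kubectl` and JSONPath output. The helpers below are a minimal sketch with hypothetical function names; the taint key and condition type are the ones used in this guide.

```sh
#!/bin/sh
TAINT_KEY="readiness.k8s.io/falco.org/security-agent-ready"
CONDITION_TYPE="falco.org/FalcoReady"

# Print each taint on the node as key=value:effect, one per line.
# $1 = node name
list_taints() {
  kubectl get node "$1" -o jsonpath='{range .spec.taints[*]}{.key}={.value}:{.effect}{"\n"}{end}'
}

# Read taint lines on stdin; succeed if the guardrail taint is among them.
has_guardrail_taint() {
  grep -q -F -x "${TAINT_KEY}=pending:NoSchedule"
}

# Print the status of the readiness condition (empty if the condition is missing).
# $1 = node name
condition_status_of() {
  kubectl get node "$1" -o jsonpath="{.status.conditions[?(@.type==\"${CONDITION_TYPE}\")].status}"
}
```

For example, `list_taints <node-name> | has_guardrail_taint && echo "still tainted"` reports whether the guardrail is still in place, and `condition_status_of <node-name>` prints the condition's current status.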
Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
#!/bin/bash

# Copyright The Kubernetes Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

set -e

KUBECTL_ARGS="$@"

YQ_VERSION="v4.48.1"
YQ_PATH="/tmp/yq"

# Check if yq is present at YQ_PATH; if not, download it.
if [ ! -f "$YQ_PATH" ]; then
    echo "yq not found at $YQ_PATH, downloading..."
    OS=$(uname -s | tr '[:upper:]' '[:lower:]')
    ARCH=$(uname -m)
    case $ARCH in
        x86_64)
            ARCH="amd64"
            ;;
        aarch64|arm64)
            ARCH="arm64"
            ;;
        *)
            echo "Unsupported architecture: $ARCH"
            exit 1
            ;;
    esac
    YQ_BINARY="yq_${OS}_${ARCH}"
    curl -fsSL "https://github.com/mikefarah/yq/releases/download/${YQ_VERSION}/${YQ_BINARY}" -o "$YQ_PATH"
    chmod +x "$YQ_PATH"
fi

# Add the Falco Helm repository
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update

# Generate the Falco manifest
helm template falco falcosecurity/falco --namespace falco --set tty=true > falco.yaml

# Add the security-status-patcher sidecar
"$YQ_PATH" e -i \
  'select(.kind == "DaemonSet" and .metadata.name == "falco")
  .spec.template.spec.containers +=
  [load("hack/test-workloads/security-patcher-sidecar.yaml")]' falco.yaml

# Apply the manifest twice. The first time, it will create the CRDs and ServiceAccounts.
# The second time, it will create the rest of the resources, which should now be able to find the ServiceAccount.
kubectl apply $KUBECTL_ARGS -f falco.yaml || true
kubectl apply $KUBECTL_ARGS -f falco.yaml

# Apply the RBAC rules
kubectl apply $KUBECTL_ARGS -f ./falco-rbac-node-status-patch-role.yaml
Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-status-patch-role
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get"]
  - apiGroups: [""]
    resources: ["nodes/status"]
    verbs: ["patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: falco-node-status-patch-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: node-status-patch-role
subjects:
  # Bind to security agent's ServiceAccount
  - kind: ServiceAccount
    name: falco-node
    namespace: kube-system
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
name: security-status-patcher
image: registry.k8s.io/node-readiness-controller/node-readiness-reporter:v0.1.1
imagePullPolicy: IfNotPresent
env:
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
  - name: CHECK_ENDPOINT
    value: "http://localhost:8765/healthz"
  - name: CONDITION_TYPE
    value: "falco.org/FalcoReady"
  - name: CHECK_INTERVAL
    value: "5s"
resources:
  limits:
    cpu: "10m"
    memory: "32Mi"
  requests:
    cpu: "10m"
    memory: "32Mi"
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
apiVersion: readiness.node.x-k8s.io/v1alpha1
kind: NodeReadinessRule
metadata:
  name: security-agent-readiness-rule
spec:
  dryRun: true
  conditions:
    - type: "falco.org/FalcoReady"
      requiredStatus: "True"
  taint:
    key: "readiness.k8s.io/falco.org/security-agent-ready"
    effect: "NoSchedule"
    value: "pending"
  enforcementMode: "bootstrap-only"
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: DoesNotExist
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
apiVersion: readiness.node.x-k8s.io/v1alpha1
kind: NodeReadinessRule
metadata:
  name: security-agent-readiness-rule
spec:
  conditions:
    - type: "falco.org/FalcoReady"
      requiredStatus: "True"
  taint:
    key: "readiness.k8s.io/falco.org/security-agent-ready"
    effect: "NoSchedule"
    value: "pending"
  enforcementMode: "continuous"
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: DoesNotExist
