# Security Agent Readiness Guardrail

This guide demonstrates how to use the Node Readiness Controller to prevent workloads from being scheduled on a node until a security agent (for example, [Falco](https://github.com/falcosecurity/falco)) is fully initialized and actively monitoring the node.

## The Problem

In many Kubernetes clusters, security agents are deployed as DaemonSets. When a new node joins the cluster, there is a race condition:
1. A new node joins the cluster and is marked `Ready` by the kubelet.
2. The scheduler sees the node as `Ready` and considers it eligible for workloads.
3. However, the security agent on that node may still be starting or initializing.

**Result**: Application workloads may start running before the node is security compliant, creating a blind spot where runtime threats, policy violations, or anomalous behavior may go undetected.

## The Solution

We can use the Node Readiness Controller to enforce a security readiness guardrail:
1. **Taint** the node with a [startup taint](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/) `readiness.k8s.io/falco.org/security-agent-ready=pending:NoSchedule` as soon as it joins the cluster.
2. **Monitor** the security agent's readiness using a sidecar and expose it as a Node Condition.
3. **Untaint** the node only after the security agent reports that it is ready.
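
How the startup taint gets onto the node at join time depends on your provisioning setup. If the controller does not apply the taint itself in your environment, one option is the kubelet's `registerWithTaints` configuration field, which taints the node at registration. A sketch (how you distribute this configuration to nodes depends on your bootstrap tooling):

```yaml
# Sketch: apply the startup taint at node registration time.
# Merge into the node's KubeletConfiguration via your provisioning tooling.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
registerWithTaints:
- key: "readiness.k8s.io/falco.org/security-agent-ready"
  value: "pending"
  effect: "NoSchedule"
```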

## Step-by-Step Guide (Falco Example)

This example uses **Falco** as a representative security agent, but the same pattern applies to any node-level security or monitoring agent.

> **Note**: All manifests referenced in this guide are available in the [`examples/security-agent-readiness`](https://github.com/kubernetes-sigs/node-readiness-controller/tree/main/examples/security-agent-readiness) directory.

### 1. Deploy the Readiness Condition Reporter

To bridge the security agent's internal health signal to Kubernetes, we deploy a readiness reporter that updates a Node Condition. In this example, the reporter runs as a sidecar container in the Falco DaemonSet. Components that natively update Node conditions do not require this additional container.

This sidecar periodically checks Falco's local health endpoint (`http://localhost:8765/healthz`) and updates the Node Condition `falco.org/FalcoReady`.

**Patch your Falco DaemonSet:**

```yaml
# security-agent-reporter-sidecar.yaml
- name: security-status-patcher
  image: registry.k8s.io/node-readiness-controller/node-readiness-reporter:v0.1.1
  imagePullPolicy: IfNotPresent
  env:
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
  - name: CHECK_ENDPOINT
    value: "http://localhost:8765/healthz" # Point at your security agent's health endpoint
  - name: CONDITION_TYPE
    value: "falco.org/FalcoReady" # Match your agent's condition type
  - name: CHECK_INTERVAL
    value: "5s"
  resources:
    limits:
      cpu: "10m"
      memory: "32Mi"
    requests:
      cpu: "10m"
      memory: "32Mi"
```

> **Note**: In this example, the security agent's health is monitored by a sidecar, so the reporter's lifecycle is tied to the Falco pod's lifecycle. If the Falco pod is crashlooping, the sidecar will not run and cannot report readiness. For robust `continuous` readiness reporting, deploy the reporter independently of the security agent pod: for example, a separate DaemonSet (similar to Node Problem Detector) can monitor the agent and update Node conditions even if the agent pod crashes.
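
The standalone-reporter alternative could look roughly like the sketch below. The DaemonSet name and labels are illustrative, and it assumes the agent's health endpoint is reachable from the host network namespace (for example, because the agent uses `hostNetwork` or a `hostPort`):

```yaml
# Sketch: run the reporter as its own DaemonSet so readiness reporting
# survives crashes of the agent pod. Names here are illustrative.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: falco-readiness-reporter
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: falco-readiness-reporter
  template:
    metadata:
      labels:
        app: falco-readiness-reporter
    spec:
      serviceAccountName: falco   # reuses the ServiceAccount bound in step 2
      hostNetwork: true           # reach the agent's health endpoint on the node
      containers:
      - name: reporter
        image: registry.k8s.io/node-readiness-controller/node-readiness-reporter:v0.1.1
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: CHECK_ENDPOINT
          value: "http://localhost:8765/healthz"
        - name: CONDITION_TYPE
          value: "falco.org/FalcoReady"
        - name: CHECK_INTERVAL
          value: "5s"
```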

### 2. Grant Permissions (RBAC)

The readiness reporter sidecar needs permission to update the Node object's status in order to publish readiness information.

```yaml
# security-agent-node-status-rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-status-patch-role
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get"]
- apiGroups: [""]
  resources: ["nodes/status"]
  verbs: ["patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: security-agent-node-status-patch-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: node-status-patch-role
subjects:
# Bind to security agent's ServiceAccount
- kind: ServiceAccount
  name: falco
  namespace: kube-system
```
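
After applying the RBAC manifests, you can sanity-check the binding with `kubectl auth can-i` (assuming the agent's ServiceAccount is `falco` in `kube-system`, as in the binding above):

```shell
# Should print "yes" once the ClusterRoleBinding is in place
kubectl auth can-i patch nodes/status \
  --as=system:serviceaccount:kube-system:falco
```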

### 3. Create the Node Readiness Rule

Next, define a NodeReadinessRule that enforces the security readiness requirement. This rule instructs the controller: *"Keep the `readiness.k8s.io/falco.org/security-agent-ready` taint on the node until the `falco.org/FalcoReady` condition becomes True."*

```yaml
# security-agent-readiness-rule.yaml
apiVersion: readiness.node.x-k8s.io/v1alpha1
kind: NodeReadinessRule
metadata:
  name: security-agent-readiness-rule
spec:
  # Conditions that must be satisfied before the taint is removed
  conditions:
  - type: "falco.org/FalcoReady"
    requiredStatus: "True"

  # Taint managed by this rule
  taint:
    key: "readiness.k8s.io/falco.org/security-agent-ready"
    effect: "NoSchedule"
    value: "pending"

  # "bootstrap-only" means: once the security agent is ready, we stop enforcing.
  # Use "continuous" mode to re-taint the node if the security agent crashes later.
  enforcementMode: "bootstrap-only"

  # Update to target only the nodes that need to be protected by this guardrail
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
```

## How to Apply

1. **Create the Node Readiness Rule**:
```sh
cd examples/security-agent-readiness
kubectl apply -f security-agent-readiness-rule.yaml
```

2. **Install Falco and Apply the RBAC**:
```sh
chmod +x apply-falco.sh
./apply-falco.sh
```

## Verification

To verify that the guardrail is working, add a new node to the cluster.

1. **Check the Node Taints**:
Immediately after the node joins, it should have the taint
`readiness.k8s.io/falco.org/security-agent-ready=pending:NoSchedule`.

2. **Check Node Conditions**:
Observe the node's conditions. Initially, `falco.org/FalcoReady` will be `False` or missing. Once Falco initializes, the sidecar reporter updates the condition to `True`.

3. **Check Taint Removal**:
As soon as the condition becomes `True`, the Node Readiness Controller removes the taint, allowing workloads to be scheduled on the node.
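
Putting it together, the relevant parts of the node object during bootstrap look roughly like the excerpt below (an illustrative sketch, not exact controller output). Once the condition flips to `True`, the taint disappears from `spec.taints`:

```yaml
# Illustrative excerpt of a freshly joined node, before Falco is ready
spec:
  taints:
  - key: readiness.k8s.io/falco.org/security-agent-ready
    value: pending
    effect: NoSchedule
status:
  conditions:
  - type: falco.org/FalcoReady
    status: "False"   # flips to "True" once the sidecar sees a healthy agent
```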