openfaas
diff --git a/‎_posts/2026-03-10-diagnose-openfaas-clusters.md‎
Lines changed: 220 additions & 0 deletions b/‎_posts/2026-03-10-diagnose-openfaas-clusters.md‎
diff --git a/‎images/2026-03-diag/e2e_flow.png‎
66.9 KB b/‎images/2026-03-diag/e2e_flow.png‎
diff --git a/‎images/2026-03-diag/report-metrics-dashboard.png‎
109 KB b/‎images/2026-03-diag/report-metrics-dashboard.png‎
diff --git a/‎images/2026-03-diag/report-summary.png‎
26.7 KB b/‎images/2026-03-diag/report-summary.png‎
@@ -0,0 +1,220 @@
+---
+title: "Introducing: Painless support and hands-off architecture reviews"
+description: "Learn how faas-cli diag collects diagnostic data from your OpenFaaS cluster — making support requests faster and architecture reviews hands-off."
+date: 2026-03-10
+categories:
+- kubernetes
+- troubleshooting
+- openfaas-pro
+author_staff_member: han
+dark_background: true
+# image: "/images/2026-03-diag/background.png"
+hide_header_image: true
+---
+
+Learn how faas-cli diag collects diagnostic data from your OpenFaaS cluster — making support requests faster and architecture reviews hands-off.
+
+`faas-cli diag` is a plugin for the faas-cli that collects diagnostic data from your OpenFaaS cluster. We built it to take the friction out of two things: getting help when something's broken, and making sure you're getting the most out of OpenFaaS.
+
+It generates an HTML report you can open in your browser to explore graphs and visualisations, and packages everything into an archive you can quickly share to get support. One command, no manual steps, nothing to forget.
+
+![End-to-end flow for faas-cli diag](/images/2026-03-diag/e2e_flow.png)
+
+**Hands-off support**
+
+When something goes wrong in production, the last thing you want is to be sent to a troubleshooting guide and told to run half a dozen commands. Your product is on fire. People are starting to point the finger of blame. You just want it fixed.
+
+That's why we built `faas-cli diag`, a single command that collects everything we need to help: deployments, function definitions, logs, events, pod status, and Prometheus metrics. Run it, send us the archive, and we can start working on your issue immediately, without a back-and-forth asking you to gather more data.
+
+**Review your architecture**
+
+Beyond troubleshooting, the data and graphs collected by `faas-cli diag` can help you answer broader questions about your setup: are you getting the most value possible from the product? Are there OpenFaaS features that could help with your type of workload? Is there a production incident waiting to happen because something's been mixed up in the `values.yaml`?
+
+The report generated by diag gives you a starting point. You can inspect invocation rates, error rates, replica counts, and resource usage without needing to set up dashboards or port-forward to Prometheus. You can also send us the archive if you'd like help with an architecture review, and we'll come back with recommendations tailored to your setup.
+
+## What does it collect?
+
+The diag tool gathers the following from your cluster:
+
+- **Deployment YAMLs** — exported specs for OpenFaaS core components and functions
+- **Function CRs** — Custom Resource definitions for deployed functions
+- **Kubernetes events** — cluster events from the OpenFaaS and function namespaces
+- **Pod status** — output from `kubectl get` and `kubectl describe` for all relevant pods
+- **Container logs** — streamed via [stern](https://github.com/stern/stern) for real-time and retrospective log collection
+- **Node info** — inventory and descriptions for all cluster nodes
+- **Helm values** — user-supplied values for the OpenFaaS Helm release
+- **Ingress & Gateway API** — Ingress, IngressClass, HTTPRoute, and GatewayClass resources
+- **Network Policies** — NetworkPolicy resources from OpenFaaS and function namespaces
+- **Prometheus metrics** — metrics snapshots and visualisations covering replicas, request rates, latencies, and resource usage
+
+All collected data is written to a local directory and archived into a `.tar.gz` file for easy sharing. The tool is 100% offline — no information is shared with anyone, including OpenFaaS Ltd, by default.
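+
+Because the archive is an ordinary gzip-compressed tarball, you can check what it contains before sending it to anyone. The snippet below is a minimal sketch using standard `tar`; `archive_listing` is a hypothetical helper name rather than part of faas-cli, and the archive path in the usage comment is a placeholder for whatever file your run produced.
+
+```bash
+# archive_listing FILE: list the paths stored in a .tar.gz without extracting it.
+archive_listing() {
+  tar -tzf "$1"
+}
+
+# Usage (placeholder path, substitute your own archive):
+# archive_listing ./run/<your-archive>.tar.gz | less
+```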
+
+## Install the diag plugin
+
+Install the diag plugin using the faas-cli plugin manager:
+
+```bash
+faas-cli plugin get diag
+```
+
+Verify the installation:
+
+```bash
+faas-cli diag version
+```
+
+## Generate a report
+
+By default, diag runs against your currently selected kubectl context. Generate a configuration file, then run the tool:
+
+```bash
+# Generate a `diag.yaml` config file
+faas-cli diag config simple
+
+# Run diagnostics
+faas-cli diag
+```
+
+The first command creates a `diag.yaml` with sensible defaults that work for most setups. The second starts the collection: it sets up port-forwards, streams logs, collects Kubernetes resources, and scrapes Prometheus metrics. Press `Control+C` once to stop gracefully; diag will finish collecting and write all output to disk.
+
+**Staging and production**
+
+If you manage separate clusters for staging and production, you can run diag once per environment. Either switch your kubectl context between runs, or create a dedicated config file per cluster:
+
+```bash
+faas-cli diag config simple > diag-staging.yaml
+faas-cli diag config simple > diag-prod.yaml
+```
+
+Edit each config to set the `context` field and any other parameters for that environment, then generate a report for each:
+
+```bash
+faas-cli diag diag-staging.yaml
+faas-cli diag diag-prod.yaml
+```
+
+For more advanced options like targeting specific functions or using an external Prometheus instance, see the [full configuration reference](#appendix-full-configuration-reference) at the end of this post.
+
+**Running at scale with hundreds of namespaces**
+
+If you're running a multi-tenant setup with hundreds of function namespaces, you probably don't want to collect from all of them at once. Use the `--namespace` flag to target a specific subset:
+
+```bash
+faas-cli diag config simple --namespace staging --namespace production
+```
+
+Or use `'*'` to automatically discover all OpenFaaS function namespaces:
+
+```bash
+faas-cli diag config simple --namespace '*'
+```
+
+<script src="https://asciinema.org/a/tsVGRdQhWh7p32hp.js" id="asciicast-tsVGRdQhWh7p32hp" async="true" data-autoplay="true" data-loop="true"></script>
+
+## Exploring the report
+
+Output is saved to the `./run` directory in a timestamped folder, along with a `.tar.gz` archive ready to share with the OpenFaaS team or colleagues. Open the generated `index.html` file in a browser to explore the collected metrics and inspect graphs:
+
+```bash
+open ./run/2026-03-10_14-30-00/index.html
+```
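+
+If you run diag several times, a small shell helper can pick out the most recent run for you. This is a sketch that relies only on the layout described above: timestamped folders under `./run` whose names sort chronologically (as in `2026-03-10_14-30-00`). The function name `latest_run` is hypothetical, and the sort order will not hold if you give your runs custom names.
+
+```bash
+# latest_run DIR: print the newest timestamped run folder under DIR.
+# Assumes folder names sort chronologically, e.g. 2026-03-10_14-30-00.
+latest_run() {
+  local dir="${1:-./run}"
+  # Keep sub-directories only (skips the .tar.gz archives), sort by name, take the last.
+  find "$dir" -mindepth 1 -maxdepth 1 -type d | sort | tail -n 1
+}
+
+# Usage (on Linux, replace `open` with `xdg-open`):
+# open "$(latest_run ./run)/index.html"
+```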
+
+The report includes visualisations of Prometheus metrics such as function invocation rates, error rates, and replica counts, giving you a quick overview of cluster health without needing to set up Grafana or port-forward to Prometheus yourself.
+
+![The report summary page with quick links to metrics, CRDs, pods, events, and logs per namespace.](/images/2026-03-diag/report-summary.png)
+> The report summary page with quick links to metrics, CRDs, pods, events, and logs per namespace.
+
+![The metrics dashboard showing function replicas, request rates by status code, and execution duration.](/images/2026-03-diag/report-metrics-dashboard.png)
+> The metrics dashboard showing function replicas, request rates by status code, and execution duration.
+
+**AI ready**
+
+The output also includes an `AGENTS.md` file that tells AI coding agents like Claude Code, Codex, and similar tools how to interpret the collected data and diagnose issues from it. This means you can hand the first pass of a support investigation or architecture review to an AI agent and clear up the obvious issues early.
+
+A word of caution: most AI coding plans retain data for anywhere from 30 days to 5 years, and some may train on customer data. Many providers offer a zero data retention option through API-based tokens and/or specific Enterprise plans. We advise reviewing your provider's data handling policies very carefully before sending any potentially sensitive cluster data to an AI agent.
+
+If data privacy is a concern, the realistic paths are:
+
+- Scrub or redact the collected data before passing it to an AI agent
+- Use a local model. OpenFaaS Ltd has tested a number of local models with physical GPUs in airgapped environments
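+
+As a starting point for the first option, a redaction pass can be as simple as a `sed` filter applied to a copy of the collected files. Treat this as a minimal sketch, not a complete scrubber: the two patterns (bearer tokens, and lowercase `password`, `secret`, or `token` key-value pairs) are assumptions about what might be sensitive in your data, and `redact` is a hypothetical helper, not a feature of faas-cli. Always review the result by hand before sharing it.
+
+```bash
+# redact: read text on stdin, write a redacted copy to stdout.
+# The patterns are illustrative and case-sensitive; extend them for your own data.
+redact() {
+  sed -E \
+    -e 's#(Bearer )[A-Za-z0-9._~+/=-]+#\1[REDACTED]#g' \
+    -e 's#((password|secret|token)[[:space:]]*[:=][[:space:]]*)[^[:space:]"]+#\1[REDACTED]#g'
+}
+
+# Usage (file names are placeholders):
+# redact < some-collected-file.log > some-collected-file.redacted.log
+```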
+
+## Useful flags and options
+
+| Flag / Command | Description | Example |
+|---|---|---|
+| `-d/--duration` | Auto-stop after a set duration | `faas-cli diag -d 5m` |
+| `--age` | Collect logs from a past time window | `faas-cli diag --age 1h` |
+| `diag [run-name]` | Custom name for the run (positional argument) | `faas-cli diag incident-456` |
+
+## Wrapping up
+
+The `faas-cli diag` plugin gives you a fast, repeatable way to collect everything needed for support requests and architecture reviews. Instead of manually running a dozen `kubectl` commands, you get a single workflow that captures logs, events, pod status, and metrics — all archived and ready to share.
+
+Whether you're debugging an incident or reviewing your cluster setup, the workflow is the same: run `faas-cli diag` and explore the report. If you need our help, send us the archive.
+
+For more details, see the [Troubleshooting docs](https://docs.openfaas.com/deployment/troubleshooting/).
+
+## Appendix: full configuration reference
+
+Generate the full configuration template with:
+
+```bash
+faas-cli diag config full
+```
+
+```yaml
+# Identify the cluster and kubectl context
+clusterName: "production-cluster"
+context: ""  # Leave empty to use current context
+
+# Namespaces to collect from
+namespaces:
+  openfaas: openfaas
+  functions:
+    - openfaas-fn
+    - staging-fn
+    - production-fn
+
+# Function filter patterns (glob-style)
+functions:
+  - 'api-*'
+  - 'webhook-*'
+
+# Prometheus configuration
+prometheus:
+  enabled: true
+  service: prometheus
+  targetPort: 9090
+  # Use a custom URL if Prometheus is outside the openfaas namespace
+  # url: "http://prometheus.monitoring.svc.cluster.local:9090"
+
+# Gateway configuration
+gateway:
+  enabled: true
+  service: gateway
+  targetPort: 8080
+  autoAuth: true
+
+# What to collect
+collection:
+  deployments: true
+  functionCRs: true
+  events: true
+  podStatus: true
+  logs: true
+  metrics: true
+  logAge: "1h"
+
+# Output directory and run name
+output:
+  directory: "./run"
+  # runName: "incident-123"
+```
+
+A few options worth noting:
+
+- `context` - lets you target a specific kubectl context if you manage multiple clusters. Leave it empty to use whichever context is currently active.
+- `functions` - uses glob patterns to filter which functions are collected. Use `'*'` for all, or patterns like `'api-*'` to narrow the scope on large clusters.
+- `prometheus.url` - lets you point to an external Prometheus instance, bypassing the automatic port-forward.
+- `collection` - toggles let you disable individual collectors if you only need a subset of the data.
+- `collection.logAge` - controls how far back to collect logs retrospectively. Leave it empty to collect all available logs.