---
title: "Introducing: Painless support and hands-off architecture reviews"
description: "Run one command to collect an OpenFaaS cluster report: logs, resources, events, and metrics that you can share for quick help"
date: 2026-03-13
categories:
- kubernetes
- troubleshooting
- openfaas-pro
author_staff_member: han
author_staff_member_editor: alex
dark_background: true
hide_header_image: true
---

Learn how the new `diag` plugin for faas-cli can be used to diagnose issues and make architecture reviews a hands-off exercise.

It helps you (or us together) to answer two questions: What's breaking? Are we using OpenFaaS to its full potential?

> Diag builds an HTML report, an instructions file for AI agents, graphs, and visualisations so you can explore the data and share it if necessary to get help. One command, no manual steps, nothing to forget.

## Two case studies

**Misconfiguration leads to an outage in production**

An enterprise customer that has used OpenFaaS for 3 years accidentally changed their gateway's timeout from 2 hours to 0.5s.

> An inadvertent change to values.yaml on the customer's end enforced a half-second timeout, causing functions to time out unexpectedly. We requested a "diag" run, and within 30 minutes had found the issue, advised the team, and got them up and running again.

**It's always DNS. Actually it was a bad node in EKS.**

A defense contractor in the US that uses OpenFaaS for building AI analytics software started to complain of timeouts and reliability issues in production.

> We sent them the troubleshooting guide, and said "Can you try these?" After a couple of weeks, they'd not run any of the commands, so we sent them specific commands. They ran these and shared the output. It was helpful, but we needed more.
>
> We then went down the route of trying to reproduce the issue locally, and couldn't. We told the team to try HTTP readiness probes, which sometimes cure this kind of issue.
>
> Eventually, after sending commands back and forth over the course of a few days, they sent over a "diag" run.
>
> We saw network timeouts between core Pods like NATS, the Gateway and Prometheus. Even between containers in the same Pod. The insights helped them track it down to an EKS node that had "gone bad" and needed replacement.

## Two main use-cases

**Self-service and pain-free support**

When something goes wrong in production, the last thing you want is to be sent to a troubleshooting guide and told to run half a dozen commands. Your product is on fire. People are starting to point the finger of blame. You just want it fixed.

Everything that could be relevant is collected: deployments, function definitions, logs, events, pod status, and Prometheus metrics. Run it, send us the archive, and we can start working on your issue immediately, without a back-and-forth asking you to gather more data.

**Architecture review and value extraction**

Beyond troubleshooting, the data and graphs collected by `faas-cli diag` can help you answer broader questions about your setup: are you getting the *most value possible* from the product? Is there an OpenFaaS feature that could help with your type of workload? Is there a production incident waiting to happen because something's been mixed up in the `values.yaml`?

The report generated by diag gives you a starting point. You can inspect invocation rates, error rates, replica counts, and resource usage without needing to set up dashboards or port-forward to Prometheus.

Reviews no longer have to be annual ceremonies.

## What does it collect?

All collected data is written to a local directory and archived into a `.tar.gz` file.
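
Before sharing an archive, it can be reassuring to see exactly what's inside it. Here's one way to do that with standard tools - the timestamped filename is illustrative, so substitute the one from your own run:

```shell
# List the contents of a diag archive without extracting it.
# The filename below is an example - use your own run's archive.
tar -tzf ./run/2026-03-10_14-30-00.tar.gz 2>/dev/null | head -n 20
```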
## Install the diag plugin

Install the plugin, then check its version. It's worth running `faas-cli plugin get diag` before every run, because we're actively improving the tool as we get feedback.

```bash
faas-cli plugin get diag
faas-cli diag version
```

## Generate a report

By default, `diag` reads configuration from `diag.yaml` in your current directory. Generate that file first, then run the tool:

```bash
# Generate a `diag.yaml` config file
faas-cli diag config simple > diag.yaml

# Run diagnostics
faas-cli diag
```

The first command creates a `diag.yaml` with sensible defaults that works for most clusters.

**Staging and production**

Here's how you could collect data from both production and staging:

```bash
mkdir ~/diag
cd ~/diag

# Generate an initial config:
faas-cli diag config simple > diag.yaml

kubectl config use-context eks-staging-us-east-1
faas-cli diag "staging"

kubectl config use-context eks-prod-us-east-1
faas-cli diag "prod"
```

For more advanced options like targeting specific functions or using an external Prometheus instance, see the [full configuration reference](#appendix-full-configuration-reference) at the end of this post.

If you're running a multi-tenant setup with hundreds of function namespaces, you probably don't want to collect from all of them at once. Use the `--namespace` flag to target a specific subset:

```bash
faas-cli diag config simple --namespace staging --namespace production
```

Data is saved to `./run` - either with a date and timestamp, or with the name of the run you passed.

* `diag "prod"` creates `./run/prod/`
* `diag` on its own creates a timestamped folder, e.g. `./run/2026-03-10_14-30-00/`

To explore the data, you can open the `index.html` file in those folders.

The report includes visualisations of Prometheus metrics such as function invocation rates, error rates, and replica counts, giving you a quick overview of cluster health without needing to set up Grafana or port-forward to Prometheus yourself.
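
For example, the request-rate panel corresponds to the kind of query you would otherwise write by hand against the gateway's standard invocation counter. A sketch in PromQL - the report's exact queries may differ:

```promql
# Requests per second, broken down by function and HTTP status code
sum(rate(gateway_function_invocation_total[1m])) by (function_name, code)
```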

> The metrics dashboard showing function replicas, request rates by status code, and execution duration.

**Diag is AI ready**

The output also includes an `AGENTS.md` file that instructs AI coding agents like Claude Code, Codex, and similar tools to interpret and diagnose issues from the collected data. This gives you a fast first pass for support investigations or architecture reviews using AI, while keeping the decision loop with your team.

But before you load up Claude Code, Codex, or Gemini, make sure that your organisation has the following in place:

- A zero-data retention agreement with your inference provider.
159
+
- Your own private deploymeny of a model to Azure/AWS/Google etc, with approved data policies.
160
+
- Or access to private, airgapped local GPUs and AI models.
161
+
- Have redacted all credentials, tokens, customer identifiers or confidential information
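
On the last point, a rough first pass is easy to script. This is a non-exhaustive sketch (the folder path is illustrative) and no substitute for a manual review:

```shell
# Scan a diag run folder for obvious secret-like strings before sharing.
# The path is an example - use the folder from your own run.
grep -rniE 'password|secret|token|api[_-]?key' ./run/2026-03-10_14-30-00/ 2>/dev/null \
  || echo "No obvious matches"
```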

If in doubt, do not use any form of AI with the output; most issues can be found by humans on your end or ours.

## Useful flags and options

## Wrapping up

The new `faas-cli diag` plugin gives you a fast, repeatable way to collect everything needed for support requests and architecture reviews. Instead of manually running a dozen `kubectl` commands, you get a single workflow that captures logs, events, pod status, and metrics — all archived and ready to share.

Whether you're debugging an incident or reviewing your cluster setup, the workflow is the same: run `faas-cli diag` and explore the report. If you need our help, send us the archive.

## Appendix: Full configuration reference

A few options worth noting:

- `context` - lets you target a specific kubectl context if you manage multiple clusters. Leave it empty to use whichever context is currently active.
- `functions` - uses glob patterns to filter which functions are collected. Use `'*'` for all, or patterns like `'api-*'` to narrow the scope on large clusters.
- `prometheus.url` - lets you point to an external Prometheus instance, bypassing the automatic port-forward.
- `collection` - toggles to disable individual collectors if you only need a subset of the data.
- `logAge` - controls how far back to collect logs retrospectively. Leave it empty to collect all available logs.
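
Putting those options together, a `diag.yaml` might look along these lines. This is an illustrative sketch only - the field names come from the list above, but the exact structure and defaults should be taken from the generated file:

```yaml
# Illustrative sketch - generate the real file with:
#   faas-cli diag config simple > diag.yaml
context: ""        # kubectl context to target; empty uses the active context
functions: "*"     # glob pattern, e.g. 'api-*', to filter collected functions
prometheus:
  url: ""          # point at an external Prometheus to skip the port-forward
logAge: ""         # how far back to collect logs; empty collects everything
```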