
Commit 4f86af3

Painless support with diag
Signed-off-by: Alex Ellis (OpenFaaS Ltd) <alexellis2@gmail.com>
1 parent d3afcfb commit 4f86af3

File tree

3 files changed

+433
-174
lines changed

_config.yml

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,7 @@ remote_theme: CloudCannon/frisco-jekyll-template@main
55

66
title: OpenFaaS - Serverless Functions Made Simple
77
url: "https://www.openfaas.com"
8-
baseurl:
8+
baseurl: ""
99
google_analytics_key: "G-MX51E38CEB"
1010
# google_maps_javascript_api_key:
1111
disqus_shortname:
@@ -60,6 +60,8 @@ exclude:
6060
- .jekyll-cache
6161
- .DS_Store
6262
- _site
63+
- vendor/bundle
64+
- vendor
6365
- site
6466
- build
6567
- out

_posts/2026-03-10-diagnose-openfaas-clusters.md renamed to _posts/2026-03-13-painless-support-with-diag.md

Lines changed: 65 additions & 41 deletions
Original file line numberDiff line numberDiff line change
@@ -1,36 +1,60 @@
11
---
22
title: "Introducing: Painless support and hands-off architecture reviews"
3-
description: "Learn how faas-cli diag collects diagnostic data from your OpenFaaS cluster — making support requests faster and architecture reviews hands-off."
4-
date: 2026-03-10
3+
description: "Run one command to collect an OpenFaaS cluster report: logs, resources, events, and metrics that you can share for quick help"
4+
date: 2026-03-13
55
categories:
66
- kubernetes
77
- troubleshooting
88
- openfaas-pro
99
author_staff_member: han
10+
author_staff_member_editor: alex
1011
dark_background: true
11-
# image: "/images/2026-03-diag/background.png"
1212
hide_header_image: true
1313
---
1414

15-
Learn how faas-cli diag collects diagnostic data from your OpenFaaS cluster — making support requests faster and architecture reviews hands-off.
15+
Learn how the new `diag` plugin for faas-cli can be used to diagnose issues and make architecture reviews a hands-off exercise.
1616

17-
`faas-cli diag` is a plugin for the faas-cli that collects diagnostic data from your OpenFaaS cluster. We built it to take the friction out of two things: getting help when something's broken, and making sure you're getting the most out of OpenFaaS.
18-
19-
It generates an HTML report you can open in your browser to explore graphs and visualisations, and packages everything into an archive you can quickly share to get support. One command, no manual steps, nothing to forget.
17+
It helps you (or us together) to answer two questions: What's breaking? Are we using OpenFaaS to its full potential?
2018

2119
![End-to-end flow for faas-cli diag](/images/2026-03-diag/e2e_flow.png)
2220

23-
**Hands-off support**
21+
> Diag builds an HTML report, an instructions file for AI agents, and graphs and visualisations, so you can explore the data and share it if necessary to get help. One command, no manual steps, nothing to forget.
22+
23+
## Two case studies
24+
25+
**Misconfiguration leads to an outage in production**
26+
27+
An enterprise customer using OpenFaaS for 3 years accidentally changed their gateway's timeout to 0.5s from 2 hours.
28+
29+
> An inadvertent change to values.yaml on the customer's end enforced a half-second timeout, causing functions to time out unexpectedly. We requested a "diag" run, and within 30 minutes had found the issue, advised the team, and got them up and running again.
30+
31+
**It's always DNS. Actually it was a bad node in EKS.**
32+
33+
A defense contractor in the US that uses OpenFaaS for building AI analytics software started to complain of timeouts and reliability issues in production.
34+
35+
> We sent them the troubleshooting guide, and said "Can you try these?" After a couple of weeks, they'd not run any of the commands, so we sent them specific commands - they ran these and shared the output. It was helpful, but we needed more.
36+
>
37+
> We then went down the route of trying to reproduce the issue locally, and couldn't. We told the team to try HTTP readiness probes, which sometimes cure this kind of issue.
38+
>
39+
> Eventually, after sending commands back and forth over the course of a few days, they sent over a "diag" run.
40+
>
41+
> We saw network timeouts between core Pods like NATS, the Gateway and Prometheus. Even between containers in the same Pod. The insights helped them track it down to an EKS node that had "gone bad" and needed replacement.
42+
43+
## Two main use-cases
44+
45+
**Self-service and pain-free support**
2446

2547
When something goes wrong in production, the last thing you want is to be sent to a troubleshooting guide and told to run half a dozen commands. Your product is on fire. People are starting to point the finger of blame. You just want it fixed.
2648

27-
That's why we built `faas-cli diag`, a single command that collects everything we need to help: deployments, function definitions, logs, events, pod status, and Prometheus metrics. Run it, send us the archive, and we can start working on your issue immediately, without a back-and-forth asking you to gather more data.
49+
Everything that could be relevant is collected: deployments, function definitions, logs, events, pod status, and Prometheus metrics. Run it, send us the archive, and we can start working on your issue immediately, without a back-and-forth asking you to gather more data.
50+
51+
**Architecture review and value extraction**
2852

29-
**Review your architecture**
53+
Beyond troubleshooting, the data and graphs collected by `faas-cli diag` can help you answer broader questions about your setup: are you getting the *most value possible* from the product? Is there an OpenFaaS feature that could help with your type of workload? Is there a production incident waiting to happen because something's been mixed up in the `values.yaml`?
3054

31-
Beyond troubleshooting, the data and graphs collected by `faas-cli diag` can help you answer broader questions about your setup: are you getting the most value possible from the product? Is there an OpenFaaS features that could help with your type of workload? Is there a production incident waiting to happen because something's been mixed up in the `values.yaml`?
55+
The report generated by diag gives you a starting point. You can inspect invocation rates, error rates, replica counts, and resource usage without needing to set up dashboards or port-forward to Prometheus.
3256

33-
The report generated by diag gives you a starting point. You can inspect invocation rates, error rates, replica counts, and resource usage without needing to set up dashboards or port-forward to Prometheus. You can also send us the archive if you'd like help with an architecture review, and we'll come back with recommendations tailored to your setup.
57+
Reviews no longer have to be annual ceremonies.
3458

3559
## What does it collect?
3660

@@ -51,25 +75,20 @@ All collected data is written to a local directory and archived into a `.tar.gz`
5175

5276
## Install the diag plugin
5377

54-
Install the diag plugin using the faas-cli plugin manager:
78+
Install the plugin, then check the version. It's useful to run this command before every run, because we're actively improving the tool as we get feedback.
5579

5680
```bash
5781
faas-cli plugin get diag
58-
```
59-
60-
Verify the installation:
61-
62-
```bash
6382
faas-cli diag version
6483
```
6584

6685
## Generate a report
6786

68-
By default, diag runs against your currently selected kubectl context. Generate a configuration file, then run the tool:
87+
By default, `diag` reads configuration from `diag.yaml` in your current directory. Generate that file first, then run the tool:
6988

7089
```bash
7190
# Generate a `diag.yaml` config file
72-
faas-cli diag config simple
91+
faas-cli diag config simple > diag.yaml
7392

7493
# Run diagnostics
7594
faas-cli diag
@@ -79,18 +98,20 @@ The first command creates a `diag.yaml` with sensible defaults that works for mo
7998

8099
**Staging and production**
81100

82-
If you manage separate clusters for staging and production, you can run diag multiple times against each environment. Either switch your kubectl context between runs, or create a dedicated config file per cluster:
101+
Here's how you could collect data from both production and staging:
83102

84103
```bash
85-
faas-cli diag config simple > diag-staging.yaml
86-
faas-cli diag config simple > diag-prod.yaml
87-
```
104+
mkdir ~/diag
105+
cd ~/diag
88106

89-
Edit each config to set the `context` field and any other parameters for that environment, then generate a report for each:
107+
# Generate an initial config:
108+
faas-cli diag config simple > diag.yaml
90109

91-
```bash
92-
faas-cli diag diag-staging.yaml
93-
faas-cli diag diag-prod.yaml
110+
kubectl config use-context eks-staging-us-east-1
111+
faas-cli diag "staging"
112+
113+
kubectl config use-context eks-prod-us-east-1
114+
faas-cli diag "prod"
94115
```
95116

96117
For more advanced options like targeting specific functions or using an external Prometheus instance, see the [full configuration reference](#appendix-full-configuration-reference) at the end of this post.
@@ -100,7 +121,7 @@ For more advanced options like targeting specific functions or using an external
100121
If you're running a multi-tenant setup with hundreds of function namespaces, you probably don't want to collect from all of them at once. Use the `--namespace` flag to target a specific subset:
101122

102123
```bash
103-
faas-cli diag config simple --namespace staging --namespace production
124+
faas-cli diag config simple --namespace tenant-1 --namespace tenant-2
104125
```
105126

106127
Or use `'*'` to automatically discover all OpenFaaS function namespaces:
@@ -113,11 +134,12 @@ faas-cli diag config simple --namespace '*'
113134

114135
## Exploring the report
115136

116-
Output is saved to the `./run` directory in a timestamped folder, along with a `.tar.gz` archive ready to share with the OpenFaaS team or colleagues. Open the generated `index.html` file in a browser to explore the collected metrics and inspect graphs:
137+
Data is saved to `./run` - either with a date and timestamp, or with the name of the run you passed.
117138

118-
```bash
119-
open ./run/2026-03-10_14-30-00/index.html
120-
```
139+
* `diag "prod"` creates `./run/prod/`
140+
* `diag` on its own creates a timestamped folder, e.g. `./run/2026-03-10_14-30-00/`
141+
142+
To explore the data, you can open the `index.html` file in those folders.
121143

122144
The report includes visualisations of Prometheus metrics such as function invocation rates, error rates, and replica counts, giving you a quick overview of cluster health without needing to set up Grafana or port-forward to Prometheus yourself.
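If you want to reproduce these views against your own Prometheus, queries along these lines show the kind of data involved. The metric names are the ones exposed by the OpenFaaS gateway, but treat the exact expressions as illustrative rather than the plugin's own queries:

```promql
# Per-function invocation rate, broken down by HTTP status code
sum by (function_name, code) (rate(gateway_function_invocation_total[1m]))

# Current replica count per function
gateway_service_count
```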
123145

@@ -127,16 +149,18 @@ The report includes visualisations of Prometheus metrics such as function invoca
127149
![The metrics dashboard showing function replicas, request rates by status code, and execution duration.](/images/2026-03-diag/report-metrics-dashboard.png)
128150
> The metrics dashboard showing function replicas, request rates by status code, and execution duration.
129151
130-
**AI ready**
152+
**Diag is AI ready**
131153

132-
The output also includes an `AGENTS.md` file that instructs AI coding agents like Claude Code, Codex, and similar tools to interpret and diagnose issues from the collected data. This means you can outsource the first pass of a support investigation or architecture review to an AI agent, clearing up any initial issues.
154+
The output also includes an `AGENTS.md` file that instructs AI coding agents like Claude Code, Codex, and similar tools to interpret and diagnose issues from the collected data. This gives you a fast first pass for support investigations or architecture reviews using AI, while keeping the decision loop with your team.
133155

134-
A word of caution: most AI coding plans retain data from anywhere between 30 days to 5 years, and some may train on customer data. Many providers offer a zero data retention option through API-based tokens and/or specific Enterprise plans. We advise very careful review of your provider's data handling policies before sending any potentially sensitive cluster data to an AI agent.
156+
But before you load up Claude Code, Codex, or Gemini, make sure that your organisation has at least one of the following in place:
135157

136-
If data privacy is a concern, the realistic paths are:
158+
- A zero-data retention agreement with your inference provider.
159+
- Your own private deployment of a model to Azure/AWS/Google etc., with approved data policies.
160+
- Access to private, airgapped local GPUs and AI models.
161+
- A process to redact all credentials, tokens, customer identifiers, or other confidential information from the output.
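
As a sketch of that last point, a simple pre-flight check is to scan the extracted run directory for obvious secret patterns before sharing anything with an AI tool. The `./run/prod` path and the patterns below are examples only; tune both for your environment:

```shell
# Hypothetical pre-flight check: scan a diag run directory for secret-like strings.
run_dir=./run/prod
if grep -rniE 'password|passwd|token|secret|api[_-]?key' "$run_dir" 2>/dev/null; then
  echo "Review and redact the matches above before sharing"
else
  echo "No obvious secret patterns found"
fi
```

A pattern scan like this is a safety net, not a guarantee; a human review of the archive is still worthwhile.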
137162

138-
- Scrub or redact the collected data before passing it to an AI agent
139-
- Use a local model. OpenFaaS Ltd has tested a number of local models with physical GPUs in airgapped environments
163+
If in doubt, do not use any form of AI with the output; most issues can be found by humans on your end or ours.
140164

141165
## Useful flags and options
142166

@@ -148,7 +172,7 @@ If data privacy is a concern, the realistic paths are:
148172

149173
## Wrapping up
150174

151-
The `faas-cli diag` plugin gives you a fast, repeatable way to collect everything needed for support requests and architecture reviews. Instead of manually running a dozen `kubectl` commands, you get a single workflow that captures logs, events, pod status, and metrics — all archived and ready to share.
175+
The new `faas-cli diag` plugin gives you a fast, repeatable way to collect everything needed for support requests and architecture reviews. Instead of manually running a dozen `kubectl` commands, you get a single workflow that captures logs, events, pod status, and metrics — all archived and ready to share.
152176

153177
Whether you're debugging an incident or reviewing your cluster setup, the workflow is the same: run `faas-cli diag` and explore the report. If you need our help, send us the archive.
154178

@@ -216,5 +240,5 @@ A few options worth noting:
216240
- `context` - lets you target a specific kubectl context if you manage multiple clusters. Leave it empty to use whichever context is currently active.
217241
- `functions` - uses glob patterns to filter which functions are collected. Use `'*'` for all, or patterns like `'api-*'` to narrow the scope on large clusters.
218242
- `prometheus.url` - lets you point to an external Prometheus instance, bypassing the automatic port-forward.
219-
- `collection` - toggles let you disable individual collectors if you only need a subset of the data.
243+
- `collection` - toggles to disable individual collectors if you only need a subset of the data.
220244
- `logAge` - controls how far back to collect logs retrospectively. Leave it empty to collect all available logs.
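
Pulling those options together, a `diag.yaml` might look like the following. This is an illustrative sketch only: the field names come from the reference above, but the nesting and the `collection` sub-keys shown here are assumptions that may differ between plugin versions, so generate the real file with `faas-cli diag config simple`:

```yaml
# Illustrative only - generate the real file with: faas-cli diag config simple
context: eks-prod-us-east-1   # kubectl context; leave empty for the active one
functions:
  - "api-*"                   # glob patterns; use "*" for all functions
prometheus:
  url: "http://prometheus.example.com:9090"  # external instance, skips the port-forward
collection:                   # sub-keys assumed; toggles for individual collectors
  logs: true
  events: true
logAge: 24h                   # how far back to collect logs; empty = all available
```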
