Self-healing monitoring stack: FastAPI health checks + Prometheus/Grafana/Alertmanager + RabbitMQ event bus, with automated recovery for cloud/on-prem infrastructure.
RabbitWatch is a small, opinionated control-plane that keeps your critical services (VPN, NAS, message brokers, databases, dashboards, VMs) up and self-healing. A FastAPI service runs periodic checks; when something fails, events flow through RabbitMQ to a Control Plane (CPController) that decides whether to retry, recover, or escalate — and pushes the resulting metrics to Prometheus/Grafana.
RabbitWatch was built to solve a real problem: keeping a small Linux fleet (VPN, NAS, RabbitMQ, MongoDB, Grafana) up and self-healing without paying for a commercial APM such as Datadog or New Relic. It is the drop-in observability + recovery layer I wanted to exist for small-to-medium fleets that can't justify enterprise tooling but still need actionable alerting and hands-off remediation.
The architecture intentionally mirrors enterprise patterns (event-driven decoupling via RabbitMQ, Prometheus/Grafana standard stack, separation between check logic and recovery logic) — proving the same design ideas work at hobby scale and at production scale, with no architectural rewrite in between.
Designed for minimal self-hosting — runs on a single small VPS:
| Component | Cost | Notes |
|---|---|---|
| Linux host (1 vCPU / 2 GB RAM) | ~$5/month | Any small VPS — Hetzner CX11, Vultr, OVH Eco |
| All containers (FastAPI, RabbitMQ, Prometheus, Grafana, MongoDB, Portainer) | $0 | Self-hosted via Docker Compose |
| Optional: Cloudflare Tunnel | Free | If you want HTTPS without exposing ports |
The whole stack fits in ~1.5 GB RAM. The point isn't to compete with $50+/month commercial APMs on features — it's to demonstrate that the same architectural patterns (event-driven, separation of concerns, exporters + scrape model) can run on a single $5 VM when you do the engineering yourself.
```mermaid
flowchart LR
    U["Admin / DevOps"] -->|"GET /monitor"| API
    subgraph API["FastAPI Monitor"]
        HC["Periodic health checks<br/>(TCP · HTTP · MongoDB)"]
        EP["REST endpoint /monitor"]
    end
    HC -->|"KO events"| CP
    subgraph CP["Control Plane"]
        CTRL["CPController<br/>classification"]
        REC["Recovery / escalation"]
        CTRL --> REC
    end
    HC -->|"events + metrics"| MQ
    subgraph MQ["RabbitMQ"]
        QS["queues"]
        PC["Python producer / consumer"]
        QS <--> PC
    end
    PC -->|"write"| DB[("MongoDB<br/>metrics history")]
    subgraph OBS["Observability"]
        EX["Node + MongoDB<br/>exporters"]
        PR["Prometheus"]
        AM["Alertmanager"]
        GF["Grafana<br/>dashboards"]
        EX --> PR --> GF
        PR --> AM
    end
    HC -.->|"scrape"| PR
    PO["Portainer"] -.->|"manages"| API
    PO -.->|"manages"| MQ
    PO -.->|"manages"| OBS
```
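The classification step inside the Control Plane could look roughly like the sketch below. It is not the project's actual `CPController` code: the `KoEvent` fields, the `Action` names, and the thresholds are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    RETRY = "retry"
    RECOVER = "recover"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class KoEvent:
    """A failure event as it might arrive from the check loop (hypothetical schema)."""
    service: str               # e.g. "vpn", "mongodb"
    consecutive_failures: int  # how many checks in a row have failed
    critical: bool             # whether the service counts toward all_critical_ok


def decide(event: KoEvent, retry_limit: int = 3, recover_limit: int = 6) -> Action:
    """Classify a KO event. Thresholds are illustrative, not RabbitWatch defaults."""
    if event.consecutive_failures < retry_limit:
        return Action.RETRY      # probably transient: wait for the next check
    if event.consecutive_failures < recover_limit or not event.critical:
        return Action.RECOVER    # persistent: attempt automated recovery
    return Action.ESCALATE       # critical and still down after recovery attempts


print(decide(KoEvent("vpn", consecutive_failures=4, critical=True)))  # Action.RECOVER
```

Keeping the decision a pure function of the event makes it easy to unit-test separately from the RabbitMQ consumer that would call it.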
- Active health checks: TCP, HTTP (with basic auth), and MongoDB reachability against a YAML-declared set of endpoints.
- Event-driven recovery: failures are published to RabbitMQ; the `CPController` decides on retry / escalation strategy without blocking the check loop.
- Metrics pipeline: a Python producer pushes structured metrics to MongoDB (with TTL indexes for retention) and to Prometheus via exporters; Grafana visualizes them.
- Background thread: health checks run continuously without external schedulers; the REST endpoint just exposes the latest aggregate.
- Portainer-friendly: every component is a standalone container managed via `docker-compose`; Portainer gives a visual UI if you want one.
- Extensible: add a new service type by extending `fastapi_monitor.py` and the YAML schema.
- Hardenable: deployment guidance and threat model are documented in SECURITY.md.
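To make the "active health checks" and "background thread" items concrete, here is a minimal standard-library sketch of a TCP reachability check plus a loop that keeps the latest aggregate in memory. The names `check_tcp` and `Monitor`, the `"ko"` status string, and the 30-second interval are assumptions for illustration, not the contents of `fastapi_monitor.py`.

```python
import socket
import threading


def check_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Return "ok" if a TCP connection can be opened, "ko" otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError:  # refused, unreachable, timed out, DNS failure
        return "ko"


class Monitor:
    """Runs checks in a daemon thread and keeps the latest aggregate in memory."""

    def __init__(self, checks: dict, interval: float = 30.0):
        self._checks = checks      # service name -> zero-argument callable
        self._interval = interval
        self._latest: dict = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()

    def run_once(self) -> None:
        results = {name: check() for name, check in self._checks.items()}
        with self._lock:
            self._latest = results

    def _loop(self) -> None:
        while not self._stop.is_set():
            self.run_once()
            self._stop.wait(self._interval)  # sleeps, but wakes early on stop()

    def start(self) -> None:
        threading.Thread(target=self._loop, daemon=True).start()

    def stop(self) -> None:
        self._stop.set()

    def latest(self) -> dict:
        with self._lock:
            return dict(self._latest)  # copy, so callers cannot mutate shared state
```

A REST handler would then only need to return `monitor.latest()`, which is why a request never has to wait for a slow check.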
| Layer | Component |
|---|---|
| HTTP monitor + REST API | FastAPI, Uvicorn, requests |
| Event bus | RabbitMQ + exporters |
| Metrics store | MongoDB + MongoDB exporter |
| Metrics collection | Prometheus + Alertmanager |
| Visualization | Grafana |
| Orchestration | Docker Compose + systemd |
| Container management | Portainer (optional) |
Pinned Python dependencies live in requirements.txt. Container versions are pinned in docker-compose.yml.
Prerequisites: Docker (+ Compose plugin), a Linux host with at least 2 GB RAM, and one free port for the monitor (default `8000`).
1. Create the Docker network (first run only): `docker network create monitoring`
2. Copy the example config, then edit `monitor_settings.yaml` with your endpoints and credentials: `cp monitor_settings.example.yaml monitor_settings.yaml`
3. Bring the stack up: `docker compose up -d`
4. Hit the monitor: `curl http://localhost:8000/monitor`
Sample response:
{ "vpn": "ok", "nas": "ok", "rabbitmq": "ok", "prometheus": "ok", "grafana": "ok", "portainer": "ok", "mongodb": "ok", "ec2_tcp": "ok", "all_critical_ok": true }
If `all_critical_ok` is `false`, look for the service whose field is not `ok`. A KO event will already have been published to RabbitMQ for the Control Plane to handle.
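For scripting against the endpoint, a small helper can pull the failing services out of the response. This sketch assumes that a failing service reports any value other than `"ok"`; the sample response only shows the healthy case, so the `"ko"` value here is hypothetical.

```python
import json


def failing_services(payload: dict) -> list[str]:
    """Return the names of services whose status is not "ok"."""
    return sorted(
        name
        for name, status in payload.items()
        if name != "all_critical_ok" and status != "ok"
    )


# Hypothetical response with one failing service.
sample = json.loads(
    '{"vpn": "ok", "nas": "ko", "mongodb": "ok", "all_critical_ok": false}'
)
print(failing_services(sample))  # ['nas']
```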
- VPNs and tunnels (TCP reachability)
- NAS and file servers (HTTP endpoints)
- RabbitMQ queues (management API + exporter)
- Prometheus, Grafana, Portainer (health APIs)
- MongoDB clusters (native driver)
- EC2 or any VM (TCP + optional HTTP)
Anything else is one extension of monitor_settings.yaml + one check function in fastapi_monitor.py away.
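As an illustration of that extension path, here is a sketch of a check for a service the stack does not cover today. Everything in it is hypothetical: the `redis_host` / `redis_port` settings keys, the `check_redis` function, and the `run_checks` wiring are invented for the example and do not exist in `fastapi_monitor.py`. The only external fact it relies on is that a Redis server without authentication answers an inline `PING` with `+PONG`.

```python
import socket


def check_redis(host: str, port: int = 6379, timeout: float = 3.0) -> str:
    """Hypothetical new check: send PING and expect the server to answer +PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")
            reply = conn.recv(64)
    except OSError:  # refused, unreachable, or timed out
        return "ko"
    # A server that requires a password replies with an error instead of +PONG,
    # which this sketch also reports as "ko".
    return "ok" if reply.startswith(b"+PONG") else "ko"


def run_checks(settings: dict) -> dict:
    """Run the new check only when its (hypothetical) settings keys are present."""
    results = {}
    if settings.get("redis_host"):
        port = int(settings.get("redis_port", 6379))
        results["redis"] = check_redis(settings["redis_host"], port)
    return results
```

The matching config change would be two new keys in `monitor_settings.yaml`, following the `<service>_host` / `<service>_port` pattern used by the VPN and EC2 entries.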
```yaml
vpn_host: "YOUR_VPN_IP"
vpn_port: 1194
nas_url: "http://YOUR_NAS_IP:9100/metrics"
rabbitmq_api: "http://rabbitmq:15672/api/health/checks/alarms"
rabbitmq_user: "youruser"
rabbitmq_pass: "yourpassword"
prometheus_url: "http://prometheus:9090/-/healthy"
grafana_url: "http://grafana:3000/api/health"
portainer_url: "http://portainer:9000/api/status"
mongodb_uri: "mongodb+srv://youruser:yourpassword@yourcluster.mongodb.net/?authSource=admin"
ec2_host: "YOUR_EC2_IP"
ec2_port: 22
ec2_http_url: null
```

Never commit the real `monitor_settings.yaml`. The repo `.gitignore` already excludes `*.yaml` and `*.env` to prevent accidental leaks. Treat the `*.example.yaml` files as templates only: their values are placeholders, not defaults.
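Because the example values are placeholders, a quick pre-flight check can catch a config that was copied but never edited. The sketch below is not part of RabbitWatch; `find_placeholders` is a hypothetical helper, and its pattern only covers the placeholder strings that appear in the example config above.

```python
import re

# Placeholder fragments taken from the example config: YOUR_VPN_IP, YOUR_NAS_IP,
# YOUR_EC2_IP, youruser, yourpassword, yourcluster.
PLACEHOLDER = re.compile(r"YOUR_[A-Z0-9_]+|youruser|yourpassword|yourcluster")


def find_placeholders(settings: dict) -> list[str]:
    """Return the keys whose string values still contain a template placeholder."""
    return sorted(
        key
        for key, value in settings.items()
        if isinstance(value, str) and PLACEHOLDER.search(value)
    )


settings = {
    "vpn_host": "YOUR_VPN_IP",
    "vpn_port": 1194,
    "rabbitmq_user": "youruser",
    "grafana_url": "http://grafana:3000/api/health",
    "ec2_http_url": None,
}
print(find_placeholders(settings))  # ['rabbitmq_user', 'vpn_host']
```

In practice the dict would come from parsing `monitor_settings.yaml`, for example with PyYAML's `yaml.safe_load`, assuming that is the parser in use.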
```ini
[Unit]
Description=RabbitWatch metrics consumer
After=network.target openvpn-client@VPNConfig.service
Requires=openvpn-client@VPNConfig.service

[Service]
Type=simple
User=ubuntu
WorkingDirectory=/home/ubuntu
ExecStart=/usr/bin/python3 /home/ubuntu/metrics_consumer.py --config /home/ubuntu/config_consumer.yaml
Restart=always
RestartSec=5
Environment=PYTHONUNBUFFERED=1

[Install]
WantedBy=multi-user.target
```
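The consumer that this unit keeps alive writes metrics to MongoDB with TTL-based retention. A minimal sketch of the message-to-document step follows. The `to_document` helper and the `created_at` field name are assumptions, not the actual code in `metrics_consumer.py`.

```python
import json
from datetime import datetime, timezone


def to_document(body: bytes) -> dict:
    """Turn a JSON message body into a document carrying a UTC timestamp."""
    doc = json.loads(body)
    if not isinstance(doc, dict):
        raise ValueError("expected a JSON object")
    # TTL indexes only expire documents whose indexed field holds a date,
    # so the timestamp is stored as a datetime rather than a string.
    doc["created_at"] = datetime.now(timezone.utc)
    return doc


# With pymongo, the matching index could be created once at startup, e.g.:
#   collection.create_index("created_at", expireAfterSeconds=7 * 24 * 3600)
# The 7-day retention is an arbitrary example value.
```

With `Restart=always` in the unit, a consumer that crashes on a malformed message would be restarted after five seconds, so rejecting bad payloads explicitly (as the `ValueError` does here) keeps restart loops visible in the journal rather than silent.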
```text
├── fastapi_monitor.py     # active health checks + REST /monitor
├── api/                   # thin FastAPI wiring
├── agents/                # check agents (CLI demo)
├── cp_core/               # Control Plane: controller + recovery logic
├── consumer/              # RabbitMQ consumer + MongoDB TTL indexes
├── producer/              # metrics producer pushing to RabbitMQ
├── service/               # systemd / service integration helpers
├── script/                # one-off operational scripts
├── docs/                  # additional documentation
├── requirements.txt       # pinned Python deps
├── SECURITY.md            # threat model + reporting
└── README.md
```
See SECURITY.md for the threat model, in-scope / out-of-scope definitions, deployment hardening guidance, and the private reporting channel for vulnerabilities.
The short version:
- Run RabbitWatch behind a trusted network boundary (VPN, VPC, or Cloudflare Tunnel): the `/monitor` endpoint is not currently authenticated.
- Rotate every credential from the example configs before production.
- Grant the MongoDB user least-privilege access to the metrics database only.
- Stack won't start: run `docker compose logs -f`. Most issues are either a missing `monitoring` Docker network or a placeholder still sitting in `monitor_settings.yaml`.
- `/monitor` returns `ok` but Grafana is empty: check that Prometheus is scraping the exporters (`/targets` page) and that the consumer is running (systemd status or the container log).
- Alertmanager silent: verify that `alertmanager.yml` is mounted into the container, and run `docker compose restart alertmanager` after edits.
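The "Grafana is empty" case can also be checked from a script instead of the `/targets` page. The sketch below reads Prometheus's `/api/v1/targets` endpoint and lists the targets that are not healthy. The helper names and the `http://localhost:9090` base URL are assumptions; adjust the URL to wherever Prometheus is reachable in your deployment.

```python
import json
from urllib.request import urlopen


def unhealthy_targets(payload: dict) -> list[tuple[str, str]]:
    """Return (scrapeUrl, lastError) for every active target that is not "up"."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (t.get("scrapeUrl", "?"), t.get("lastError", ""))
        for t in targets
        if t.get("health") != "up"
    ]


def fetch_targets(base_url: str = "http://localhost:9090", timeout: float = 5.0) -> dict:
    """Fetch the target list from Prometheus (base_url is an assumption)."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)


# Typical use, once Prometheus is running:
#   for url, error in unhealthy_targets(fetch_targets()):
#       print(url, error)
```

An empty result while Grafana is still blank points away from scraping and toward the consumer or the dashboard's data source.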
- Optional API-key or mTLS authentication on `/monitor`
- GitHub Actions CI: ruff + pip-audit + bandit on every PR
- Dependabot configuration for weekly dependency hygiene
- Helm chart for Kubernetes deployments (currently Docker Compose only)
Released under the MIT License.
Marco Bellingeri (MK023) — Cloud Platform & Security Engineer. Contributions, issues, and discussions are welcome.