RabbitWatch

Self-healing monitoring stack: FastAPI health checks + Prometheus/Grafana/Alertmanager + RabbitMQ event bus, with automated recovery for cloud/on-prem infrastructure.

RabbitWatch is a small, opinionated control plane that keeps your critical services (VPN, NAS, message brokers, databases, dashboards, VMs) up and self-healing. A FastAPI service runs periodic checks; when something fails, events flow through RabbitMQ to a Control Plane (CPController) that decides whether to retry, recover, or escalate, and the resulting metrics reach Prometheus and Grafana.

It is designed as a drop-in observability + recovery layer for small-to-medium Linux fleets that can't justify a full commercial APM, but still need actionable alerting and hands-off remediation.

Why

Built to solve a real problem: keeping a small Linux fleet (VPN, NAS, RabbitMQ, MongoDB, Grafana) up and self-healing without paying for a commercial APM like Datadog or New Relic. It is the observability and recovery layer I wanted to exist for fleets of that size.

The architecture intentionally mirrors enterprise patterns (event-driven decoupling via RabbitMQ, Prometheus/Grafana standard stack, separation between check logic and recovery logic) — proving the same design ideas work at hobby scale and at production scale, with no architectural rewrite in between.

Operating cost

Designed for minimal self-hosting — runs on a single small VPS:

| Component | Cost | Notes |
| --- | --- | --- |
| Linux host (1 vCPU / 2 GB RAM) | ~$5/month | Any small VPS: Hetzner CX11, Vultr, OVH Eco |
| All containers (FastAPI, RabbitMQ, Prometheus, Grafana, MongoDB, Portainer) | $0 | Self-hosted via Docker Compose |
| Optional: Cloudflare Tunnel | Free | If you want HTTPS without exposing ports |

The whole stack fits in ~1.5 GB RAM. The point isn't to compete with $50+/month commercial APMs on features — it's to demonstrate that the same architectural patterns (event-driven, separation of concerns, exporters + scrape model) can run on a single $5 VM when you do the engineering yourself.

Architecture

```mermaid
flowchart LR
  U["Admin / DevOps"] -->|"GET /monitor"| API

  subgraph API["FastAPI Monitor"]
    HC["Periodic health checks<br/>(TCP · HTTP · MongoDB)"]
    EP["REST endpoint /monitor"]
  end

  HC -->|"KO events"| CP
  subgraph CP["Control Plane"]
    CTRL["CPController<br/>classification"]
    REC["Recovery / escalation"]
    CTRL --> REC
  end

  HC -->|"events + metrics"| MQ
  subgraph MQ["RabbitMQ"]
    QS["queues"]
    PC["Python producer / consumer"]
    QS <--> PC
  end

  PC -->|"write"| DB[("MongoDB<br/>metrics history")]

  subgraph OBS["Observability"]
    EX["Node + MongoDB<br/>exporters"]
    PR["Prometheus"]
    AM["Alertmanager"]
    GF["Grafana<br/>dashboards"]
    EX --> PR --> GF
    PR --> AM
  end

  HC -.->|"scrape"| PR
  PO["Portainer"] -.->|"manages"| API
  PO -.->|"manages"| MQ
  PO -.->|"manages"| OBS
```
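
The Control Plane box classifies KO events and picks between retry, recovery, and escalation. The project's actual rules live in cp_core/ and are not reproduced here; the following is only a minimal sketch of one plausible policy, based on counting consecutive failures per service. The names `Action` and `FailureTracker` and both thresholds are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    RETRY = "retry"
    RECOVER = "recover"
    ESCALATE = "escalate"


@dataclass
class FailureTracker:
    """Counts consecutive KO events per service (hypothetical helper)."""

    retry_limit: int = 2    # assumed threshold, not taken from the project
    recover_limit: int = 4  # assumed threshold, not taken from the project
    counts: dict = field(default_factory=dict)

    def on_ko(self, service: str) -> Action:
        """Record one KO event and classify the next step."""
        n = self.counts.get(service, 0) + 1
        self.counts[service] = n
        if n <= self.retry_limit:
            return Action.RETRY
        if n <= self.recover_limit:
            return Action.RECOVER
        return Action.ESCALATE

    def on_ok(self, service: str) -> None:
        """A successful check resets the streak for that service."""
        self.counts.pop(service, None)
```

Keeping the decision in a small object like this means the check loop only has to report KO and OK events, which matches the separation between check logic and recovery logic described above.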

Features

Active health checks — TCP, HTTP (with basic auth), and MongoDB reachability against a YAML-declared set of endpoints.
Event-driven recovery — failures are published to RabbitMQ; the CPController decides on retry / escalation strategy without blocking the check loop.
Metrics pipeline — a Python producer pushes structured metrics to MongoDB (with TTL indexes for retention) and to Prometheus via exporters; Grafana visualizes them.
Background thread — health checks run continuously without external schedulers; the REST endpoint just exposes the latest aggregate.
Portainer-friendly — every component is a standalone container managed via docker-compose; Portainer gives a visual UI if you want one.
Extensible — add a new service type by extending fastapi_monitor.py and the YAML schema.
Hardenable — deployment guidance and threat model are documented in SECURITY.md.
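
To illustrate the background-thread feature, here is a minimal sketch of a loop that runs checks on an interval and keeps only the latest aggregate for a REST handler to return. It is not the code in fastapi_monitor.py: the class name `BackgroundMonitor`, the `"ko"` status string, and the choice to treat every check as critical are assumptions.

```python
import threading
from typing import Callable, Dict


class BackgroundMonitor:
    """Runs check callables in a daemon thread and keeps the latest results."""

    def __init__(self, checks: Dict[str, Callable[[], bool]], interval: float = 30.0):
        self._checks = checks
        self._interval = interval
        self._latest: Dict[str, str] = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def run_once(self) -> None:
        """Run every check once; an exception counts as a failure."""
        results = {}
        for name, check in self._checks.items():
            try:
                results[name] = "ok" if check() else "ko"
            except Exception:
                results[name] = "ko"
        with self._lock:
            self._latest = results

    def _loop(self) -> None:
        while not self._stop.is_set():
            self.run_once()
            # wait() returns early if stop() is called, so shutdown is prompt.
            self._stop.wait(self._interval)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def snapshot(self) -> Dict[str, object]:
        """Return a copy of the latest results plus an aggregate flag."""
        with self._lock:
            out: Dict[str, object] = dict(self._latest)
        # Assumption: every registered check is treated as critical.
        out["all_critical_ok"] = bool(out) and all(v == "ok" for v in out.values())
        return out
```

A REST handler would then only call `snapshot()`, so a slow or hanging check never delays the HTTP response.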

Stack

| Layer | Component |
| --- | --- |
| HTTP monitor + REST API | FastAPI, Uvicorn, `requests` |
| Event bus | RabbitMQ + exporters |
| Metrics store | MongoDB + MongoDB exporter |
| Metrics collection | Prometheus + Alertmanager |
| Visualization | Grafana |
| Orchestration | Docker Compose + systemd |
| Container management | Portainer (optional) |

Pinned Python dependencies live in requirements.txt. Container versions are pinned in docker-compose.yml.

Quick start

Prerequisites: Docker (+ Compose plugin), a Linux host with at least 2 GB RAM, and one free port for the monitor (default 8000).

Create the Docker network (first run only):
```
docker network create monitoring
```

Copy and edit the config:

```
cp monitor_settings.example.yaml monitor_settings.yaml
# then edit monitor_settings.yaml with your endpoints and credentials
```

Bring the stack up:
```
docker compose up -d
```

Hit the monitor:

```
curl http://localhost:8000/monitor
```

Sample response:

```
{
  "vpn": "ok",
  "nas": "ok",
  "rabbitmq": "ok",
  "prometheus": "ok",
  "grafana": "ok",
  "portainer": "ok",
  "mongodb": "ok",
  "ec2_tcp": "ok",
  "all_critical_ok": true
}
```

If all_critical_ok is false, the per-service fields show which check failed. By that point a KO event for the failing service has already been published to RabbitMQ for the Control Plane to handle.
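
For scripting against the endpoint, a small client can pull the report and list the services that are not healthy. This is a sketch under one assumption: a failing service reports some value other than `"ok"` (the exact failure value is not documented here). The function names `fetch_report` and `failing_services` are hypothetical.

```python
import json
import urllib.request
from typing import Dict, List


def fetch_report(url: str = "http://localhost:8000/monitor",
                 timeout: float = 5.0) -> Dict[str, object]:
    """GET the monitor endpoint and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def failing_services(report: Dict[str, object]) -> List[str]:
    """List the services whose status is anything other than "ok"."""
    return sorted(
        name
        for name, status in report.items()
        # The aggregate flag is not a service, so skip it.
        if name != "all_critical_ok" and status != "ok"
    )
```

With the stack running, `failing_services(fetch_report())` returns an empty list for the sample response above.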

What you can monitor out of the box

VPNs and tunnels (TCP reachability)
NAS and file servers (HTTP endpoints)
RabbitMQ queues (management API + exporter)
Prometheus, Grafana, Portainer (health APIs)
MongoDB clusters (native driver)
EC2 or any VM (TCP + optional HTTP)

Anything else is one extension of monitor_settings.yaml + one check function in fastapi_monitor.py away.
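
As an idea of how small such a check function can be, here is a sketch of a TCP reachability probe using only the standard library. The name `tcp_check` and the default timeout are assumptions; the function actually used in fastapi_monitor.py may look different.

```python
import socket


def tcp_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and applies the timeout to connect().
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For the configuration shown in this README, a VPN probe would be called as `tcp_check(vpn_host, vpn_port)` with the values read from monitor_settings.yaml.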

Configuration example

```
vpn_host: "YOUR_VPN_IP"
vpn_port: 1194

nas_url: "http://YOUR_NAS_IP:9100/metrics"

rabbitmq_api: "http://rabbitmq:15672/api/health/checks/alarms"
rabbitmq_user: "youruser"
rabbitmq_pass: "yourpassword"

prometheus_url: "http://prometheus:9090/-/healthy"
grafana_url:    "http://grafana:3000/api/health"
portainer_url:  "http://portainer:9000/api/status"

mongodb_uri: "mongodb+srv://youruser:yourpassword@yourcluster.mongodb.net/?authSource=admin"

ec2_host: "YOUR_EC2_IP"
ec2_port: 22
ec2_http_url: null
```

Never commit the real monitor_settings.yaml. The repo .gitignore already excludes *.yaml and *.env to prevent accidental leaks. Treat the *.example.yaml files as templates only — their values are placeholders, not defaults.
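
Because the example values are placeholders, it can help to refuse to start while any of them remain. The sketch below scans an already-parsed settings dictionary for the placeholder strings used in the example above. The function name `find_placeholders` and the marker list are assumptions, and RabbitWatch itself may not perform this check.

```python
from typing import Dict, List

# Marker strings taken from the example config above; extend as needed.
PLACEHOLDER_MARKERS = ("YOUR_", "youruser", "yourpassword", "yourcluster")


def find_placeholders(settings: Dict[str, object]) -> List[str]:
    """Return the keys whose string values still contain a placeholder."""
    return sorted(
        key
        for key, value in settings.items()
        # Non-string values (ports, null) cannot hold a placeholder.
        if isinstance(value, str)
        and any(marker in value for marker in PLACEHOLDER_MARKERS)
    )
```

If the file is parsed with PyYAML, the dictionary would come from `yaml.safe_load`, and a non-empty result from `find_placeholders` is a reasonable cue to exit with an error before any check runs.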

Running the Python consumer as a systemd service

```
[Unit]
Description=RabbitWatch metrics consumer
After=network.target openvpn-client@VPNConfig.service
Requires=openvpn-client@VPNConfig.service

[Service]
Type=simple
User=ubuntu
WorkingDirectory=/home/ubuntu
ExecStart=/usr/bin/python3 /home/ubuntu/metrics_consumer.py --config /home/ubuntu/config_consumer.yaml
Restart=always
RestartSec=5
Environment=PYTHONUNBUFFERED=1

[Install]
WantedBy=multi-user.target
```

Repository layout

```
.
├── fastapi_monitor.py    # active health checks + REST /monitor
├── api/                  # thin FastAPI wiring
├── agents/               # check agents (CLI demo)
├── cp_core/              # Control Plane: controller + recovery logic
├── consumer/             # RabbitMQ consumer + MongoDB TTL indexes
├── producer/             # metrics producer pushing to RabbitMQ
├── service/              # systemd / service integration helpers
├── script/               # one-off operational scripts
├── docs/                 # additional documentation
├── requirements.txt      # pinned Python deps
├── SECURITY.md           # threat model + reporting
└── README.md
```

Security

See SECURITY.md for the threat model, in-scope / out-of-scope definitions, deployment hardening guidance, and the private reporting channel for vulnerabilities.

The short version:

Run RabbitWatch behind a trusted network boundary (VPN, VPC, or Cloudflare Tunnel) — the /monitor endpoint is not currently authenticated.
Rotate every credential from the example configs before production.
Grant the MongoDB user least-privilege access to the metrics database only.

Troubleshooting

Stack won't start: run `docker compose logs -f`. Most issues are either a missing monitoring Docker network or a placeholder still sitting in monitor_settings.yaml.
/monitor returns ok but Grafana is empty: check that Prometheus is scraping the exporters (/targets page) and that the consumer is running (systemd status or the container log).
Alertmanager silent: verify that alertmanager.yml is mounted into the container, and run `docker compose restart alertmanager` after edits.

Roadmap

Optional API-key or mTLS authentication on /monitor
GitHub Actions CI: ruff + pip-audit + bandit on every PR
Dependabot configuration for weekly dependency hygiene
Helm chart for Kubernetes deployments (currently Docker Compose only)

License

Released under the MIT License.

Author

Marco Bellingeri (MK023) — Cloud Platform & Security Engineer. Contributions, issues, and discussions are welcome.
