AI-powered incident diagnosis for on-call engineers.
Agentel is a multi-agent system that automatically triages and diagnoses production incidents across microservices platforms. It turns noisy telemetry into actionable decisions with measurable confidence.
When a SEV-1 incident hits, you need answers fast. Agentel:
- Identifies the root cause — What broke and why
- Shows blast radius — Which services and users are affected
- Recommends safe remediation — What to do, with safety checks
- Knows when to escalate — When uncertainty is too high for automation
Target: Reduce MTTR for checkout-impacting incidents by 40%.
```bash
# Clone and install
git clone https://github.com/marcospolanco/agentel.git
cd agentel
pip install -r requirements.txt

# Run the dashboard
streamlit run ui/dashboard.py

# Run tests
pytest tests/ -v

# Run evaluation harness
python evals/eval_harness.py --all
```
(Screenshot: root cause identified with confidence, blast radius flow card, and recommended action.)
Agentel uses a 3-agent pipeline focused on guiding SRE attention:
| Agent | Responsibility |
|---|---|
| Context | Maps service dependencies, calculates blast radius |
| Diagnosis | Analyzes telemetry to identify root cause |
| Validation | Checks remediations against architectural constraints |
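As a rough illustration of what "calculates blast radius" can mean, the sketch below walks a dependency map backwards from a failed service to find everything that transitively depends on it. The `DEPENDS_ON` map, the service names, and the `blast_radius` function are assumptions made for this example, not the Context agent's actual data model or API.

```python
from collections import deque

# Hypothetical topology: service -> services it calls (illustrative names only).
DEPENDS_ON = {
    "checkout": ["payments", "cart"],
    "cart": ["inventory"],
    "payments": ["ledger-db"],
    "inventory": [],
    "ledger-db": [],
}


def blast_radius(failed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every service that transitively depends on `failed`."""
    # Invert the edges so we can walk from a dependency up to its callers.
    callers: dict[str, set[str]] = {svc: set() for svc in depends_on}
    for svc, deps in depends_on.items():
        for dep in deps:
            callers.setdefault(dep, set()).add(svc)

    # Breadth-first walk over the in-memory graph; no network calls involved.
    affected: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected


print(sorted(blast_radius("ledger-db", DEPENDS_ON)))
```

Because the traversal only touches an in-memory map, a sketch like this stays compatible with an offline topology requirement.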
```text
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Incident   │───>│   Context   │───>│  Diagnosis  │───>│ Validation  │
│  Telemetry  │    │    Agent    │    │    Agent    │    │    Agent    │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
                                             │                  │
                                             ▼                  ▼
                                      ┌─────────────┐    ┌─────────────┐
                                      │ Root Cause  │    │   Safety    │
                                      │ Hypothesis  │    │    Check    │
                                      └─────────────┘    └─────────────┘
```
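The flow in the diagram can be sketched as three stages applied in order to a shared incident record. Everything here is a simplified stand-in: the `Incident` dataclass, the stage functions, `SAFE_ACTIONS`, and `ESCALATION_THRESHOLD` are hypothetical names chosen for illustration, and the rule-based logic is a toy, not the project's agents.

```python
from dataclasses import dataclass, field

# Hypothetical allow-list standing in for architectural constraints.
SAFE_ACTIONS = {"rollback_deploy", "restart_pods"}
# Hypothetical confidence floor below which a human should decide.
ESCALATION_THRESHOLD = 0.6


@dataclass
class Incident:
    incident_id: str
    error_rates: dict[str, float]  # service -> error rate (illustrative telemetry)
    blast_radius: list[str] = field(default_factory=list)
    root_cause: str = ""
    confidence: float = 0.0
    action: str = ""
    action_is_safe: bool = False
    escalate: bool = True


def context_stage(inc: Incident) -> Incident:
    # Toy rule: any service reporting errors counts as affected.
    inc.blast_radius = sorted(s for s, r in inc.error_rates.items() if r > 0)
    return inc


def diagnosis_stage(inc: Incident) -> Incident:
    # Toy rule: blame the service with the highest error rate, and use its
    # share of all observed errors as the confidence.
    suspect = max(inc.error_rates, key=inc.error_rates.get)
    total = sum(inc.error_rates.values())
    inc.root_cause = f"{suspect} is failing"
    inc.confidence = inc.error_rates[suspect] / total if total else 0.0
    inc.action = "rollback_deploy"
    return inc


def validation_stage(inc: Incident) -> Incident:
    # Automate only when the action is allowed and confidence is high enough.
    inc.action_is_safe = inc.action in SAFE_ACTIONS
    inc.escalate = (not inc.action_is_safe) or inc.confidence < ESCALATION_THRESHOLD
    return inc


def run_pipeline(inc: Incident) -> Incident:
    for stage in (context_stage, diagnosis_stage, validation_stage):
        inc = stage(inc)
    return inc


result = run_pipeline(
    Incident("INC-EXAMPLE", {"payments": 0.4, "checkout": 0.1, "search": 0.0})
)
print(result.root_cause, result.blast_radius, result.escalate)
```

Keeping each stage as a plain function over one record makes it easy to test stages in isolation and to see which stage set which field.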
Key design principle: Attention prioritization. The system surfaces exactly what you need (≤3 evidence items) and hides the noise (suppressed metrics).
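One way to picture the "≤3 evidence items" rule is a ranking step that splits evidence into a surfaced set and a suppressed set. The `Evidence` class, its `relevance` score, and the `prioritize` function are assumptions made for this sketch; how Agentel actually scores evidence is not described here.

```python
from dataclasses import dataclass

MAX_EVIDENCE = 3  # cap taken from the design principle above


@dataclass(frozen=True)
class Evidence:
    summary: str
    relevance: float  # hypothetical score in [0, 1]


def prioritize(
    items: list[Evidence], limit: int = MAX_EVIDENCE
) -> tuple[list[Evidence], list[Evidence]]:
    """Split evidence into (surfaced, suppressed), highest relevance first."""
    ranked = sorted(items, key=lambda e: e.relevance, reverse=True)
    return ranked[:limit], ranked[limit:]


surfaced, suppressed = prioritize(
    [
        Evidence("Payment errors rose after the latest deploy", 0.92),
        Evidence("Checkout latency is elevated", 0.81),
        Evidence("Database connections are saturated", 0.77),
        Evidence("Search cache hit rate dipped slightly", 0.20),
        Evidence("Nightly batch job ran long", 0.05),
    ]
)
print([e.summary for e in surfaced])
print(f"{len(suppressed)} items suppressed")
```

Returning the suppressed items rather than discarding them leaves room for a UI to reveal them on request.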
Operator-facing output: The dashboard uses on-call language (root cause, blast radius) — not raw metric or schema names. CI enforces this with system leak checks (see Evaluation).
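A system leak check can be as simple as scanning rendered UI text for terms that should never reach an operator. The blocked terms and the `find_leaks` helper below are illustrative assumptions; the project's real block list and check live in its eval harness and may work differently.

```python
import re

# Hypothetical raw metric/schema names that must not appear in operator-facing text.
BLOCKED_TERMS = ("p99_latency_ms", "span_id", "error_rate_5xx")


def find_leaks(ui_text: str, blocked: tuple[str, ...] = BLOCKED_TERMS) -> list[str]:
    """Return the blocked terms that appear as whole words in `ui_text`."""
    lowered = ui_text.lower()
    return [
        term
        for term in blocked
        if re.search(rf"\b{re.escape(term.lower())}\b", lowered)
    ]


print(find_leaks("Root cause: payment database connections exhausted"))
print(find_leaks("Alert fired because p99_latency_ms exceeded threshold"))
```

A CI test could then assert that `find_leaks(rendered_text)` is empty for every golden incident.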
Overall: Reference implementation — the core diagnosis pipeline, semantic eval harness, and Stitch-backed dashboard are in place. Live LLM diagnosis and full UI polish (rollback modal, loading morph) remain open work.
Based on agentel-spec.md v2.1.0:
| Phase | Status | Description |
|---|---|---|
| 3.0 | ✅ Complete | Runtime flow & API contract (Orchestrator.diagnose()) |
| 1 | ✅ Complete | Foundations — models, vocabulary, topology loader |
| 2 | 🚧 Partial | 3-agent core; rule-based diagnosis (OpenAI LLM path not wired) |
| 3 | ✅ Complete | Eval harness & semantic fitness tests |
| 4 | 🚧 Partial | DashboardView + Stitch templates in Streamlit; modal/rollback UX pending |
Phase 4 breakdown:
| Sub-phase | Status | Deliverable |
|---|---|---|
| 4a | ✅ Complete | DashboardView + build_dashboard_view() |
| 4b | ✅ Complete | Golden incidents (INC-2026-001, -002-partial, -003-approval) |
| 4c | 🚧 Partial | ui/stitch_renderer.py + Streamlit dashboard; runtime smoke & modal behavior pending |
CI: GitHub Actions runs pytest tests/ -v and python evals/eval_harness.py --all on every push/PR to main (Python 3.11–3.13).
Last verified: 2026-06-15 — pytest tests/ -v (20 passed) · python evals/eval_harness.py --all (3/3 passed)
Agentel includes a rigorous evaluation harness that measures:
- Root cause accuracy — Semantic similarity against expected causes
- Attention Focus Index (AFI) — Measures prioritization (target: ≥0.80)
- Confidence Calibration Error (CCE) — How well confidence matches correctness (target: ≤0.15)
- System leak checks — Ensures no blocked technical terms leak to UI
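To make the calibration metric concrete, the sketch below computes one simple form of calibration error: the mean absolute gap between stated confidence and actual correctness (1 for a correct diagnosis, 0 otherwise). This definition and the `calibration_error` function are assumptions for illustration; the harness's exact CCE formula is not given here and may differ, for example by binning confidences.

```python
CCE_TARGET = 0.15  # target from the list above


def calibration_error(results: list[tuple[float, bool]]) -> float:
    """Mean |confidence - correctness| over (confidence, was_correct) pairs."""
    if not results:
        raise ValueError("need at least one result")
    gaps = [abs(conf - (1.0 if correct else 0.0)) for conf, correct in results]
    return sum(gaps) / len(gaps)


# Illustrative data only, not measured results.
sample = [(0.9, True), (0.6, False), (0.8, True)]
cce = calibration_error(sample)
print(f"CCE={cce:.2f}, within target: {cce <= CCE_TARGET}")
```

In this made-up sample the overconfident wrong answer (0.6 on an incorrect diagnosis) dominates the error, which is the behavior a calibration target is meant to penalize.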
```bash
python evals/eval_harness.py --all --report
```

| Metric | Target |
|---|---|
| Diagnosis timeout | ≤20 seconds |
| Tokens per session | ≤40k |
| Topology traversal | Offline (no external HTTP) |
| Interactive elements (primary view) | ≤7 |
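The timeout and token targets in the table above could be enforced with small guards like the ones below. This is a minimal sketch using the standard library; `diagnose_with_deadline` and `within_token_budget` are hypothetical helpers, not functions from the Agentel codebase.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

DIAGNOSIS_TIMEOUT_S = 20.0  # from the table above
TOKEN_BUDGET = 40_000       # from the table above


def diagnose_with_deadline(diagnose, timeout_s: float = DIAGNOSIS_TIMEOUT_S):
    """Run `diagnose()` in a worker thread and return (result, timed_out)."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(diagnose)
    try:
        return future.result(timeout=timeout_s), False
    except FutureTimeout:
        # The worker thread is not killed; the caller just stops waiting for it.
        return None, True
    finally:
        pool.shutdown(wait=False)


def within_token_budget(tokens_used: int, budget: int = TOKEN_BUDGET) -> bool:
    return tokens_used <= budget


def slow_diagnosis() -> str:
    time.sleep(0.3)  # stands in for a diagnosis that overruns its deadline
    return "late answer"


print(diagnose_with_deadline(lambda: "root cause found", timeout_s=1.0))
print(diagnose_with_deadline(slow_diagnosis, timeout_s=0.05))
```

A timed-out diagnosis is a natural trigger for escalation, since the caller has no hypothesis to act on. Note that a thread-based deadline only bounds how long the caller waits; truly cancelling work would need a subprocess or cooperative cancellation.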
We welcome contributions!
Good first issues:
- Wire live OpenAI LLM calls to DiagnosisAgent (currently rule-based fallback)
- Add suppressed metrics expander to dashboard UI
- Implement rollback confirmation modal with domain vocabulary
Areas for contribution:
- Additional golden incident scenarios (data/golden_incidents/)
- Platform-specific telemetry adapters (Prometheus, Jaeger, OpenTelemetry)
- UI polish per agentel-ui-brief.md
| Document | Purpose |
|---|---|
| agentel-spec.md | Canonical specification: technical & semantic requirements |
| agentel-ui-brief.md | UX design brief for Google Stitch (compact semantics) |
MIT © 2026
Built with a focus on correctness, observability, and measurable reliability.