marcospolanco/agentel


Agentel

CI Status: In Progress · License: MIT · Python 3.11+

AI-powered incident diagnosis for on-call engineers.

Agentel is a multi-agent system that automatically triages and diagnoses production incidents across microservices platforms. It turns noisy telemetry into actionable decisions with measurable confidence.


What It Does

When a SEV-1 incident hits, you need answers fast. Agentel:

  • Identifies the root cause — What broke and why
  • Shows blast radius — Which services and users are affected
  • Recommends safe remediation — What to do, with safety checks
  • Knows when to escalate — When uncertainty is too high for automation
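The escalation rule in the last bullet can be pictured as a confidence gate. The sketch below is illustrative only: the `Diagnosis` class, the `next_step` function, and the 0.7 threshold are assumptions made for this example, not names or values taken from the Agentel codebase.

```python
from dataclasses import dataclass

# Hypothetical cut-off; this README does not state the real threshold.
ESCALATION_THRESHOLD = 0.7


@dataclass
class Diagnosis:
    root_cause: str
    confidence: float  # expected to lie in [0.0, 1.0]


def next_step(diagnosis: Diagnosis, threshold: float = ESCALATION_THRESHOLD) -> str:
    """Return 'recommend' when confidence clears the threshold, else 'escalate'."""
    if not 0.0 <= diagnosis.confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return "recommend" if diagnosis.confidence >= threshold else "escalate"
```

With these assumed values, a diagnosis at confidence 0.9 would proceed to a recommendation, while one at 0.4 would be handed to a human.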

Target: Reduce MTTR for checkout-impacting incidents by 40%.

Quick Start

# Clone and install
git clone https://github.com/marcospolanco/agentel.git
cd agentel
pip install -r requirements.txt

# Run the dashboard
streamlit run ui/dashboard.py

# Run tests
pytest tests/ -v

# Run evaluation harness
python evals/eval_harness.py --all

Demo

Dashboard: root cause identified with confidence, blast radius flow card, and recommended action.

Architecture

Agentel uses a 3-agent pipeline focused on guiding SRE attention:

| Agent | Responsibility |
| --- | --- |
| Context | Maps service dependencies, calculates blast radius |
| Diagnosis | Analyzes telemetry to identify root cause |
| Validation | Checks remediations against architectural constraints |

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Incident  │───>│  Context    │───>│  Diagnosis  │───>│ Validation  │
│   Telemetry │    │   Agent     │    │   Agent     │    │   Agent     │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
                                             │                    │
                                             ▼                    ▼
                                      ┌─────────────┐    ┌─────────────┐
                                      │  Root Cause │    │  Safety     │
                                      │  Hypothesis │    │  Check      │
                                      └─────────────┘    └─────────────┘
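To make the flow in the diagram concrete, here is a minimal, self-contained sketch of a three-stage pipeline. Everything in it is assumed for illustration: the `CALLERS` topology, the `STATEFUL` constraint, the 5% error-rate limit, and the function names. It does not mirror the signature of the project's `Orchestrator.diagnose()`, and the diagnosis rule (pick the unhealthy service with the widest blast radius) is a stand-in heuristic.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical topology: each service maps to the services that call it.
CALLERS = {
    "payments-db": ["payments"],
    "payments": ["checkout"],
    "checkout": ["web"],
    "web": [],
}
# Hypothetical architectural constraint: never auto-restart stateful services.
STATEFUL = {"payments-db"}
ERROR_RATE_LIMIT = 0.05  # assumed alerting threshold


@dataclass
class Report:
    root_cause: str
    blast_radius: list[str]
    action: str
    safe: bool


def context_agent(telemetry: dict[str, float]) -> dict[str, list[str]]:
    """Map each unhealthy service to the services impacted through its callers."""
    context = {}
    for service, error_rate in telemetry.items():
        if error_rate <= ERROR_RATE_LIMIT:
            continue
        # Breadth-first walk over callers; purely in-memory, no network access.
        seen, queue = {service}, deque([service])
        while queue:
            for caller in CALLERS.get(queue.popleft(), []):
                if caller not in seen:
                    seen.add(caller)
                    queue.append(caller)
        context[service] = sorted(seen - {service})
    return context


def diagnosis_agent(context: dict[str, list[str]]) -> str:
    """Stand-in rule: the unhealthy service with the widest blast radius."""
    return max(context, key=lambda service: len(context[service]))


def validation_agent(root_cause: str) -> tuple[str, bool]:
    """Propose a restart and check it against the stateful-service constraint."""
    return f"restart {root_cause}", root_cause not in STATEFUL


def diagnose(telemetry: dict[str, float]) -> Report:
    context = context_agent(telemetry)
    if not context:
        raise ValueError("no unhealthy services in telemetry")
    root_cause = diagnosis_agent(context)
    action, safe = validation_agent(root_cause)
    return Report(root_cause, context[root_cause], action, safe)
```

Calling `diagnose({"payments": 0.4, "checkout": 0.3, "web": 0.1, "payments-db": 0.01})` under these assumptions names `payments` as the root cause, with `checkout` and `web` in its blast radius, and marks the restart as safe.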

Key design principle: Attention prioritization. The system surfaces at most three evidence items (≤3) for a diagnosis and suppresses the remaining metrics as noise.
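One way to picture the evidence cap is a rank-and-split step. The sketch below assumes each signal carries a relevance score. The `Signal` class, the `relevance` field, and the `prioritize` function are hypothetical names for this example; only the limit of three comes from this README.

```python
from dataclasses import dataclass

MAX_EVIDENCE = 3  # cap on surfaced evidence items stated in this README


@dataclass
class Signal:
    name: str
    relevance: float  # hypothetical score; higher means more relevant


def prioritize(signals: list[Signal], limit: int = MAX_EVIDENCE):
    """Split signals into (surfaced, suppressed), ranked by relevance."""
    ranked = sorted(signals, key=lambda s: s.relevance, reverse=True)
    return ranked[:limit], ranked[limit:]
```

The suppressed list is returned rather than dropped, so a UI could still offer it behind an expander.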

Operator-facing output: The dashboard uses on-call language (root cause, blast radius) rather than raw metric or schema names. CI enforces this with system leak checks (see Evaluation).
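A leak check of this kind can be as simple as scanning rendered text for blocked identifiers. The sketch below is an assumption about how such a check might look. The `BLOCKED_TERMS` list and the `find_leaks` function are invented for this example, and the project's actual blocklist is not shown in this README.

```python
import re

# Hypothetical blocklist of raw metric or schema names.
BLOCKED_TERMS = ["p99_latency_ms", "http_5xx_rate", "span_id"]


def find_leaks(text: str, blocked_terms: list[str] = BLOCKED_TERMS) -> list[str]:
    """Return the blocked terms that appear in operator-facing text."""
    leaks = []
    for term in blocked_terms:
        # Match the term only as a whole identifier, ignoring case.
        pattern = r"(?<![A-Za-z0-9_])" + re.escape(term) + r"(?![A-Za-z0-9_])"
        if re.search(pattern, text, flags=re.IGNORECASE):
            leaks.append(term)
    return leaks
```

A test could then assert that `find_leaks(rendered_dashboard_text)` returns an empty list for every golden incident.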

Project Status

Overall: Reference implementation — the core diagnosis pipeline, semantic eval harness, and Stitch-backed dashboard are in place. Live LLM diagnosis and full UI polish (rollback modal, loading morph) remain open work.

Based on agentel-spec.md v2.1.0:

| Phase | Status | Description |
| --- | --- | --- |
| 3.0 | ✅ Complete | Runtime flow & API contract (Orchestrator.diagnose()) |
| 1 | ✅ Complete | Foundations: models, vocabulary, topology loader |
| 2 | 🚧 Partial | 3-agent core; rule-based diagnosis (OpenAI LLM path not wired) |
| 3 | ✅ Complete | Eval harness & semantic fitness tests |
| 4 | 🚧 Partial | DashboardView + Stitch templates in Streamlit; modal/rollback UX pending |

Phase 4 breakdown:

| Sub-phase | Status | Deliverable |
| --- | --- | --- |
| 4a | ✅ Complete | DashboardView + build_dashboard_view() |
| 4b | ✅ Complete | Golden incidents (INC-2026-001, -002-partial, -003-approval) |
| 4c | 🚧 Partial | ui/stitch_renderer.py + Streamlit dashboard; runtime smoke & modal behavior pending |

CI: GitHub Actions runs pytest tests/ -v and python evals/eval_harness.py --all on every push/PR to main (Python 3.11–3.13).

Last verified: 2026-06-15 — pytest tests/ -v (20 passed) · python evals/eval_harness.py --all (3/3 passed)

Evaluation

Agentel includes a rigorous evaluation harness that measures:

  • Root cause accuracy — Semantic similarity against expected causes
  • Attention Focus Index (AFI) — Measures prioritization (target: ≥0.80)
  • Confidence Calibration Error (CCE) — How well confidence matches correctness (target: ≤0.15)
  • System leak checks — Ensures no blocked technical terms leak to UI

Run it with:

python evals/eval_harness.py --all --report
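This README does not define how Confidence Calibration Error is computed. As a rough illustration of the idea, the sketch below uses a binned expected calibration error, which is one common way to compare stated confidence with observed correctness. The function name, the bin count, and the choice of formula are assumptions and may differ from what `evals/eval_harness.py` does.

```python
def calibration_error(confidences: list[float], correct: list[bool], bins: int = 5) -> float:
    """Binned calibration error: weighted mean of |avg confidence - accuracy| per bin."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    if any(not 0.0 <= c <= 1.0 for c in confidences):
        raise ValueError("confidences must be in [0, 1]")
    total = len(confidences)
    error = 0.0
    for b in range(bins):
        low, high = b / bins, (b + 1) / bins
        # The last bin is closed on the right so that 1.0 is included.
        members = [
            i for i, c in enumerate(confidences)
            if low <= c < high or (b == bins - 1 and c == 1.0)
        ]
        if not members:
            continue
        avg_confidence = sum(confidences[i] for i in members) / len(members)
        accuracy = sum(1 for i in members if correct[i]) / len(members)
        error += len(members) / total * abs(avg_confidence - accuracy)
    return error
```

Under this formula, a system that reports 0.9 confidence and is right every time in that bin contributes a gap of 0.1 for that bin, and the result would be compared against the 0.15 target listed above.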

Performance Guarantees

| Metric | Target |
| --- | --- |
| Diagnosis timeout | ≤20 seconds |
| Tokens per session | ≤40k |
| Topology traversal | Offline (no external HTTP) |
| Interactive elements (primary view) | ≤7 |
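A diagnosis timeout like the one in the table could be enforced by running the diagnosis in a worker thread and giving up after a deadline. The sketch below shows one way to do that with the standard library. `run_with_deadline` is a hypothetical helper, and returning `None` on timeout (so the caller can escalate) is an assumed convention, not documented Agentel behavior.

```python
import concurrent.futures

DIAGNOSIS_TIMEOUT_S = 20.0  # target from the table above


def run_with_deadline(fn, timeout_s: float = DIAGNOSIS_TIMEOUT_S):
    """Run fn() in a worker thread; return its result, or None if the deadline passes."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return None  # caller decides what to do, for example escalate
    finally:
        # Do not block on a slow worker; it keeps running in the background.
        executor.shutdown(wait=False)
```

Note that a thread cannot be cancelled once started, so this bounds how long the caller waits rather than how long the work runs. A process-based or async design would be needed to actually stop the work.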

Contributing

We welcome contributions!

Good first issues:

  • Wire live OpenAI LLM calls to DiagnosisAgent (currently rule-based fallback)
  • Add suppressed metrics expander to dashboard UI
  • Implement rollback confirmation modal with domain vocabulary

Areas for contribution:

  • Additional golden incident scenarios (data/golden_incidents/)
  • Platform-specific telemetry adapters (Prometheus, Jaeger, OpenTelemetry)
  • UI polish per agentel-ui-brief.md

Documentation

| Document | Purpose |
| --- | --- |
| agentel-spec.md | Canonical specification: technical & semantic requirements |
| agentel-ui-brief.md | UX design brief for Google Stitch (compact semantics) |

License

MIT © 2026


Built with a focus on correctness, observability, and measurable reliability.
