A local-first Retrieval-Augmented Generation (RAG) platform for document question answering.
Upload PDF documents, ask questions, and receive answers grounded in your documents, all running locally with no external API keys required by default.
This project is designed as a small AI platform rather than a single demo script: it includes ingestion, chunking, embeddings, vector search, reranking, query rewriting, answer generation, source attribution, and optional LLM-based analysis. The system is built with a service-oriented architecture and pluggable AI backends so core components can be swapped without changing the public APIs.
- Why This Project Stands Out
- Demo Guide
- Features
- Local AI Stack
- Implemented vs Planned
- Architecture
- Prerequisites
- Quick Start
- Local Development
- API Endpoints
- Configuration
- Development
- Project Policies
- Troubleshooting
- Project Structure
- Security and Safeguards
- Roadmap and Future Improvements
- License
- Runs fully local by default with Hugging Face (sentence-transformers), Qdrant, and llama.cpp
- Implements a complete end-to-end RAG pipeline instead of simple prompt forwarding
- Includes practical quality layers such as query rewriting, reranking, safeguards, and optional ML analysis
- Uses a modular multi-service architecture with config-driven backends
- Exposes the same platform through a React UI, Gateway API, and MCP tools for AI agents
- Demonstrates full-stack delivery with FastAPI services, Express Gateway, Docker workflow, CI, and local development docs
See docs/demo.md. It includes a short demo flow, sample prompts, and what to point out when presenting the project.
For a long-form engineering walkthrough written as an article, see docs/ARTICLE.md.
For the full documentation map, see docs/README.md.
- PDF document ingestion for searchable knowledge bases
- Raw text ingestion for automation and MCP-driven workflows
- Semantic retrieval with embeddings and vector search
- Grounded answer generation with source attribution
- Local-first question answering with no external API keys required by default
- Optional query rewriting for short or vague user questions
- Optional BGE reranking for better context selection
- Configurable retrieval depth such as TOP_K, VECTOR_SEARCH_TOP_K, and RERANK_TOP_K
- Configurable input safeguards for prompt injection and blocked topics
- Configurable output safeguards for sensitive or disallowed responses
- Optional ML service for prompt injection detection, query classification, retrieval scoring, and document classification
- Multi-service monorepo with clear service boundaries
- Python FastAPI backend services for ingestion, embedding, retrieval, RAG, and ML
- Express + TypeScript Gateway for frontend-facing API access
- React frontend for upload and chat workflows
- MCP server so AI agents can search, ask questions, and ingest content as tools
- Config-driven backends for embeddings, vector databases, and LLM providers
- Default local stack with Hugging Face (sentence-transformers), Qdrant, and llama.cpp
- Alternative and extensible backends including pgvector, openai, and Bedrock-oriented extension points
- Docker workflow for running the stack locally
- Local development guide for running services individually
- OpenAPI docs exposed by the Gateway
- Tests, linting, CI, contributing guide, and security policy
The default stack runs fully locally:
| Component | Technology |
|---|---|
| Embeddings | Hugging Face (BAAI/bge-small-en-v1.5) |
| Vector Database | Qdrant |
| LLM | Mistral 7B via llama.cpp |
No external API keys required.
| Area | Implemented now | Planned / future |
|---|---|---|
| Ingestion | PDF ingestion, raw text ingestion, chunking, embedding, vector upsert | More ingestion formats and richer preprocessing |
| Retrieval | Embeddings, Qdrant search, source attribution | Hybrid retrieval (vector + keyword) |
| Retrieval quality | Query rewriting, optional BGE reranking | More advanced rerankers and query routing |
| Safety and ML | Input safeguards, output safeguards, optional ML analysis and scoring | Stronger moderation, richer policy controls, deeper evaluation |
| LLM backends | llama.cpp and openai backends, Bedrock placeholders | Additional production-ready provider backends |
| Vector backends | qdrant and pgvector | More vector store providers if needed |
| Interfaces | React UI, Gateway API, MCP server tools | Broader agent workflows and external tool orchestration |
| Platform operations | Docker workflow, local development docs, tests, linting, OpenAPI docs | Observability, tracing, and multi-tenant support |
The system uses a service-based RAG architecture where each service is responsible for a specific part of the AI pipeline.
    flowchart TB
        subgraph clients [Clients]
            UI[React UI]
            Agent[AI Agent]
        end
        subgraph access [Access Layer]
            GW[Express Gateway]
            MCP[MCP Server]
        end
        subgraph services [Backend Services]
            Ingestion[Ingestion Service]
            Embedding[Embedding Service]
            Retrieval[Retrieval Service]
            RAG[RAG Orchestration Service]
            ML[ML Service]
        end
        subgraph external [External Components]
            Qdrant[Qdrant Vector DB]
            LLM[llama.cpp LLM Server]
        end
        UI -->|"/ingest /chat"| GW
        Agent -->|stdio tools| MCP
        GW -->|POST /ingest| Ingestion
        GW -->|POST /ask| RAG
        MCP -->|POST /ingest /ingest/text| Ingestion
        MCP -->|POST /search| Retrieval
        MCP -->|POST /ask| RAG
        Ingestion -->|POST /embed| Embedding
        Ingestion -->|POST /upsert| Retrieval
        RAG -->|POST /embed| Embedding
        RAG -->|POST /search| Retrieval
        RAG -->|POST /analyze /score| ML
        Retrieval --> Qdrant
        RAG -->|completion| LLM
        ML -->|completion| LLM

For service APIs, repository layout, data flow, and MCP integration, see docs/architecture.md.
For config-driven provider backends and extension points, see docs/backends.md.
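
The request path through the RAG service can be sketched as a single function. This is a minimal illustration under stated assumptions, not the project's implementation: the names answer_question, Chunk, and Answer are hypothetical, the backends are passed in as plain callables standing in for the Embedding service, Retrieval service, reranker, and LLM server, and the prompt wording is invented for the example. Safeguards and the optional ML analysis step are left out to keep it short.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Chunk:
    text: str
    source: str          # e.g. the file a chunk came from
    score: float = 0.0   # relevance score assigned by search or rerank


@dataclass
class Answer:
    text: str
    sources: list[str]


def answer_question(
    question: str,
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], list[Chunk]],
    rerank: Callable[[str, list[Chunk]], list[Chunk]],
    complete: Callable[[str], str],
    rewrite: Callable[[str], str] = lambda q: q,
    vector_search_top_k: int = 20,
    rerank_top_k: int = 3,
) -> Answer:
    """Hypothetical orchestration: rewrite, embed, search, rerank, generate."""
    query = rewrite(question)                               # optional query rewriting
    query_vector = embed(query)                             # Embedding service
    candidates = search(query_vector, vector_search_top_k)  # Retrieval service
    top_chunks = rerank(query, candidates)[:rerank_top_k]   # keep the best few

    # Number the chunks so the model can refer to them in its answer.
    context = "\n\n".join(f"[{i + 1}] {c.text}" for i, c in enumerate(top_chunks))
    prompt = (
        "Answer using only the context below.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

    # De-duplicate sources while preserving rank order.
    sources = list(dict.fromkeys(c.source for c in top_chunks))
    return Answer(text=complete(prompt), sources=sources)
```

Passing the backends in as callables mirrors the pluggable-backend idea: a test can supply fakes, while a real deployment would wrap HTTP calls to the services.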
- Python 3.11+
- Node.js 18+ (frontend and Gateway)
- Docker (for Qdrant)
- llama.cpp server with a Mistral 7B GGUF model
- Qdrant (vector database; runs in Docker)
Do these once after cloning the repo:
- Optional: Copy the env template: cp .env.example .env (see Optional configuration below if you need to override defaults).
- Local LLM: Run make init-llama, then download a GGUF model and place it in models/. See models/README.md for the default filename and download link.
- Then follow the steps below to start the backend, LLM, and frontend.
If you want to override defaults, create a repo-wide env file first:

    cp .env.example .env

Both Docker Compose and the Python backend services read the root .env. If you are happy with the defaults, you can skip this step.
From the project root:

    make up

This starts Gateway, Ingestion, Embedding, Retrieval, RAG, ML, and Qdrant in Docker. Data persists in the qdrant_data volume. Set LLM_URL if your llama.cpp server is not at http://localhost:8080 (e.g. on Mac/Windows use LLM_URL=http://host.docker.internal:8080).
In another terminal, run the LLM on your host (required for chat):
If this is your first run, set up the local LLM runtime once with make init-llama and place a GGUF model in models/. See models/README.md for the expected filename and alternatives.

    make llm

Then start the frontend:

    make frontend

Open http://localhost:5173. The Gateway API is at http://localhost:8000.
To rebuild images after code changes: make build then make up. To stop containers but keep data: make down. To remove containers and the volume: make down-vol.
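
When scripting this startup sequence, it can help to wait until the Gateway answers on /health before sending requests. The following is a small sketch using only the Python standard library; the function names, timeouts, and the injectable probe parameter are assumptions made for the example, and it treats any HTTP 200 response as healthy without inspecting the body.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, timeouts and refused connections all
        # derive from OSError; treat them as "not ready yet".
        return False


def wait_until_healthy(
    url: str = "http://localhost:8000/health",
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    probe: Callable[[str], bool] = http_ok,
) -> bool:
    """Poll `url` until the probe succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


if __name__ == "__main__":
    ready = wait_until_healthy()
    print("Gateway is ready" if ready else "Gateway did not become healthy in time")
```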
Run services on your machine with only Qdrant in Docker. The Gateway runs in Node.js; the other backend services run in Python.
First-time setup: Create the backend virtual environment and install dependencies (see Local development → “Backend Python environment”). Without this, make run-backends will fail with “venv/bin/uvicorn: No such file or directory”.
Order:

    make qdrant
    make run-embedding
    make run-retrieval
    make run-ingestion
    make run-rag
    make run-ml   # optional; needed only when ML_SERVICE_ENABLED=true
    make run-gateway
    make llm
    make frontend

For the full step-by-step, prerequisites, and port table, see Local development.
| Endpoint | Method | Description |
|---|---|---|
| /chat/ | POST | Ask a question; returns answer and sources |
| /ingest/ | POST | Upload a PDF |
| /health | GET | Health check |
OpenAPI docs: When the Gateway is running, open http://localhost:8000/openapi/docs to explore and try the endpoints. The spec is at http://localhost:8000/openapi.json.
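
As a rough illustration of calling the chat endpoint from a script, the sketch below builds a POST request to /chat/ on the Gateway. The request body shape is an assumption: the JSON field name "question" is hypothetical and is not confirmed by this README, so check the OpenAPI spec for the actual schema. The helper names are also invented for the example.

```python
import json
import urllib.request

GATEWAY_URL = "http://localhost:8000"


def build_chat_request(question: str, base_url: str = GATEWAY_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /chat/ endpoint."""
    # ASSUMPTION: the field name "question" is illustrative only;
    # consult /openapi.json for the real request schema.
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/chat/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, base_url: str = GATEWAY_URL, timeout_s: float = 120.0) -> dict:
    """Send the question and return the decoded JSON response."""
    request = build_chat_request(question, base_url)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)


if __name__ == "__main__":
    # Requires the Gateway, backend services, and LLM to be running.
    print(ask("What does the uploaded document say about pricing?"))
```

The generous default timeout reflects that local LLM generation can be slow; adjust it to your hardware.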
For a repo-wide configuration file used by Docker Compose and the Python services, copy the root template:

    cp .env.example .env

The root .env is the recommended place to set values such as LLM_URL, ML_SERVICE_ENABLED, SAFEGUARD_ENABLED, and provider-specific settings. The file backend/.env.example is kept as a backend-focused reference, but the root .env is the primary config file for this repository.
| Variable | Default | Description |
|---|---|---|
| LLM_URL | http://localhost:8080 | llama.cpp server |
| QDRANT_HOST | localhost | Qdrant host |
| QDRANT_PORT | 6333 | Qdrant port |
| COLLECTION_NAME | documents | Vector collection |
| TOP_K | 3 | Number of retrieved chunks |
| RERANKER_PROVIDER | bge (use none to disable) | Optional reranker; BGE cross-encoder |
| VECTOR_SEARCH_TOP_K | 20 | Candidates fetched when reranker enabled |
| RERANK_TOP_K | 3 | Chunks passed to LLM after rerank |
| EMBEDDING_BACKEND | huggingface (alias: local) | Embedding backend; see docs/huggingface.md |
| EMBEDDING_MODEL | BAAI/bge-small-en-v1.5 | Hugging Face model id (e.g. author/repo) |
| EMBEDDING_NORMALIZE | true | Normalize embeddings (recommended for cosine similarity) |
| EMBEDDING_DEVICE | (unset) | Optional device for embeddings (e.g. cpu, cuda) |
| EMBEDDING_BATCH_SIZE | 32 | Batch size for embedding requests |
| EMBEDDING_MAX_LENGTH | (unset) | Optional truncation length for embedding model input |
| CHUNK_SIZE | 800 | Chunk size |
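
To show how the retrieval-depth variables in the table relate to each other, here is a small sketch that reads them from the environment with the documented defaults. The class and function names are hypothetical, and one behavior is an assumption rather than something this README states: when the reranker is disabled, the sketch falls back to TOP_K for both the number of candidates fetched and the number of chunks sent to the LLM.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RetrievalSettings:
    top_k: int
    reranker_provider: str
    vector_search_top_k: int
    rerank_top_k: int

    @property
    def reranker_enabled(self) -> bool:
        # The table documents "none" as the value that disables reranking.
        return self.reranker_provider.lower() != "none"

    @property
    def candidates_to_fetch(self) -> int:
        # ASSUMPTION: without a reranker, TOP_K chunks are fetched directly.
        return self.vector_search_top_k if self.reranker_enabled else self.top_k

    @property
    def chunks_for_llm(self) -> int:
        # ASSUMPTION: without a reranker, all TOP_K chunks go to the LLM.
        return self.rerank_top_k if self.reranker_enabled else self.top_k


def load_retrieval_settings(env: Mapping[str, str] = os.environ) -> RetrievalSettings:
    """Read retrieval settings, using the defaults from the table above."""
    return RetrievalSettings(
        top_k=int(env.get("TOP_K", "3")),
        reranker_provider=env.get("RERANKER_PROVIDER", "bge"),
        vector_search_top_k=int(env.get("VECTOR_SEARCH_TOP_K", "20")),
        rerank_top_k=int(env.get("RERANK_TOP_K", "3")),
    )
```

With the defaults, this yields 20 candidates fetched from the vector store and 3 chunks passed to the LLM after reranking.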
Re-embedding note: If you change EMBEDDING_MODEL or EMBEDDING_NORMALIZE, you must delete/version existing vectors and re-run ingestion. Also ensure the Retrieval service VECTOR_SIZE matches the embedding dimension.
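
The reasoning behind this note can be illustrated in a few lines of plain Python. Once vectors are L2-normalized, their dot product equals their cosine similarity, so mixing normalized and unnormalized vectors in one collection makes scores incomparable. The helper names below are hypothetical; the dimension 384 is the output size of BAAI/bge-small-en-v1.5, and a different model would need a different VECTOR_SIZE.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector: Sequence[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(vector, vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def check_vector_size(embedding: Sequence[float], vector_size: int) -> None:
    """Fail early if an embedding does not match the collection dimension."""
    if len(embedding) != vector_size:
        raise ValueError(
            f"embedding has {len(embedding)} dimensions, "
            f"but VECTOR_SIZE is {vector_size}"
        )


if __name__ == "__main__":
    a, b = [3.0, 4.0], [4.0, 3.0]
    print(dot(a, b))                              # raw dot product depends on vector length
    print(dot(l2_normalize(a), l2_normalize(b)))  # matches cosine_similarity(a, b)
    check_vector_size([0.0] * 384, 384)           # passes for bge-small-en-v1.5
```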
From the project root, run make test to run backend, gateway, and MCP tests. For individual suites: make test-backend, make test-gateway, make test-mcp. See the Makefile for details.

    make lint
    make test
    cd frontend && npm run build

RAG container exits at startup (Docker): Rebuild the RAG image so the in-container layout and services package are correct: docker compose build --no-cache rag, then docker compose up rag. If it still fails, run docker compose logs rag and check the last lines of the traceback.
NumPy/PyTorch: If the API fails to start with a NumPy/PyTorch compatibility error (e.g. "A module that was compiled using NumPy 1.x cannot be run in NumPy 2.x"), reinstall dependencies with the venv activated: pip install -r requirements.txt. The project pins numpy>=1.24,<2 for compatibility with PyTorch and sentence-transformers.
NumPy 2.x upgrade path: The project currently pins NumPy to <2 for compatibility with sentence-transformers and PyTorch. When those dependencies support NumPy 2.x, you can upgrade by: (1) updating the numpy constraint to numpy>=2 (or removing the upper bound) in all backend requirement files that reference it (backend/requirements.txt, backend/services/embedding/requirements.txt, backend/services/retrieval/requirements.txt); (2) running make test and manual smoke tests (embedding, reranker, RAG) to confirm nothing breaks. See each dependency’s release notes for compatibility status.
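
To check quickly whether an environment satisfies the current pin, a sketch like the following can be run inside the venv. It is only an illustration: the function names are hypothetical, and it compares just the major and minor version numbers against the numpy>=1.24,<2 range instead of implementing full version-specifier parsing.

```python
import re
from importlib import metadata
from typing import Optional


def satisfies_numpy_pin(version: str) -> bool:
    """Return True if `version` falls inside the numpy>=1.24,<2 range."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (1, 24) and major < 2


def installed_numpy_version() -> Optional[str]:
    """Return the installed NumPy version, or None if NumPy is missing."""
    try:
        return metadata.version("numpy")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_numpy_version()
    if version is None:
        print("NumPy is not installed in this environment")
    elif satisfies_numpy_pin(version):
        print(f"NumPy {version} satisfies the pin")
    else:
        print(f"NumPy {version} is outside numpy>=1.24,<2; reinstall requirements")
```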
document_rag/
├── backend/
│ ├── shared/ # Chunker, PDF parser, prompt builder
│ ├── services/
│ │ ├── gateway/
│ │ ├── ingestion/
│ │ ├── embedding/
│ │ ├── retrieval/
│ │ ├── rag/
│ │ └── ml/
│ ├── scripts/
│ ├── tests/
│ └── requirements.txt
│
├── frontend/
├── mcp_service/
│
├── docs/
│ ├── README.md
│ ├── ARTICLE.md
│ ├── architecture.md
│ ├── backends.md
│ ├── demo.md
│ ├── local-development.md
│ ├── mcp.md
│ ├── ml_service.md
│ ├── query_rewriter.md
│ └── safeguards.md
│
├── docker-compose.yml
├── Makefile
└── README.md
The RAG pipeline is protected by configurable input and output safeguards that run inside the RAG service. They block prompt injection attempts, disallowed topics, and sensitive content in responses. Safeguards can be enabled/disabled and the provider (e.g. pattern-based basic) can be swapped via environment variables. All rules are centralized in backend/shared/safeguard_constants.py. See docs/safeguards.md for configuration and how to add new safeguard providers.
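
A pattern-based safeguard of the kind described above might look roughly like the sketch below. Everything in it is an assumption made for illustration: the patterns, the blocked topic, the reason strings, and the function names are hypothetical and are not taken from backend/shared/safeguard_constants.py. The enabled flag stands in for a setting such as SAFEGUARD_ENABLED.

```python
import re
from dataclasses import dataclass

# ASSUMPTION: illustrative rules only; the project's real rules live in
# backend/shared/safeguard_constants.py and may look quite different.
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior)\s+instructions", re.IGNORECASE),
    re.compile(r"reveal\s+(your\s+)?system\s+prompt", re.IGNORECASE),
]
BLOCKED_TOPICS = ["medical diagnosis"]
SENSITIVE_OUTPUT_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # looks like a US social security number
]


@dataclass(frozen=True)
class SafeguardResult:
    allowed: bool
    reason: str = ""


def check_input(question: str, enabled: bool = True) -> SafeguardResult:
    """Run before retrieval: reject likely injections and blocked topics."""
    if not enabled:
        return SafeguardResult(True)
    if any(p.search(question) for p in INJECTION_PATTERNS):
        return SafeguardResult(False, "prompt_injection")
    lowered = question.lower()
    if any(topic in lowered for topic in BLOCKED_TOPICS):
        return SafeguardResult(False, "blocked_topic")
    return SafeguardResult(True)


def check_output(answer: str, enabled: bool = True) -> SafeguardResult:
    """Run after generation: reject answers containing sensitive content."""
    if not enabled:
        return SafeguardResult(True)
    if any(p.search(answer) for p in SENSITIVE_OUTPUT_PATTERNS):
        return SafeguardResult(False, "sensitive_content")
    return SafeguardResult(True)
```

Simple patterns like these are easy to audit but also easy to evade, which is one reason AI-based or moderation-API providers are attractive as a complement.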
Future versions may add:
- role-based access control
- PII detection during ingestion
- AI-based or moderation-API safeguard providers
Planned improvements include:
- Hybrid retrieval (vector + keyword search)
- Conversation memory
- Query routing or orchestration to decide between direct LLM use, RAG retrieval, or external tools
- Observability and evaluation for latency, retrieval quality, answer quality, and hallucination tracking
- Multi-tenant document collections
- Support for additional AI providers
- Stronger policy and security layers such as role-based access control and richer safeguard providers
MIT


