An open-source TypeScript starter for a real-time voice agent.
It captures microphone audio in the browser, streams it to a Node.js/Express backend, transcribes speech with Sarvam STT, retrieves local Markdown knowledge, asks Gemini for a grounded answer, and returns speech with Sarvam TTS.
This repo is intentionally small and readable. It is meant to be a practical LLM-engineering portfolio project, not a full production platform.
Current features:
- Real-time microphone capture in the browser
- WebSocket audio streaming to Express
- Sarvam streaming STT
- Unicode-safe transcript harness
- In-memory session memory
- Local Markdown RAG from data/
- Optional Gemini online lookup when saved knowledge contains URLs or coordinates
- Gemini answer generation
- Sarvam TTS audio response
- STT-only diagnostic page at /stt-test
Not included yet:
- Authentication
- Database
- Vector database
- Graph database
- Docker
- Frontend framework
- Persistent memory
Browser microphone
-> WebSocket
-> Sarvam STT
-> Harness
-> Memory
-> Local Markdown RAG
-> Gemini
-> Sarvam TTS
-> Audio response
The harness keeps responses grounded and natural. If the knowledge base contains a map link or coordinates, the assistant should explain the place like a person would: main area, city, nearby landmark, and simple context when available.
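The flow above can be sketched as a single function that runs one conversational turn. This is a minimal illustration, not the repo's actual orchestration code: the `Stages` interface, `cleanTranscript`, `buildPrompt`, and `runTurn` are hypothetical names, and the prompt wording and NFC normalization are assumptions about what a grounded, Unicode-safe harness might do.

```typescript
// Hypothetical stage signatures. The real adapters in src/services/ may differ.
interface Stages {
  transcribe(audio: Uint8Array): Promise<string>; // stands in for Sarvam STT
  retrieve(query: string): Promise<string[]>; // stands in for local Markdown RAG
  generate(prompt: string): Promise<string>; // stands in for Gemini
  synthesize(text: string): Promise<Uint8Array>; // stands in for Sarvam TTS
}

interface Turn {
  user: string;
  assistant: string;
}

// Harness step (assumed): normalize to NFC and collapse whitespace.
// Non-ASCII characters are left untouched.
function cleanTranscript(raw: string): string {
  return raw.normalize("NFC").replace(/\s+/g, " ").trim();
}

// Harness step (assumed): build a prompt that restricts the model to retrieved context.
function buildPrompt(question: string, context: string[], history: Turn[]): string {
  const past = history
    .map((t) => `User: ${t.user}\nAssistant: ${t.assistant}`)
    .join("\n");
  const parts = [
    "Answer using only the context below. If it is not covered, say so.",
    `Context:\n${context.join("\n---\n")}`,
    past ? `Conversation so far:\n${past}` : "",
    `User: ${question}`,
  ];
  return parts.filter((p) => p.length > 0).join("\n\n");
}

// One turn: audio in, text and audio out. History is updated in place.
async function runTurn(
  audio: Uint8Array,
  history: Turn[],
  stages: Stages,
): Promise<{ text: string; audio: Uint8Array }> {
  const question = cleanTranscript(await stages.transcribe(audio));
  const context = await stages.retrieve(question);
  const text = await stages.generate(buildPrompt(question, context, history));
  history.push({ user: question, assistant: text });
  return { text, audio: await stages.synthesize(text) };
}
```

Passing the stages in as an object keeps provider details out of the pipeline, which also makes it easy to run the flow with fake stages in a test.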
data/
  knowledge-base.md   Local knowledge used by RAG
public/
  index.html          Main voice UI
  app.js              Browser microphone + WebSocket client
  pcm-worklet.js      AudioWorklet PCM capture
  stt-test.html       STT diagnostic page
src/
  index.ts            Express + WebSocket server
  harness/            Transcript cleanup and prompt building
  memory/             In-memory session and audio storage
  orchestration/      Voice-agent pipeline
  rag/                Local Markdown retriever
  services/           Sarvam, Gemini, and lookup adapters
  types.ts            Shared TypeScript types
- Node.js 20+
- Sarvam API key
- Gemini API key
- A browser with microphone access
Install dependencies:
npm install
Create your environment file:
copy .env.example .env
On macOS/Linux:
cp .env.example .env
Fill in .env:
PORT=3000
SARVAM_API_KEY=your_sarvam_key
SARVAM_STT_LANGUAGE_CODE=en-IN
GEMINI_API_KEY=your_gemini_key
GEMINI_MODEL=gemini-3.5-flash
GEMINI_FALLBACK_MODEL=gemini-2.5-flash
WEB_SEARCH_ENABLED=true
Never commit .env. Use .env.example for safe public configuration.
npm start
Open:
http://localhost:3000
Click the microphone button, speak, then stop recording.
For speech-to-text debugging without Gemini or TTS:
http://localhost:3000/stt-test
This page shows connection state, audio chunk telemetry, raw provider events, partial transcripts, final transcripts, and detected language.
Put project or domain knowledge in:
data/knowledge-base.md
The retriever reads supported files inside data/:
- .md
- .mdx
- .txt
Keep knowledge simple and factual. For example:
TinkerSpace location: 21/258, Seaport-Airport Road, Vidya Nagar Colony, Kalamassery, Kochi, Kerala. It is near the HMT/Thrikkakara side of Kochi.
If you store a URL, the assistant can use online lookup to understand it and answer naturally instead of dumping the raw link.
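To make the retrieval step concrete, here is a minimal keyword-overlap sketch. It is an assumption about how a local retriever could work, not the code in src/rag/: the names `Doc`, `Hit`, `isSupported`, `tokenize`, and `retrieve` are hypothetical, documents are passed in memory rather than read from data/, and the scoring is a plain count of shared terms per paragraph.

```typescript
interface Doc {
  path: string;
  text: string;
}

interface Hit {
  path: string;
  chunk: string;
  score: number;
}

// Extensions listed in this README as supported by the retriever.
const SUPPORTED_EXTENSIONS = [".md", ".mdx", ".txt"];

function isSupported(path: string): boolean {
  const lower = path.toLowerCase();
  return SUPPORTED_EXTENSIONS.some((ext) => lower.endsWith(ext));
}

// Unicode-aware tokenizer: keeps letters and digits from any script.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[\p{L}\p{N}]+/gu) ?? [];
}

// Split each supported document into paragraphs and rank them by how many
// distinct query terms they contain. Paragraphs with no overlap are dropped.
function retrieve(query: string, docs: Doc[], topK = 3): Hit[] {
  const terms = tokenize(query).filter((t, i, all) => all.indexOf(t) === i);
  const hits: Hit[] = [];
  for (const doc of docs.filter((d) => isSupported(d.path))) {
    for (const raw of doc.text.split(/\n\s*\n/)) {
      const chunk = raw.trim();
      if (chunk.length === 0) continue;
      const words = new Set(tokenize(chunk));
      const score = terms.filter((t) => words.has(t)).length;
      if (score > 0) hits.push({ path: doc.path, chunk, score });
    }
  }
  return hits.sort((a, b) => b.score - a.score).slice(0, topK);
}
```

Short, factual paragraphs work well with this kind of scoring because each paragraph becomes one retrievable chunk. A larger collection would likely need something stronger, such as embeddings.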
npm start # run the app
npm run dev # run with tsx watch
npm run typecheck # verify TypeScript

| Variable | Required | Purpose |
|---|---|---|
| PORT | No | Express port. Defaults to 3000. |
| SARVAM_API_KEY | Yes | Sarvam STT/TTS API key. |
| SARVAM_STT_WS_URL | No | Diagnostic direct WebSocket URL. |
| SARVAM_STT_LANGUAGE_CODE | No | STT language code. Defaults to en-IN. |
| GEMINI_API_KEY | Yes | Gemini API key. |
| GEMINI_MODEL | No | Primary Gemini model. |
| GEMINI_FALLBACK_MODEL | No | Fallback model for temporary model errors. |
| GEMINI_MAX_RETRIES | No | Retry count for temporary Gemini failures. |
| AUDIO_TTL_MS | No | How long generated audio stays in memory. |
| WEB_SEARCH_ENABLED | No | Set false to disable online lookup. |
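The variables above could be read and validated at startup along these lines. This is a sketch under stated assumptions rather than the repo's loader: `AppConfig`, `Env`, `requireVar`, and `loadConfig` are hypothetical names, only the defaults documented in the table (3000 and en-IN) are applied, and treating an unset WEB_SEARCH_ENABLED as enabled is an assumption.

```typescript
interface AppConfig {
  port: number;
  sarvamApiKey: string;
  sttLanguageCode: string;
  geminiApiKey: string;
  geminiModel: string | undefined;
  geminiFallbackModel: string | undefined;
  webSearchEnabled: boolean;
}

type Env = Record<string, string | undefined>;

// Fail fast with a clear message when a required key is missing or blank.
function requireVar(env: Env, name: string): string {
  const value = (env[name] ?? "").trim();
  if (value.length === 0) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

function loadConfig(env: Env): AppConfig {
  const port = Number(env["PORT"] ?? "3000"); // documented default
  if (!Number.isInteger(port) || port <= 0) {
    throw new Error(`Invalid PORT: ${env["PORT"]}`);
  }
  return {
    port,
    sarvamApiKey: requireVar(env, "SARVAM_API_KEY"),
    sttLanguageCode: env["SARVAM_STT_LANGUAGE_CODE"] ?? "en-IN", // documented default
    geminiApiKey: requireVar(env, "GEMINI_API_KEY"),
    geminiModel: env["GEMINI_MODEL"],
    geminiFallbackModel: env["GEMINI_FALLBACK_MODEL"],
    // Assumption: lookup stays on unless the value is explicitly "false".
    webSearchEnabled: (env["WEB_SEARCH_ENABLED"] ?? "").toLowerCase() !== "false",
  };
}
```

In a Node.js process this would typically be called once as `loadConfig(process.env)`. Taking the environment as a parameter keeps the function easy to test.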
- Microphone audio is streamed to the backend and Sarvam STT.
- Input audio is not written to disk.
- Generated TTS audio is stored briefly in memory and expires automatically.
- Session memory is in-memory only and resets when the server restarts.
- Do not commit .env, API keys, logs, uploads, or generated audio.
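The "stored briefly in memory and expires automatically" behaviour can be illustrated with a small time-to-live map. This is a hedged sketch, not the code in src/memory/: the `AudioStore` class and its methods are hypothetical, and the clock is injectable purely so expiry can be tested without waiting.

```typescript
interface AudioEntry {
  audio: Uint8Array;
  expiresAt: number; // epoch milliseconds
}

class AudioStore {
  private readonly entries = new Map<string, AudioEntry>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // ttlMs plays the role of AUDIO_TTL_MS. The clock defaults to wall time.
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  put(id: string, audio: Uint8Array): void {
    this.entries.set(id, { audio, expiresAt: this.now() + this.ttlMs });
  }

  // Returns the audio if it is still fresh. Expired entries are dropped on read.
  get(id: string): Uint8Array | undefined {
    const entry = this.entries.get(id);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(id);
      return undefined;
    }
    return entry.audio;
  }

  // Removes every expired entry and returns how many were removed.
  // Intended to be called periodically, for example from a timer.
  sweep(): number {
    const current = this.now();
    let removed = 0;
    this.entries.forEach((entry, id) => {
      if (current >= entry.expiresAt) {
        this.entries.delete(id);
        removed += 1;
      }
    });
    return removed;
  }
}
```

Because nothing is written to disk, restarting the server clears the store, which matches the in-memory behaviour described in the list above.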
Contributions are welcome. Please keep the project beginner-friendly:
- Prefer small, readable modules.
- Avoid adding databases or infrastructure unless the issue requires it.
- Run npm run typecheck before opening a pull request.
- Keep provider-specific details inside src/services/.
- Do not commit secrets or generated files.
- Better URL and map-link understanding
- RAG over larger Markdown collections
- Optional embeddings and vector DB
- Web search citations
- Persistent memory
- Authentication
- Deployment recipe
MIT. See LICENSE.