Indic Voice Agent

An open-source TypeScript starter for a real-time voice agent.

It captures microphone audio in the browser, streams it to a Node.js/Express backend, transcribes speech with Sarvam STT, retrieves local Markdown knowledge, asks Gemini for a grounded answer, and returns speech with Sarvam TTS.

What this project is

This repo is intentionally small and readable. It is meant to be a practical LLM-engineering portfolio project, not a full production platform.

Current features:

Real-time microphone capture in the browser
WebSocket audio streaming to Express
Sarvam streaming STT
Unicode-safe transcript harness
In-memory session memory
Local Markdown RAG from data/
Optional Gemini online lookup when saved knowledge contains URLs or coordinates
Gemini answer generation
Sarvam TTS audio response
STT-only diagnostic page at /stt-test
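
To make the "Unicode-safe transcript harness" item concrete, here is a minimal sketch of what such cleanup could look like. The function name `cleanTranscript` and the 500-code-point limit are assumptions for illustration, not the actual code in src/harness/.

```typescript
// Hypothetical helper; the real harness in src/harness/ may work differently.
function cleanTranscript(raw: string, maxCodePoints = 500): string {
  const cleaned = raw
    // Compose base characters and combining marks into a consistent form.
    .normalize("NFC")
    // Drop ASCII control characters other than tab, newline, and carriage return.
    // ZWJ/ZWNJ (U+200D/U+200C) are kept on purpose, because several Indic
    // scripts use them to control how conjuncts are shaped.
    .replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]/g, "")
    // Collapse runs of whitespace (including newlines) into single spaces.
    .replace(/\s+/g, " ")
    .trim();

  // Array.from splits by code point, so a surrogate pair is never cut in half.
  const codePoints = Array.from(cleaned);
  return codePoints.length <= maxCodePoints
    ? cleaned
    : codePoints.slice(0, maxCodePoints).join("");
}
```

Truncating by code point rather than by `string.length` avoids producing a broken character at the cut. It does not guarantee that a grapheme cluster (for example, a consonant plus its vowel sign) stays intact.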

Not included yet:

Authentication
Database
Vector database
Graph database
Docker
Frontend framework
Persistent memory

Architecture

Browser microphone
  -> WebSocket
  -> Sarvam STT
  -> Harness
  -> Memory
  -> Local Markdown RAG
  -> Gemini
  -> Sarvam TTS
  -> Audio response
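
The flow above can be sketched as a single function that runs one conversational turn. This is an illustration under stated assumptions: the `Stages` interface, the `Turn` type, and `runTurn` are hypothetical names, and each stage is injected so the sketch does not depend on any real Sarvam or Gemini API.

```typescript
// Every name and signature below is hypothetical; the real adapters live in src/services/.
interface Turn {
  user: string;
  assistant: string;
}

interface Stages {
  transcribe(audio: Uint8Array): Promise<string>; // Sarvam STT
  clean(transcript: string): string; // harness cleanup
  retrieve(query: string): Promise<string[]>; // local Markdown RAG
  buildPrompt(question: string, passages: string[], history: Turn[]): string; // harness
  answer(prompt: string): Promise<string>; // Gemini
  speak(text: string): Promise<Uint8Array>; // Sarvam TTS
}

async function runTurn(
  audio: Uint8Array,
  history: Turn[],
  stages: Stages
): Promise<{ text: string; audio: Uint8Array }> {
  // Speech to text, then transcript cleanup.
  const question = stages.clean(await stages.transcribe(audio));
  // Look up supporting passages in the local knowledge base.
  const passages = await stages.retrieve(question);
  // Combine question, passages, and earlier turns into one prompt.
  const prompt = stages.buildPrompt(question, passages, history);
  const text = await stages.answer(prompt);
  // Remember this turn so later questions can refer back to it.
  history.push({ user: question, assistant: text });
  return { text, audio: await stages.speak(text) };
}
```

Injecting the stages keeps provider-specific details out of the orchestration code and makes the pipeline easy to exercise with fakes.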

The harness keeps responses grounded in the retrieved knowledge and natural to listen to. If the knowledge base contains a map link or coordinates, the assistant should explain the place like a person would: main area, city, nearby landmark, and simple context when available.
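
One way to express that behaviour is in the prompt itself. The sketch below is illustrative only: `buildGroundedPrompt` is a hypothetical name and the instruction wording is an assumption, since the actual prompt in src/harness/ is not reproduced in this README.

```typescript
// Hypothetical prompt builder; the real prompt text in src/harness/ may differ.
function buildGroundedPrompt(question: string, passages: string[]): string {
  // Number the passages so the model can tell them apart.
  const knowledge =
    passages.length > 0
      ? passages.map((p, i) => `[${i + 1}] ${p}`).join("\n")
      : "(no matching knowledge found)";

  return [
    "You are a voice assistant. Answer in one or two short spoken sentences.",
    "Use only the knowledge below. If it does not contain the answer, say you do not know.",
    "If the knowledge contains a map link or coordinates, do not read them aloud.",
    "Describe the place instead: main area, city, nearby landmark, and simple context when available.",
    "",
    "Knowledge:",
    knowledge,
    "",
    `Question: ${question}`,
  ].join("\n");
}
```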

Project structure

data/
  knowledge-base.md        Local knowledge used by RAG
public/
  index.html               Main voice UI
  app.js                   Browser microphone + WebSocket client
  pcm-worklet.js           AudioWorklet PCM capture
  stt-test.html            STT diagnostic page
src/
  index.ts                 Express + WebSocket server
  harness/                 Transcript cleanup and prompt building
  memory/                  In-memory session and audio storage
  orchestration/           Voice-agent pipeline
  rag/                     Local Markdown retriever
  services/                Sarvam, Gemini, and lookup adapters
  types.ts                 Shared TypeScript types
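
For readers new to AudioWorklet capture: the Web Audio API delivers samples as 32-bit floats in the range -1 to 1, and streaming speech APIs commonly expect 16-bit PCM. The sketch below shows that conversion. Whether pcm-worklet.js does exactly this, and where in the pipeline it happens, is an assumption; `floatTo16BitPcm` is a hypothetical name.

```typescript
// Hypothetical conversion helper; not taken from public/pcm-worklet.js.
function floatTo16BitPcm(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range samples saturate instead of wrapping around.
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Negative values scale to -32768, positive values to 32767.
    out[i] = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
  }
  return out;
}
```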

Requirements

Node.js 20+
Sarvam API key
Gemini API key
A browser with microphone access

Setup

Install dependencies:

npm install

Create your environment file. On Windows:

copy .env.example .env

On macOS/Linux:

cp .env.example .env

Fill in .env:

PORT=3000
SARVAM_API_KEY=your_sarvam_key
SARVAM_STT_LANGUAGE_CODE=en-IN
GEMINI_API_KEY=your_gemini_key
GEMINI_MODEL=gemini-3.5-flash
GEMINI_FALLBACK_MODEL=gemini-2.5-flash
WEB_SEARCH_ENABLED=true

Never commit .env. Use .env.example for safe public configuration.
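
As an illustration of how these values might be read at startup, here is a small sketch. `AppConfig` and `loadConfig` are hypothetical names, and treating an unset `WEB_SEARCH_ENABLED` as enabled is an assumption. The defaults for `PORT` and `SARVAM_STT_LANGUAGE_CODE` follow the Environment variables section.

```typescript
// Hypothetical config loader; the real startup code in src/index.ts may differ.
interface AppConfig {
  port: number;
  sarvamApiKey: string;
  sttLanguageCode: string;
  geminiApiKey: string;
  webSearchEnabled: boolean;
}

function loadConfig(env: Record<string, string | undefined>): AppConfig {
  // Fail fast with a clear message when a required key is missing or blank.
  const required = (name: string): string => {
    const value = (env[name] ?? "").trim();
    if (!value) throw new Error(`Missing required environment variable: ${name}`);
    return value;
  };

  const port = Number(env["PORT"] ?? "3000");
  if (!Number.isInteger(port) || port <= 0 || port > 65535) {
    throw new Error(`Invalid PORT: ${env["PORT"]}`);
  }

  return {
    port,
    sarvamApiKey: required("SARVAM_API_KEY"),
    sttLanguageCode: (env["SARVAM_STT_LANGUAGE_CODE"] ?? "").trim() || "en-IN",
    geminiApiKey: required("GEMINI_API_KEY"),
    // Assumption: online lookup stays on unless the value is exactly "false".
    webSearchEnabled: (env["WEB_SEARCH_ENABLED"] ?? "").trim().toLowerCase() !== "false",
  };
}
```

In a Node.js process this would be called as `loadConfig(process.env)` after the .env file has been loaded.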

Run

npm start

Open:

http://localhost:3000

Click the microphone button, speak, then stop recording.

STT diagnostic page

For speech-to-text debugging without Gemini or TTS:

http://localhost:3000/stt-test

This page shows connection state, audio chunk telemetry, raw provider events, partial transcripts, final transcripts, and detected language.

Local knowledge

Put project or domain knowledge in:

data/knowledge-base.md

The retriever reads supported files inside data/:

.md
.mdx
.txt
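
For intuition, a local retriever without embeddings can be as simple as keyword overlap over paragraphs. The sketch below is not the code in src/rag/: the names `KnowledgeFile`, `Passage`, `isSupported`, `tokenize`, and `retrieve` are hypothetical, and the scoring method is an assumption. Reading the files from data/ is left out so the example stays self-contained.

```typescript
// Hypothetical keyword retriever; src/rag/ may chunk and score differently.
interface KnowledgeFile {
  path: string;
  text: string;
}

interface Passage {
  path: string;
  text: string;
  score: number;
}

const SUPPORTED_EXTENSIONS = [".md", ".mdx", ".txt"];

function isSupported(path: string): boolean {
  const lower = path.toLowerCase();
  return SUPPORTED_EXTENSIONS.some((ext) => lower.endsWith(ext));
}

// Split on whitespace and common punctuation, including the Devanagari danda marks.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[\s.,;:!?()[\]{}"'\/\\|#*_<>\u0964\u0965-]+/)
    .filter((token) => token.length > 1);
}

function retrieve(query: string, files: KnowledgeFile[], topK = 3): Passage[] {
  const queryTokens = new Set(tokenize(query));
  const passages: Passage[] = [];
  for (const file of files) {
    if (!isSupported(file.path)) continue;
    // Treat each blank-line-separated paragraph as one passage.
    for (const block of file.text.split(/\n\s*\n/)) {
      const text = block.trim();
      if (!text) continue;
      const blockTokens = new Set(tokenize(text));
      // Score = number of distinct query words that appear in the passage.
      let score = 0;
      queryTokens.forEach((token) => {
        if (blockTokens.has(token)) score++;
      });
      if (score > 0) passages.push({ path: file.path, text, score });
    }
  }
  return passages.sort((a, b) => b.score - a.score).slice(0, topK);
}
```

Plain word overlap has no stop-word handling or stemming, so it suits a small, factual knowledge base better than a large collection.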

Keep knowledge simple and factual. For example:

TinkerSpace location: 21/258, Seaport-Airport Road, Vidya Nagar Colony, Kalamassery, Kochi, Kerala. It is near the HMT/Thrikkakara side of Kochi.

If you store a URL, the assistant can use online lookup to understand it and answer naturally instead of dumping the raw link.
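
Deciding when a lookup is worthwhile could be done by scanning the retrieved text for links or coordinate pairs. The sketch below shows one possible approach; `findLookupTargets` and `shouldLookup` are hypothetical names, and the patterns are deliberately simple assumptions rather than the project's actual detection logic.

```typescript
// Hypothetical detection helpers; the lookup adapter in src/services/ may differ.
interface LookupTargets {
  urls: string[];
  coordinates: Array<{ lat: number; lon: number }>;
}

function findLookupTargets(text: string): LookupTargets {
  // Grab http(s) links, then trim punctuation that usually belongs to the sentence.
  const urls = (text.match(/https?:\/\/[^\s<>()"']+/g) ?? []).map((url) =>
    url.replace(/[.,;:!?]+$/, "")
  );

  // Match decimal "lat, lon" pairs that do not start in the middle of a number.
  const coordinates: Array<{ lat: number; lon: number }> = [];
  const pair = /(?:^|[^\d.])(-?\d{1,2}\.\d+)\s*,\s*(-?\d{1,3}\.\d+)/g;
  let match: RegExpExecArray | null;
  while ((match = pair.exec(text)) !== null) {
    const lat = Number(match[1]);
    const lon = Number(match[2]);
    // Keep only values that are valid latitude/longitude ranges.
    if (Math.abs(lat) <= 90 && Math.abs(lon) <= 180) coordinates.push({ lat, lon });
  }
  return { urls, coordinates };
}

function shouldLookup(text: string, webSearchEnabled: boolean): boolean {
  if (!webSearchEnabled) return false;
  const targets = findLookupTargets(text);
  return targets.urls.length > 0 || targets.coordinates.length > 0;
}
```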

Scripts

npm start          # run the app
npm run dev        # run with tsx watch
npm run typecheck  # verify TypeScript

Environment variables

| Variable | Required | Purpose |
| --- | --- | --- |
| `PORT` | No | Express port. Defaults to `3000`. |
| `SARVAM_API_KEY` | Yes | Sarvam STT/TTS API key. |
| `SARVAM_STT_WS_URL` | No | Diagnostic direct WebSocket URL. |
| `SARVAM_STT_LANGUAGE_CODE` | No | STT language code. Defaults to `en-IN`. |
| `GEMINI_API_KEY` | Yes | Gemini API key. |
| `GEMINI_MODEL` | No | Primary Gemini model. |
| `GEMINI_FALLBACK_MODEL` | No | Fallback model for temporary model errors. |
| `GEMINI_MAX_RETRIES` | No | Retry count for temporary Gemini failures. |
| `AUDIO_TTL_MS` | No | How long generated audio stays in memory. |
| `WEB_SEARCH_ENABLED` | No | Set `false` to disable online lookup. |
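
To show how `GEMINI_MODEL`, `GEMINI_FALLBACK_MODEL`, and `GEMINI_MAX_RETRIES` could interact, here is a sketch of retry-then-fallback logic. It is an assumption about the behaviour, not the project's code: `generateWithFallback`, `RetryOptions`, and the injected `generate` and `isTemporary` callbacks are hypothetical, and no real Gemini SDK call is used.

```typescript
// Hypothetical retry wrapper; the real Gemini adapter in src/services/ may differ.
type Generate = (model: string, prompt: string) => Promise<string>;

interface RetryOptions {
  primaryModel: string;
  fallbackModel?: string;
  maxRetries: number; // extra attempts per model after the first one
  isTemporary: (error: unknown) => boolean; // caller decides what is retryable
}

async function generateWithFallback(
  generate: Generate,
  prompt: string,
  options: RetryOptions
): Promise<string> {
  const models = options.fallbackModel
    ? [options.primaryModel, options.fallbackModel]
    : [options.primaryModel];

  let lastError: unknown = new Error("No model was attempted");
  for (const model of models) {
    for (let attempt = 0; attempt <= options.maxRetries; attempt++) {
      try {
        return await generate(model, prompt);
      } catch (error) {
        lastError = error;
        // Permanent errors (for example, a bad API key) surface immediately.
        if (!options.isTemporary(error)) throw error;
      }
    }
  }
  // Every attempt on every model failed with a temporary error.
  throw lastError;
}
```

A production version would likely wait between attempts, for example with exponential backoff. That is omitted here to keep the control flow visible.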

Privacy notes

Microphone audio is streamed to the backend and Sarvam STT.
Input audio is not written to disk.
Generated TTS audio is stored briefly in memory and expires automatically.
Session memory is in-memory only and resets when the server restarts.
Do not commit .env, API keys, logs, uploads, or generated audio.
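
The "stored briefly in memory and expires automatically" behaviour can be pictured as a small map with per-entry expiry times. The sketch below is a generic illustration: `ExpiringStore` is a hypothetical class, not the implementation in src/memory/, and the injectable clock exists only to make the example testable.

```typescript
// Hypothetical TTL store; the real audio storage in src/memory/ may differ.
class ExpiringStore<T> {
  private entries = new Map<string, { value: T; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  set(id: string, value: T): void {
    this.sweep(); // drop anything already expired so memory does not grow unbounded
    this.entries.set(id, { value, expiresAt: this.now() + this.ttlMs });
  }

  get(id: string): T | undefined {
    const entry = this.entries.get(id);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(id);
      return undefined;
    }
    return entry.value;
  }

  get size(): number {
    return this.entries.size;
  }

  private sweep(): void {
    const current = this.now();
    this.entries.forEach((entry, id) => {
      if (entry.expiresAt <= current) this.entries.delete(id);
    });
  }
}
```

With `AUDIO_TTL_MS` as the `ttlMs` argument, generated audio would become unavailable once that many milliseconds have passed since it was stored.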

Contributing

Contributions are welcome. Please keep the project beginner-friendly:

Prefer small, readable modules.
Avoid adding databases or infrastructure unless the issue requires it.
Run npm run typecheck before opening a pull request.
Keep provider-specific details inside src/services/.
Do not commit secrets or generated files.

Roadmap

Better URL and map-link understanding
RAG over larger Markdown collections
Optional embeddings and vector DB
Web search citations
Persistent memory
Authentication
Deployment recipe

License

MIT. See LICENSE.
