Githubdiaries/indica

Indic Voice Agent

An open-source TypeScript starter for a real-time voice agent.

It captures microphone audio in the browser, streams it to a Node.js/Express backend, transcribes speech with Sarvam STT, retrieves local Markdown knowledge, asks Gemini for a grounded answer, and returns speech with Sarvam TTS.

What this project is

This repo is intentionally small and readable. It is meant to be a practical LLM-engineering portfolio project, not a full production platform.

Current features:

  • Real-time microphone capture in the browser
  • WebSocket audio streaming to Express
  • Sarvam streaming STT
  • Unicode-safe transcript harness
  • In-memory session memory
  • Local Markdown RAG from data/
  • Optional Gemini online lookup when saved knowledge contains URLs or coordinates
  • Gemini answer generation
  • Sarvam TTS audio response
  • STT-only diagnostic page at /stt-test

Not included yet:

  • Authentication
  • Database
  • Vector database
  • Graph database
  • Docker
  • Frontend framework
  • Persistent memory

Architecture

Browser microphone
  -> WebSocket
  -> Sarvam STT
  -> Harness
  -> Memory
  -> Local Markdown RAG
  -> Gemini
  -> Sarvam TTS
  -> Audio response

The harness cleans up the transcript and builds the prompt so that answers stay grounded in the retrieved knowledge and still sound natural. If the knowledge base contains a map link or coordinates, the assistant should explain the place like a person would: main area, city, nearby landmark, and simple context when available.
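
The flow above can be read as one async function per user turn. The following is a minimal sketch, not the repository's code: Stages, Turn, runTurn, and buildPrompt are hypothetical names, and the real modules under src/orchestration/ and src/services/ may be shaped differently.

```typescript
// Hypothetical stage signatures; the real adapters may differ.
interface Stages {
  transcribe(audio: Uint8Array): Promise<string>; // Sarvam STT
  clean(transcript: string): string;              // harness cleanup
  retrieve(question: string): Promise<string[]>;  // local Markdown RAG
  answer(prompt: string): Promise<string>;        // Gemini
  synthesize(text: string): Promise<Uint8Array>;  // Sarvam TTS
}

interface Turn {
  user: string;
  assistant: string;
}

// Builds a prompt that tells the model to stay within the retrieved passages.
function buildPrompt(history: Turn[], passages: string[], question: string): string {
  const memory = history
    .map((t) => `User: ${t.user}\nAssistant: ${t.assistant}`)
    .join("\n");
  return [
    "Answer only from the knowledge below. Say so if the answer is not there.",
    "Knowledge:\n" + passages.join("\n---\n"),
    "Conversation so far:\n" + (memory || "(none)"),
    "User: " + question,
  ].join("\n\n");
}

// Runs one turn: audio in, transcript and spoken reply out.
async function runTurn(stages: Stages, history: Turn[], audio: Uint8Array) {
  const question = stages.clean(await stages.transcribe(audio));
  const passages = await stages.retrieve(question);
  const reply = await stages.answer(buildPrompt(history, passages, question));
  history.push({ user: question, assistant: reply }); // in-memory session memory
  return { question, reply, audio: await stages.synthesize(reply) };
}
```

Passing the stages in as an object keeps provider details out of the pipeline, so each stage can be replaced with a fake in tests.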

Project structure

data/
  knowledge-base.md        Local knowledge used by RAG
public/
  index.html               Main voice UI
  app.js                   Browser microphone + WebSocket client
  pcm-worklet.js           AudioWorklet PCM capture
  stt-test.html            STT diagnostic page
src/
  index.ts                 Express + WebSocket server
  harness/                 Transcript cleanup and prompt building
  memory/                  In-memory session and audio storage
  orchestration/           Voice-agent pipeline
  rag/                     Local Markdown retriever
  services/                Sarvam, Gemini, and lookup adapters
  types.ts                 Shared TypeScript types
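
AudioWorklet processors receive audio as 32-bit float samples in the range -1 to 1, while streaming speech APIs commonly expect 16-bit PCM. Whether pcm-worklet.js or app.js performs this exact conversion is an assumption; the sketch below only illustrates the usual technique, and floatToPcm16 is a hypothetical name.

```typescript
// Converts Web Audio float samples (-1..1) to signed 16-bit PCM.
function floatToPcm16(samples: Float32Array): Int16Array {
  return Int16Array.from(samples, (sample) => {
    // Clamp first so out-of-range input cannot wrap around.
    const s = Math.max(-1, Math.min(1, sample));
    // Negative values scale to -32768, positive values to 32767.
    return s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
  });
}
```

The resulting Int16Array's underlying buffer can be sent as a binary WebSocket frame.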

Requirements

  • Node.js 20+
  • Sarvam API key
  • Gemini API key
  • A browser with microphone access

Setup

Install dependencies:

npm install

Create your environment file. On Windows:

copy .env.example .env

On macOS/Linux:

cp .env.example .env

Fill in .env:

PORT=3000
SARVAM_API_KEY=your_sarvam_key
SARVAM_STT_LANGUAGE_CODE=en-IN
GEMINI_API_KEY=your_gemini_key
GEMINI_MODEL=gemini-3.5-flash
GEMINI_FALLBACK_MODEL=gemini-2.5-flash
WEB_SEARCH_ENABLED=true

Never commit .env. Use .env.example for safe public configuration.
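
A server can validate these values once at startup so that a missing key fails fast instead of surfacing mid-conversation. This is a sketch under assumptions: AppConfig, requireVar, and loadConfig are hypothetical names, and treating an unset WEB_SEARCH_ENABLED as enabled is a guess, not confirmed behavior.

```typescript
type Env = Record<string, string | undefined>;

interface AppConfig {
  port: number;
  sarvamApiKey: string;
  sttLanguageCode: string;
  geminiApiKey: string;
  geminiModel: string | undefined;
  geminiFallbackModel: string | undefined;
  webSearchEnabled: boolean;
}

// Returns a trimmed value or throws with the variable name in the message.
function requireVar(env: Env, name: string): string {
  const value = env[name]?.trim();
  if (!value) throw new Error(`Missing required environment variable: ${name}`);
  return value;
}

function loadConfig(env: Env): AppConfig {
  const port = Number(env["PORT"]?.trim() || "3000"); // documented default
  if (!Number.isInteger(port) || port <= 0) {
    throw new Error(`Invalid PORT: ${env["PORT"]}`);
  }
  return {
    port,
    sarvamApiKey: requireVar(env, "SARVAM_API_KEY"),
    sttLanguageCode: env["SARVAM_STT_LANGUAGE_CODE"]?.trim() || "en-IN",
    geminiApiKey: requireVar(env, "GEMINI_API_KEY"),
    geminiModel: env["GEMINI_MODEL"]?.trim() || undefined,
    geminiFallbackModel: env["GEMINI_FALLBACK_MODEL"]?.trim() || undefined,
    // Assumption: online lookup stays on unless explicitly set to "false".
    webSearchEnabled: env["WEB_SEARCH_ENABLED"]?.trim().toLowerCase() !== "false",
  };
}
```

In a Node.js entry point this would be called as loadConfig(process.env) after the .env file has been loaded.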

Run

npm start

Open:

http://localhost:3000

Click the microphone button, speak, then stop recording.

STT diagnostic page

For speech-to-text debugging without Gemini or TTS:

http://localhost:3000/stt-test

This page shows connection state, audio chunk telemetry, raw provider events, partial transcripts, final transcripts, and detected language.

Local knowledge

Put project or domain knowledge in:

data/knowledge-base.md

The retriever reads supported files inside data/:

  • .md
  • .mdx
  • .txt
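
Since the project has no vector database, retrieval over these files can be as simple as keyword overlap. The sketch below is one possible approach, not the code in src/rag/: Doc, Hit, isSupported, tokenize, and retrieve are hypothetical names, splitting on blank lines is an assumed chunking rule, and reading the files from disk is left out so the scoring logic stays self-contained.

```typescript
interface Doc {
  path: string;
  text: string;
}

interface Hit {
  path: string;
  chunk: string;
  score: number;
}

const SUPPORTED_EXTENSIONS = [".md", ".mdx", ".txt"];

function isSupported(path: string): boolean {
  const lower = path.toLowerCase();
  return SUPPORTED_EXTENSIONS.some((ext) => lower.endsWith(ext));
}

// Unicode-aware tokenizer: letters, combining marks, and digits form words,
// so Indic scripts with vowel signs are not split mid-word.
function tokenize(text: string): string[] {
  return text.normalize("NFC").toLowerCase().match(/[\p{L}\p{M}\p{N}]+/gu) ?? [];
}

// Scores each blank-line-separated chunk by how many query terms it contains.
function retrieve(docs: Doc[], query: string, topK = 3): Hit[] {
  const terms = Array.from(new Set(tokenize(query)));
  const hits: Hit[] = [];
  for (const doc of docs.filter((d) => isSupported(d.path))) {
    for (const chunk of doc.text.split(/\n\s*\n/)) {
      const words = new Set(tokenize(chunk));
      const score = terms.filter((t) => words.has(t)).length;
      if (score > 0) hits.push({ path: doc.path, chunk: chunk.trim(), score });
    }
  }
  return hits.sort((a, b) => b.score - a.score).slice(0, topK);
}
```

Plain term overlap favors short factual entries, which suits a small knowledge base.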

Keep knowledge simple and factual. For example:

TinkerSpace location: 21/258, Seaport-Airport Road, Vidya Nagar Colony, Kalamassery, Kochi, Kerala. It is near the HMT/Thrikkakara side of Kochi.

If you store a URL, the assistant can use online lookup to understand it and answer naturally instead of dumping the raw link.
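
Before calling an online lookup, the app needs to know whether a retrieved passage contains anything worth looking up. The sketch below shows one way to detect URLs and latitude/longitude pairs with regular expressions. The patterns and the name findLookupTargets are illustrative assumptions, not the project's actual rules.

```typescript
interface LookupTargets {
  urls: string[];
  coordinates: Array<{ lat: number; lon: number }>;
}

function findLookupTargets(text: string): LookupTargets {
  const urlPattern = /https?:\/\/[^\s<>()"']+/g;
  // Drop sentence punctuation that trails a URL in running text.
  const urls = (text.match(urlPattern) ?? []).map((u) => u.replace(/[.,;:!?]+$/, ""));

  // Scan for coordinates outside URLs so map links are not counted twice.
  const rest = text.replace(urlPattern, " ");
  const pairPattern = /(?<![\d.])(-?\d{1,2}\.\d+)\s*,\s*(-?\d{1,3}\.\d+)(?!\d)/g;
  const coordinates: Array<{ lat: number; lon: number }> = [];
  let match: RegExpExecArray | null;
  while ((match = pairPattern.exec(rest)) !== null) {
    const lat = Number(match[1]);
    const lon = Number(match[2]);
    // Keep only values that are valid latitude and longitude.
    if (Math.abs(lat) <= 90 && Math.abs(lon) <= 180) coordinates.push({ lat, lon });
  }
  return { urls, coordinates };
}
```

Requiring a decimal point in both numbers keeps street addresses such as "21/258" from being read as coordinates.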

Scripts

npm start          # run the app
npm run dev        # run with tsx watch
npm run typecheck  # verify TypeScript

Environment variables

Variable                  Required  Purpose
PORT                      No        Express port. Defaults to 3000.
SARVAM_API_KEY            Yes       Sarvam STT/TTS API key.
SARVAM_STT_WS_URL         No        Diagnostic direct WebSocket URL.
SARVAM_STT_LANGUAGE_CODE  No        STT language code. Defaults to en-IN.
GEMINI_API_KEY            Yes       Gemini API key.
GEMINI_MODEL              No        Primary Gemini model.
GEMINI_FALLBACK_MODEL     No        Fallback model for temporary model errors.
GEMINI_MAX_RETRIES        No        Retry count for temporary Gemini failures.
AUDIO_TTL_MS              No        How long generated audio stays in memory.
WEB_SEARCH_ENABLED        No        Set false to disable online lookup.
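
GEMINI_MAX_RETRIES and GEMINI_FALLBACK_MODEL together suggest a retry-then-fallback policy. The sketch below shows one plausible reading: retry the primary model with exponential backoff, then make a single attempt on the fallback. The ordering, the delays, and every name here (Generate, RetryOptions, generateWithFallback) are assumptions. The generate callback stands in for the real Gemini call.

```typescript
type Generate = (model: string, prompt: string) => Promise<string>;

interface RetryOptions {
  primaryModel: string;
  fallbackModel?: string;
  maxRetries: number;                       // retries after the first attempt
  baseDelayMs?: number;                     // backoff base, assumed 250 ms
  isTemporary?: (error: unknown) => boolean;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function generateWithFallback(
  generate: Generate,
  prompt: string,
  opts: RetryOptions,
): Promise<string> {
  const isTemporary = opts.isTemporary ?? (() => true);
  const baseDelayMs = opts.baseDelayMs ?? 250;
  let lastError: unknown;

  for (let attempt = 0; attempt <= opts.maxRetries; attempt++) {
    try {
      return await generate(opts.primaryModel, prompt);
    } catch (error) {
      lastError = error;
      if (!isTemporary(error)) throw error; // permanent errors are not retried
      if (attempt < opts.maxRetries) await sleep(baseDelayMs * 2 ** attempt);
    }
  }

  // Primary model exhausted its retries: try the fallback once, if configured.
  if (opts.fallbackModel) return generate(opts.fallbackModel, prompt);
  throw lastError;
}
```

Injecting the generate function keeps the retry policy testable without network access.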

Privacy notes

  • Microphone audio is streamed to the backend and Sarvam STT.
  • Input audio is not written to disk.
  • Generated TTS audio is stored briefly in memory and expires automatically.
  • Session memory is in-memory only and resets when the server restarts.
  • Do not commit .env, API keys, logs, uploads, or generated audio.
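
The expiry behavior described for generated audio can be sketched with a Map and a time-to-live check. This is an illustration only: ExpiringAudioStore is a hypothetical class, and the code in src/memory/ may expire entries differently, for example with timers.

```typescript
interface StoredAudio {
  audio: Uint8Array;
  expiresAt: number;
}

class ExpiringAudioStore {
  private readonly entries = new Map<string, StoredAudio>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // The clock is injectable so expiry can be tested without waiting.
  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  put(id: string, audio: Uint8Array): void {
    this.sweep(); // drop stale entries on each write to keep memory bounded
    this.entries.set(id, { audio, expiresAt: this.now() + this.ttlMs });
  }

  get(id: string): Uint8Array | undefined {
    const entry = this.entries.get(id);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(id);
      return undefined;
    }
    return entry.audio;
  }

  sweep(): void {
    const t = this.now();
    this.entries.forEach((entry, id) => {
      if (entry.expiresAt <= t) this.entries.delete(id);
    });
  }

  get size(): number {
    return this.entries.size;
  }
}
```

A store like this would be constructed with the value of AUDIO_TTL_MS, and nothing in it touches disk.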

Contributing

Contributions are welcome. Please keep the project beginner-friendly:

  • Prefer small, readable modules.
  • Avoid adding databases or infrastructure unless the issue requires it.
  • Run npm run typecheck before opening a pull request.
  • Keep provider-specific details inside src/services/.
  • Do not commit secrets or generated files.

Roadmap

  • Better URL and map-link understanding
  • RAG over larger Markdown collections
  • Optional embeddings and vector DB
  • Web search citations
  • Persistent memory
  • Authentication
  • Deployment recipe

License

MIT. See LICENSE.
