
Commit 80211df

rui-ren, ruiren_microsoft, Copilot, and kunal-vaishnavi authored
Add live audio transcription streaming support to Foundry Local JS SDK (#486)
Adds real-time audio streaming support to the Foundry Local JS SDK, enabling live microphone-to-text transcription via ONNX Runtime GenAI ASR.

The existing AudioClient only supports file-based transcription. This PR introduces `LiveAudioTranscriptionSession`, which accepts continuous PCM audio chunks (e.g., from a microphone) and returns partial/final transcription results as an async iterable.

## What's included

### New files

- `src/openai/liveAudioTranscriptionClient.ts` — Streaming session with `start()`, `append()`, `getTranscriptionStream()`, `stop()`, `dispose()`
- `src/openai/liveAudioTranscriptionTypes.ts` — `LiveAudioTranscriptionResponse` and `CoreErrorResponse` interfaces, `tryParseCoreError()` helper
- `src/detail/coreInterop.ts` — Added `executeCommandWithBinary()` method and `StreamingRequestBuffer` struct for binary PCM data transport
- app.js — E2E example with microphone capture (naudiodon2) and synthetic audio fallback
- `test/openai/liveAudioTranscription.test.ts` — Unit tests for types/settings and an E2E test with synthetic PCM audio

### Modified files

- `src/imodel.ts` — Added `createLiveTranscriptionSession()` to the interface
- `src/model.ts` — Delegates to `selectedVariant.createLiveTranscriptionSession()`
- `src/modelVariant.ts` — Implementation (creates a new `LiveAudioTranscriptionSession(modelId, coreInterop)`)
- `src/index.ts` — Exports `LiveAudioTranscriptionSession`, `LiveAudioTranscriptionOptions`, `LiveAudioTranscriptionResponse`, `TranscriptionContentPart`

## API surface

```js
const session = model.createLiveTranscriptionSession();
session.settings.sampleRate = 16000;
session.settings.channels = 1;
session.settings.language = "en";

await session.start();

// Push audio from microphone callback
await session.append(pcmBytes);

// Read results as async iterable
for await (const result of session.getTranscriptionStream()) {
  console.log(result.content[0].text);
}

await session.stop();
```

## Design highlights

- **Internal async push queue** — A bounded `AsyncQueue<T>` serializes audio pushes from any context (safe for mic callbacks) and provides backpressure via a FIFO resolver queue. Mirrors C#'s `Channel<T>` pattern.
- **Binary data transport** — `executeCommandWithBinary()` sends PCM bytes alongside JSON params via `StreamingRequestBuffer`, with transcription results parsed from push responses.
- **Settings freeze** — Audio format settings are snapshot-copied and `Object.freeze()`d at `start()`, so they are immutable during the session.
- **Buffer copy** — `append()` copies the input `Uint8Array` before queueing, so the caller can safely reuse buffers.
- **Drain-on-stop** — `stop()` completes the push queue, waits for the push loop to drain, parses the final transcription from the stop response, then completes the output stream.
- **Error propagation** — `start()` failures are propagated to `outputQueue` so `getTranscriptionStream()` consumers see the error; `tryParseCoreError()` handles both raw JSON and CoreInterop-prefixed error messages.
- **Dispose safety** — `dispose()` wraps `stop()` in try/catch and never throws.

## Native core dependency

This PR adds the JS SDK surface. The three native commands (`audio_stream_start`, `audio_stream_push`, `audio_stream_stop`) are routed through the `execute_command` and new `execute_command_with_binary` exports. The code compiles with zero TypeScript errors without the native library.

## Testing

- ✅ TypeScript compilation — 0 errors across all source files
- ✅ Unit tests for `parseTranscriptionResult()`, `tryParseCoreError()`, `LiveAudioTranscriptionOptions`
- ✅ E2E test with synthetic PCM audio (skips gracefully if the native core is unavailable)

## Parity with C# SDK

This implementation mirrors the C# `LiveAudioTranscriptionSession` with identical logic:

- Same session lifecycle: `start` → `append` → `getStream` → `stop`
- Same push loop with error handling and binary data transport
- Same settings freeze and buffer copy semantics
- Same drain-before-stop ordering with final result parsing
- Same E2E test pattern (synthetic 440Hz sine wave, 100ms chunks, ConversationItem-shaped response validation)
- Same renamed types: `LiveAudioTranscription*` (matching the C# rename)

Changes from the original PR description:

| Old (incorrect) | New (matches code) |
|---|---|
| `LiveAudioTranscriptionClient` | `LiveAudioTranscriptionSession` |
| `LiveAudioTranscriptionSettings` | `LiveAudioTranscriptionOptions` |
| `LiveAudioTranscriptionResult` | `LiveAudioTranscriptionResponse` |
| `createLiveTranscriptionClient()` | `createLiveTranscriptionSession()` |

---

Co-authored-by: ruiren_microsoft <ruiren@microsoft.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com>
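The bounded `AsyncQueue<T>` named in the design highlights is internal to the SDK. The sketch below is a hypothetical implementation of that pattern, not the SDK's actual class: producers await `push()` once `capacity` items are buffered (backpressure for mic callbacks), consumers drain via `for await`, and `complete()` ends iteration after buffered items flush.

```typescript
// Minimal bounded async queue, sketching the Channel<T>-like pattern
// described above. Hypothetical; the SDK's internal class may differ.
class AsyncQueue<T> {
  private items: T[] = [];
  private pullers: ((r: IteratorResult<T>) => void)[] = []; // waiting consumers (FIFO)
  private pushers: (() => void)[] = [];                     // waiting producers (FIFO)
  private done = false;

  constructor(private capacity = 16) {}

  async push(item: T): Promise<void> {
    if (this.done) throw new Error('queue completed');
    const puller = this.pullers.shift();
    if (puller) {
      // A consumer is already waiting: hand the item over directly.
      puller({ value: item, done: false });
      return;
    }
    while (this.items.length >= this.capacity) {
      // Backpressure: park this producer until a consumer frees a slot.
      await new Promise<void>((resolve) => this.pushers.push(resolve));
    }
    this.items.push(item);
  }

  complete(): void {
    this.done = true;
    // Release any consumers still waiting for data.
    for (const p of this.pullers.splice(0)) p({ value: undefined as any, done: true });
  }

  async *[Symbol.asyncIterator](): AsyncIterator<T> {
    for (;;) {
      if (this.items.length > 0) {
        const item = this.items.shift()!;
        this.pushers.shift()?.(); // wake one parked producer
        yield item;
      } else if (this.done) {
        return;
      } else {
        const r = await new Promise<IteratorResult<T>>((resolve) => this.pullers.push(resolve));
        if (r.done) return;
        yield r.value;
      }
    }
  }
}
```

The FIFO resolver arrays are what keep push order stable when multiple callbacks race, which is the property the PR's mic-callback safety claim relies on.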
1 parent a742610 commit 80211df

11 files changed: 1018 additions & 0 deletions

Lines changed: 58 additions & 0 deletions
# Live Audio Transcription Example

Real-time microphone-to-text transcription using the Foundry Local JS SDK with Nemotron ASR.

## Prerequisites

- [Foundry Local](https://github.com/microsoft/Foundry-Local) installed
- Node.js 18+
- A microphone (optional — falls back to synthetic audio)

## Setup

```bash
npm install foundry-local-sdk naudiodon2
```

> **Note:** `naudiodon2` is optional — it provides cross-platform microphone capture. Without it, the example falls back to synthetic audio for testing.

## Run

```bash
node app.js
```

Speak into your microphone. Transcription appears in real time. Press `Ctrl+C` to stop.

## How it works

1. Initializes the Foundry Local SDK and loads the Nemotron ASR model
2. Creates a `LiveAudioTranscriptionSession` with 16kHz/16-bit/mono PCM settings
3. Captures microphone audio via `naudiodon2` (or generates synthetic audio as a fallback)
4. Pushes PCM chunks to the SDK via `session.append()`
5. Reads transcription results via `for await (const result of session.getTranscriptionStream())`
6. Accesses text via `result.content[0].text` (OpenAI Realtime ConversationItem pattern)

## API

```javascript
const audioClient = model.createAudioClient();
const session = audioClient.createLiveTranscriptionSession();
session.settings.sampleRate = 16000;
session.settings.channels = 1;
session.settings.language = 'en';

await session.start();

// Push audio
await session.append(pcmBytes);

// Read results
for await (const result of session.getTranscriptionStream()) {
  console.log(result.content[0].text);       // transcribed text
  console.log(result.content[0].transcript); // alias (OpenAI compat)
  console.log(result.is_final);              // true for final results
}

await session.stop();
```
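The session expects raw 16-bit signed little-endian PCM. If your capture source yields Float32 samples (WebAudio, for example), convert them before calling `session.append()`. The helper below is illustrative, not part of the SDK:

```javascript
// Convert Float32 samples in [-1, 1] (e.g., from WebAudio) to the
// 16-bit signed little-endian mono PCM bytes the session expects.
// Illustrative helper; not an SDK API.
function floatTo16BitPcm(float32Samples) {
  const bytes = new Uint8Array(float32Samples.length * 2);
  for (let i = 0; i < float32Samples.length; i++) {
    // Clamp to [-1, 1], then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    const v = Math.round(s < 0 ? s * 32768 : s * 32767);
    bytes[i * 2] = v & 0xff;            // low byte first (little-endian)
    bytes[i * 2 + 1] = (v >> 8) & 0xff; // high byte
  }
  return bytes;
}

// Example: silence, full-scale positive, full-scale negative
const pcm = floatTo16BitPcm(new Float32Array([0, 1, -1]));
// pcm is 6 bytes: 0x00 0x00, 0xff 0x7f, 0x00 0x80
```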
Lines changed: 157 additions & 0 deletions
```javascript
// Live Audio Transcription Example — Foundry Local JS SDK
//
// Demonstrates real-time microphone-to-text using the JS SDK.
// Requires: npm install foundry-local-sdk naudiodon2
//
// Usage: node app.js

import { FoundryLocalManager } from 'foundry-local-sdk';

console.log('╔══════════════════════════════════════════════════════════╗');
console.log('║ Foundry Local — Live Audio Transcription (JS SDK)        ║');
console.log('╚══════════════════════════════════════════════════════════╝');
console.log();

// Initialize the Foundry Local SDK
console.log('Initializing Foundry Local SDK...');
const manager = FoundryLocalManager.create({
  appName: 'foundry_local_live_audio',
  logLevel: 'info'
});
console.log('✓ SDK initialized');

// Get and load the nemotron model
const modelAlias = 'nemotron';
let model = await manager.catalog.getModel(modelAlias);
if (!model) {
  console.error(`ERROR: Model "${modelAlias}" not found in catalog.`);
  process.exit(1);
}

console.log(`Found model: ${model.id}`);
console.log('Downloading model (if needed)...');
await model.download((progress) => {
  process.stdout.write(`\rDownloading... ${progress.toFixed(2)}%`);
});
console.log('\n✓ Model downloaded');

console.log('Loading model...');
await model.load();
console.log('✓ Model loaded');

// Create live transcription session
const audioClient = model.createAudioClient();
const session = audioClient.createLiveTranscriptionSession();
session.settings.sampleRate = 16000; // Default is 16000; shown here for clarity
session.settings.channels = 1;
session.settings.bitsPerSample = 16;
session.settings.language = 'en';

console.log('Starting streaming session...');
await session.start();
console.log('✓ Session started');

// Read transcription results in background
const readPromise = (async () => {
  try {
    for await (const result of session.getTranscriptionStream()) {
      const text = result.content?.[0]?.text;
      if (result.is_final) {
        console.log();
        console.log(`  [FINAL] ${text}`);
      } else if (text) {
        process.stdout.write(text);
      }
    }
  } catch (err) {
    if (err.name !== 'AbortError') {
      console.error('Stream error:', err.message);
    }
  }
})();

// --- Microphone capture ---
// This example uses naudiodon2 for cross-platform audio capture.
// Install with: npm install naudiodon2
//
// If you prefer a different audio library, just push PCM bytes
// (16-bit signed LE, mono, 16kHz) via session.append().

let audioInput;
try {
  const { default: portAudio } = await import('naudiodon2');

  audioInput = portAudio.AudioIO({
    inOptions: {
      channelCount: session.settings.channels,
      sampleFormat: session.settings.bitsPerSample === 16
        ? portAudio.SampleFormat16Bit
        : portAudio.SampleFormat32Bit,
      sampleRate: session.settings.sampleRate,
      framesPerBuffer: 1600, // 100ms chunks
      maxQueue: 15 // buffer during event-loop blocks from sync FFI calls
    }
  });

  let appendPending = false;
  audioInput.on('data', (buffer) => {
    if (appendPending) return; // drop frame while backpressured
    const pcm = new Uint8Array(buffer);
    appendPending = true;
    session.append(pcm).then(() => {
      appendPending = false;
    }).catch((err) => {
      appendPending = false;
      console.error('append error:', err.message);
    });
  });

  console.log();
  console.log('════════════════════════════════════════════════════════════');
  console.log('  LIVE TRANSCRIPTION ACTIVE');
  console.log('  Speak into your microphone.');
  console.log('  Press Ctrl+C to stop.');
  console.log('════════════════════════════════════════════════════════════');
  console.log();

  audioInput.start();
} catch (err) {
  console.warn('⚠ Could not initialize microphone (naudiodon2 may not be installed).');
  console.warn('  Install with: npm install naudiodon2');
  console.warn('  Falling back to synthetic audio test...');
  console.warn();

  // Fallback: push 2 seconds of synthetic PCM (440Hz sine wave)
  const sampleRate = session.settings.sampleRate;
  const duration = 2;
  const totalSamples = sampleRate * duration;
  const pcmBytes = new Uint8Array(totalSamples * 2);
  for (let i = 0; i < totalSamples; i++) {
    const t = i / sampleRate;
    const sample = Math.round(32767 * 0.5 * Math.sin(2 * Math.PI * 440 * t));
    pcmBytes[i * 2] = sample & 0xFF;
    pcmBytes[i * 2 + 1] = (sample >> 8) & 0xFF;
  }

  // Push in 100ms chunks
  const chunkSize = (sampleRate / 10) * 2;
  for (let offset = 0; offset < pcmBytes.length; offset += chunkSize) {
    const len = Math.min(chunkSize, pcmBytes.length - offset);
    await session.append(pcmBytes.slice(offset, offset + len));
  }

  console.log('✓ Synthetic audio pushed');
}

// Handle graceful shutdown
process.on('SIGINT', async () => {
  console.log('\n\nStopping...');
  if (audioInput) {
    audioInput.quit();
  }
  await session.stop();
  await readPromise;
  await model.unload();
  console.log('✓ Done');
  process.exit(0);
});
```

sdk/js/src/detail/coreInterop.ts

Lines changed: 59 additions & 0 deletions
```diff
@@ -19,6 +19,16 @@ koffi.struct('ResponseBuffer', {
   ErrorLength: 'int32_t',
 });
 
+// Extended request struct for binary data (audio streaming)
+koffi.struct('StreamingRequestBuffer', {
+  Command: 'char*',
+  CommandLength: 'int32_t',
+  Data: 'char*',        // JSON params
+  DataLength: 'int32_t',
+  BinaryData: 'void*',  // raw PCM audio bytes
+  BinaryDataLength: 'int32_t',
+});
+
 const CallbackType = koffi.proto('void CallbackType(void *data, int32_t length, void *userData)');
 
 const __filename = fileURLToPath(import.meta.url);
@@ -28,6 +38,7 @@ export class CoreInterop {
   private lib: any;
   private execute_command: any;
   private execute_command_with_callback: any;
+  private execute_command_with_binary: any = null;
 
   private static _getLibraryExtension(): string {
     const platform = process.platform;
@@ -93,6 +104,7 @@
 
     this.execute_command = this.lib.func('void execute_command(RequestBuffer *request, _Inout_ ResponseBuffer *response)');
     this.execute_command_with_callback = this.lib.func('void execute_command_with_callback(RequestBuffer *request, _Inout_ ResponseBuffer *response, CallbackType *callback, void *userData)');
+    this.execute_command_with_binary = this.lib.func('void execute_command_with_binary(StreamingRequestBuffer *request, _Inout_ ResponseBuffer *response)');
   }
 
   public executeCommand(command: string, params?: any): string {
@@ -129,6 +141,53 @@
     }
   }
 
+  /**
+   * Execute a native command with binary data (e.g., audio PCM bytes).
+   * Uses the execute_command_with_binary native entry point which accepts
+   * both JSON params and raw binary data via StreamingRequestBuffer.
+   */
+  public executeCommandWithBinary(command: string, params: any, binaryData: Uint8Array): string {
+    const cmdBuf = koffi.alloc('char', command.length + 1);
+    koffi.encode(cmdBuf, 'char', command, command.length + 1);
+
+    const dataStr = params ? JSON.stringify(params) : '';
+    const dataBytes = this._toBytes(dataStr);
+    const dataBuf = koffi.alloc('char', dataBytes.length + 1);
+    koffi.encode(dataBuf, 'char', dataStr, dataBytes.length + 1);
+
+    // For binary data, use a Node.js Buffer which allocates stable external memory
+    // that won't be moved by V8's garbage collector during the FFI call.
+    const binLength = binaryData.length;
+    const binBuf = Buffer.from(binaryData);
+
+    // Use koffi.as to pass the Buffer directly as a typed pointer
+    const binTypedPtr = koffi.as(binBuf, 'void *');
+
+    const req = {
+      Command: koffi.address(cmdBuf),
+      CommandLength: command.length,
+      Data: koffi.address(dataBuf),
+      DataLength: dataBytes.length,
+      BinaryData: binTypedPtr,
+      BinaryDataLength: binLength
+    };
+    const res = { Data: 0, DataLength: 0, Error: 0, ErrorLength: 0 };
+
+    this.execute_command_with_binary(req, res);
+
+    try {
+      if (res.Error) {
+        const errorMsg = koffi.decode(res.Error, 'char', res.ErrorLength);
+        throw new Error(`Command '${command}' failed: ${errorMsg}`);
+      }
+
+      return res.Data ? koffi.decode(res.Data, 'char', res.DataLength) : "";
+    } finally {
+      if (res.Data) koffi.free(res.Data);
+      if (res.Error) koffi.free(res.Error);
+    }
+  }
+
   public executeCommandStreaming(command: string, params: any, callback: (chunk: string) => void): Promise<void> {
     const cmdBuf = koffi.alloc('char', command.length + 1);
     koffi.encode(cmdBuf, 'char', command, command.length + 1);
```
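For context, the three native audio commands from the PR description are routed through this new method. The sketch below shows the shape of such a call against a stand-in interop object; the `audio_stream_push` command name comes from the PR description, while the params shape, `pushChunk`, and `fake` are illustrative assumptions, not SDK code:

```typescript
// Sketch only: the { model } params shape and this interface are assumptions.
interface BinaryInterop {
  executeCommandWithBinary(cmd: string, params: unknown, bin: Uint8Array): string;
}

// Route one PCM chunk through the binary entry point and parse the JSON reply.
function pushChunk(interop: BinaryInterop, modelId: string, chunk: Uint8Array): unknown {
  const json = interop.executeCommandWithBinary('audio_stream_push', { model: modelId }, chunk);
  return json ? JSON.parse(json) : null;
}

// Stand-in interop that echoes what it received, for illustration only.
const fake: BinaryInterop = {
  executeCommandWithBinary: (cmd, params, bin) =>
    JSON.stringify({ cmd, model: (params as { model: string }).model, bytes: bin.length }),
};
```

The real call site lives inside `LiveAudioTranscriptionSession`'s push loop, with the response parsed into partial/final transcription results.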

sdk/js/src/imodel.ts

Lines changed: 8 additions & 0 deletions
```diff
@@ -1,5 +1,6 @@
 import { ChatClient } from './openai/chatClient.js';
 import { AudioClient } from './openai/audioClient.js';
+import { LiveAudioTranscriptionSession } from './openai/liveAudioTranscriptionClient.js';
 import { ResponsesClient } from './openai/responsesClient.js';
 
 export interface IModel {
@@ -22,6 +23,13 @@ export interface IModel {
 
   createChatClient(): ChatClient;
   createAudioClient(): AudioClient;
+
+  /**
+   * Creates a LiveAudioTranscriptionSession for real-time audio streaming ASR.
+   * The model must be loaded before calling this method.
+   * @returns A LiveAudioTranscriptionSession instance.
+   */
+  createLiveTranscriptionSession(): LiveAudioTranscriptionSession;
   /**
    * Creates a ResponsesClient for interacting with the model via the Responses API.
    * Unlike createChatClient/createAudioClient (which use FFI), the Responses API
```

sdk/js/src/index.ts

Lines changed: 2 additions & 0 deletions
```diff
@@ -6,6 +6,8 @@ export { ModelVariant } from './modelVariant.js';
 export type { IModel } from './imodel.js';
 export { ChatClient, ChatClientSettings } from './openai/chatClient.js';
 export { AudioClient, AudioClientSettings } from './openai/audioClient.js';
+export { LiveAudioTranscriptionSession, LiveAudioTranscriptionOptions } from './openai/liveAudioTranscriptionClient.js';
+export type { LiveAudioTranscriptionResponse, TranscriptionContentPart } from './openai/liveAudioTranscriptionTypes.js';
 export { ResponsesClient, ResponsesClientSettings, getOutputText } from './openai/responsesClient.js';
 export { ModelLoadManager } from './detail/modelLoadManager.js';
 /** @internal */
```

sdk/js/src/model.ts

Lines changed: 9 additions & 0 deletions
```diff
@@ -1,6 +1,7 @@
 import { ModelVariant } from './modelVariant.js';
 import { ChatClient } from './openai/chatClient.js';
 import { AudioClient } from './openai/audioClient.js';
+import { LiveAudioTranscriptionSession } from './openai/liveAudioTranscriptionClient.js';
 import { ResponsesClient } from './openai/responsesClient.js';
 import { IModel } from './imodel.js';
 
@@ -179,6 +180,14 @@ export class Model implements IModel {
     return this.selectedVariant.createAudioClient();
   }
 
+  /**
+   * Creates a LiveAudioTranscriptionSession for real-time audio streaming ASR.
+   * @returns A LiveAudioTranscriptionSession instance.
+   */
+  public createLiveTranscriptionSession(): LiveAudioTranscriptionSession {
+    return this.selectedVariant.createLiveTranscriptionSession();
+  }
+
   /**
    * Creates a ResponsesClient for interacting with the model via the Responses API.
    * @param baseUrl - The base URL of the Foundry Local web service.
```

sdk/js/src/modelVariant.ts

Lines changed: 9 additions & 0 deletions
```diff
@@ -3,6 +3,7 @@ import { ModelLoadManager } from './detail/modelLoadManager.js';
 import { ModelInfo } from './types.js';
 import { ChatClient } from './openai/chatClient.js';
 import { AudioClient } from './openai/audioClient.js';
+import { LiveAudioTranscriptionSession } from './openai/liveAudioTranscriptionClient.js';
 import { ResponsesClient } from './openai/responsesClient.js';
 import { IModel } from './imodel.js';
 
@@ -149,6 +150,14 @@ export class ModelVariant implements IModel {
     return new AudioClient(this._modelInfo.id, this.coreInterop);
   }
 
+  /**
+   * Creates a LiveAudioTranscriptionSession for real-time audio streaming ASR.
+   * @returns A LiveAudioTranscriptionSession instance.
+   */
+  public createLiveTranscriptionSession(): LiveAudioTranscriptionSession {
+    return new LiveAudioTranscriptionSession(this._modelInfo.id, this.coreInterop);
+  }
+
   /**
    * Creates a ResponsesClient for interacting with the model via the Responses API.
    * @param baseUrl - The base URL of the Foundry Local web service.
```
