
Commit d1a9e3c

Authored by rui-ren, ruiren_microsoft, Copilot, and kunal-vaishnavi
Add live audio transcription streaming support to Foundry Local C# SDK (#485)
### Description

Adds real-time audio streaming support to the Foundry Local C# SDK, enabling live microphone-to-text transcription via ONNX Runtime GenAI's StreamingProcessor API (Nemotron ASR).

The existing `OpenAIAudioClient` only supports file-based transcription. This PR introduces `LiveAudioTranscriptionSession`, which accepts continuous PCM audio chunks (e.g., from a microphone) and returns partial/final transcription results as an async stream.

### What's included

**New files**

- `src/OpenAI/LiveAudioTranscriptionClient.cs` — Streaming session with `StartAsync()`, `AppendAsync()`, `GetTranscriptionStream()`, `StopAsync()`
- `src/OpenAI/LiveAudioTranscriptionTypes.cs` — `LiveAudioTranscriptionResponse` (extends `AudioCreateTranscriptionResponse`) and `CoreErrorResponse` types
- `test/FoundryLocal.Tests/LiveAudioTranscriptionTests.cs` — Unit tests for deserialization, settings, and state guards

**Modified files**

- `src/OpenAI/AudioClient.cs` — Added `CreateLiveTranscriptionSession()` factory method
- `src/Detail/ICoreInterop.cs` — Added the `StreamingRequestBuffer` struct and the `StartAudioStream`, `PushAudioData`, `StopAudioStream` interface methods
- `src/Detail/CoreInterop.cs` — Routes audio commands through the existing `execute_command` / `execute_command_with_binary` native entry points
- `src/Detail/JsonSerializationContext.cs` — Registered `LiveAudioTranscriptionResponse` for AOT compatibility
- `README.md` — Added live audio transcription documentation

### API surface

```csharp
var audioClient = await model.GetAudioClientAsync();
var session = audioClient.CreateLiveTranscriptionSession();

session.Settings.SampleRate = 16000;
session.Settings.Channels = 1;
session.Settings.Language = "en";

await session.StartAsync();

// Push audio from a microphone callback
await session.AppendAsync(pcmBytes);

// Read results as an async stream
await foreach (var result in session.GetTranscriptionStream())
{
    Console.Write(result.Text);
}

await session.StopAsync();
```

### Design highlights

- **Output type alignment** — `LiveAudioTranscriptionResponse` extends `AudioCreateTranscriptionResponse` for an output format consistent with file-based transcription
- **Internal push queue** — A bounded `Channel<T>` serializes audio pushes from any thread (safe for mic callbacks) and provides backpressure
- **Fail-fast on errors** — The push loop terminates immediately on any native error (no retry logic)
- **Settings freeze** — Audio format settings are snapshot-copied at `StartAsync()` and immutable for the duration of the session
- **Cancellation-safe stop** — `StopAsync` always calls native stop, even if cancelled, preventing native session leaks
- **Dedicated session CTS** — The push loop uses its own `CancellationTokenSource`, decoupled from the caller's token
- **Routes through existing exports** — `StartAudioStream` and `StopAudioStream` route through `execute_command`; `PushAudioData` routes through `execute_command_with_binary`. No new native entry points are required.

### Core integration (neutron-server)

The Core side (`AudioStreamingSession.cs`) uses `StreamingProcessor` + `Generator` + `Tokenizer` + `TokenizerStream` from onnxruntime-genai to perform real-time RNNT decoding. The native commands (`audio_stream_start`/`push`/`stop`) are handled as cases in `NativeInterop.ExecuteCommandManaged` / `ExecuteCommandWithBinaryManaged`.

### Verified working

- ✅ SDK build succeeds (0 errors, 0 warnings)
- ✅ Unit tests for JSON deserialization, type inheritance, settings, and state guards
- ✅ GenAI `StreamingProcessor` pipeline verified with a WAV file (correct transcript)
- ✅ Core `TranscribeChunk` byte[] PCM path matches the reference float[] path exactly
- ✅ Full E2E simulation: SDK Channel + JSON serialization + session management
- ✅ Live microphone test: real-time transcription through SDK → Core → GenAI

---------

Co-authored-by: ruiren_microsoft <ruiren@microsoft.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com>
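The bounded push queue with backpressure and the fail-fast push loop described in the design highlights could be sketched roughly as follows. This is an illustrative, self-contained sketch; `AppendAsync`, `PushLoopAsync`, and `pushToNative` are hypothetical stand-ins, not the SDK's actual internals.

```csharp
using System.Threading.Channels;

// Bounded queue: mic callbacks write from any thread; one push loop drains it.
var queue = Channel.CreateBounded<ReadOnlyMemory<byte>>(new BoundedChannelOptions(64)
{
    SingleReader = true,                    // a single push loop drains the queue
    SingleWriter = false,                   // mic callbacks may write from any thread
    FullMode = BoundedChannelFullMode.Wait  // backpressure: writers wait rather than drop audio
});

async ValueTask AppendAsync(ReadOnlyMemory<byte> pcm) =>
    await queue.Writer.WriteAsync(pcm);

// Push loop: forward each chunk to the native layer; terminate on the first error.
async Task PushLoopAsync(Func<ReadOnlyMemory<byte>, string?> pushToNative, CancellationToken ct)
{
    await foreach (var chunk in queue.Reader.ReadAllAsync(ct))
    {
        var error = pushToNative(chunk);                // stand-in for PushAudioData(...)
        if (error is not null)
            throw new InvalidOperationException(error); // fail fast, no retries
    }
}

// Demo: push two chunks, complete the writer, drain with a no-op native push.
await AppendAsync(new byte[320]);
await AppendAsync(new byte[320]);
queue.Writer.TryComplete();
await PushLoopAsync(_ => null, CancellationToken.None);
Console.WriteLine("drained");
```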
1 parent 9b576e3 · commit d1a9e3c

14 files changed: 1,134 additions and 5 deletions

samples/cs/GettingStarted/Directory.Packages.props

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,7 +1,7 @@
 <Project>
   <PropertyGroup>
     <ManagePackageVersionsCentrally>true</ManagePackageVersionsCentrally>
-    <OnnxRuntimeGenAIVersion>0.12.1</OnnxRuntimeGenAIVersion>
+    <OnnxRuntimeGenAIVersion>0.13.0-dev-20260319-1131106-439ca0d51</OnnxRuntimeGenAIVersion>
     <OnnxRuntimeVersion>1.23.2</OnnxRuntimeVersion>
   </PropertyGroup>
   <ItemGroup>
```
Lines changed: 32 additions & 0 deletions
```xml
<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net9.0</TargetFramework>
    <ImplicitUsings>enable</ImplicitUsings>
    <Nullable>enable</Nullable>
  </PropertyGroup>

  <PropertyGroup Condition="'$(RuntimeIdentifier)'==''">
    <RuntimeIdentifier>$(NETCoreSdkRuntimeIdentifier)</RuntimeIdentifier>
  </PropertyGroup>

  <!-- Include the main program -->
  <ItemGroup>
    <Compile Include="../../src/LiveAudioTranscriptionExample/*.cs" />
    <Compile Include="../../src/Shared/*.cs" />
  </ItemGroup>

  <!-- Packages -->
  <ItemGroup>
    <PackageReference Include="Microsoft.AI.Foundry.Local" />
    <PackageReference Include="NAudio" Version="2.2.1" />
  </ItemGroup>

  <!-- ONNX Runtime GPU and CUDA provider (required for Linux) -->
  <ItemGroup Condition="'$(RuntimeIdentifier)' == 'linux-x64'">
    <PackageReference Include="Microsoft.ML.OnnxRuntime.Gpu" />
    <PackageReference Include="Microsoft.ML.OnnxRuntimeGenAI.Cuda" />
  </ItemGroup>

</Project>
```
Lines changed: 106 additions & 0 deletions
```csharp
// Live Audio Transcription — Foundry Local SDK Example
//
// Demonstrates real-time microphone-to-text using:
//   SDK (FoundryLocalManager) → Core (NativeAOT DLL) → onnxruntime-genai (StreamingProcessor)

using Microsoft.AI.Foundry.Local;
using NAudio.Wave;

Console.WriteLine("===========================================================");
Console.WriteLine("  Foundry Local -- Live Audio Transcription Demo");
Console.WriteLine("===========================================================");
Console.WriteLine();

var config = new Configuration
{
    AppName = "foundry_local_samples",
    LogLevel = Microsoft.AI.Foundry.Local.LogLevel.Information
};

await FoundryLocalManager.CreateAsync(config, Utils.GetAppLogger());
var mgr = FoundryLocalManager.Instance;

await Utils.RunWithSpinner("Registering execution providers", mgr.EnsureEpsDownloadedAsync());

var catalog = await mgr.GetCatalogAsync();

var model = await catalog.GetModelAsync("nemotron") ?? throw new Exception("Model \"nemotron\" not found in catalog");

await model.DownloadAsync(progress =>
{
    Console.Write($"\rDownloading model: {progress:F2}%");
    if (progress >= 100f)
    {
        Console.WriteLine();
    }
});

Console.Write($"Loading model {model.Id}...");
await model.LoadAsync();
Console.WriteLine("done.");

var audioClient = await model.GetAudioClientAsync();
var session = audioClient.CreateLiveTranscriptionSession();
session.Settings.SampleRate = 16000; // Default is 16000; shown here to match the NAudio WaveFormat below
session.Settings.Channels = 1;
session.Settings.Language = "en";

await session.StartAsync();
Console.WriteLine(" Session started");

var readTask = Task.Run(async () =>
{
    try
    {
        await foreach (var result in session.GetTranscriptionStream())
        {
            var text = result.Content?[0]?.Text;
            if (result.IsFinal)
            {
                Console.WriteLine();
                Console.WriteLine($" [FINAL] {text}");
                Console.Out.Flush();
            }
            else if (!string.IsNullOrEmpty(text))
            {
                Console.ForegroundColor = ConsoleColor.Cyan;
                Console.Write(text);
                Console.ResetColor();
                Console.Out.Flush();
            }
        }
    }
    catch (OperationCanceledException) { }
});

using var waveIn = new WaveInEvent
{
    WaveFormat = new WaveFormat(rate: 16000, bits: 16, channels: 1),
    BufferMilliseconds = 100
};

waveIn.DataAvailable += (sender, e) =>
{
    if (e.BytesRecorded > 0)
    {
        _ = session.AppendAsync(new ReadOnlyMemory<byte>(e.Buffer, 0, e.BytesRecorded));
    }
};

Console.WriteLine();
Console.WriteLine("===========================================================");
Console.WriteLine("  LIVE TRANSCRIPTION ACTIVE");
Console.WriteLine("  Speak into your microphone.");
Console.WriteLine("  Transcription appears in real-time (cyan text).");
Console.WriteLine("  Press ENTER to stop recording.");
Console.WriteLine("===========================================================");
Console.WriteLine();

waveIn.StartRecording();
Console.ReadLine();
waveIn.StopRecording();

await session.StopAsync();
await readTask;

await model.UnloadAsync();
```
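The sample pushes 16-bit little-endian mono PCM, and the PR notes that the Core's `TranscribeChunk` byte[] PCM path matches the reference float[] path exactly. Conceptually, that byte-to-float normalization is just the following — a sketch under that assumption, not the Core's actual code:

```csharp
// Sketch: convert 16-bit little-endian PCM bytes to normalized floats in [-1, 1).
// A byte[] PCM path must do this to match a reference float[] path sample-for-sample.
static float[] PcmS16ToFloat(ReadOnlySpan<byte> pcm)
{
    var samples = new float[pcm.Length / 2];
    for (int i = 0; i < samples.Length; i++)
    {
        short s = (short)(pcm[2 * i] | (pcm[2 * i + 1] << 8)); // little-endian pair → signed 16-bit
        samples[i] = s / 32768f;                               // normalize by 2^15
    }
    return samples;
}

// Example: silence, full-scale negative, near-full-scale positive.
var f = PcmS16ToFloat(new byte[] { 0x00, 0x00, 0x00, 0x80, 0xFF, 0x7F });
Console.WriteLine(string.Join(", ", f)); // prints 0, -1, and ~0.99997
```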
Lines changed: 30 additions & 0 deletions
```xml
<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <ImplicitUsings>enable</ImplicitUsings>
    <Nullable>enable</Nullable>
    <!-- For Windows use the following -->
    <TargetFramework>net9.0-windows10.0.26100</TargetFramework>
    <WindowsAppSDKSelfContained>false</WindowsAppSDKSelfContained>
    <Platforms>ARM64;x64</Platforms>
    <WindowsPackageType>None</WindowsPackageType>
    <EnableCoreMrtTooling>false</EnableCoreMrtTooling>
  </PropertyGroup>

  <PropertyGroup Condition="'$(RuntimeIdentifier)'==''">
    <RuntimeIdentifier>$(NETCoreSdkRuntimeIdentifier)</RuntimeIdentifier>
  </PropertyGroup>

  <ItemGroup>
    <Compile Include="../../src/LiveAudioTranscriptionExample/*.cs" />
    <Compile Include="../../src/Shared/*.cs" />
  </ItemGroup>

  <!-- Use WinML package for local Foundry SDK on Windows -->
  <ItemGroup>
    <PackageReference Include="Microsoft.AI.Foundry.Local.WinML" />
    <PackageReference Include="NAudio" Version="2.2.1" />
  </ItemGroup>

</Project>
```

sdk/cs/README.md

Lines changed: 59 additions & 0 deletions
`@@ -233,6 +233,63 @@` — new section added after the existing `audioClient.Settings` example:

````markdown
### Live Audio Transcription (Real-Time Streaming)

For real-time microphone-to-text transcription, use `CreateLiveTranscriptionSession()`. Audio is pushed as raw PCM chunks and transcription results stream back as an `IAsyncEnumerable`.

The streaming result type (`LiveAudioTranscriptionResponse`) extends `ConversationItem` from the Betalgo OpenAI SDK's Realtime models, so it's compatible with the OpenAI Realtime API pattern. Access transcribed text via `result.Content[0].Text` or `result.Content[0].Transcript`.

```csharp
var audioClient = await model.GetAudioClientAsync();
var session = audioClient.CreateLiveTranscriptionSession();

// Configure audio format (must be set before StartAsync)
session.Settings.SampleRate = 16000;
session.Settings.Channels = 1;
session.Settings.Language = "en";

await session.StartAsync();

// Push audio from a microphone callback (thread-safe)
waveIn.DataAvailable += (sender, e) =>
{
    _ = session.AppendAsync(new ReadOnlyMemory<byte>(e.Buffer, 0, e.BytesRecorded));
};

// Read transcription results as they arrive
await foreach (var result in session.GetTranscriptionStream())
{
    // result follows the OpenAI Realtime ConversationItem pattern:
    // - result.Content[0].Text — incremental transcribed text (per chunk, not accumulated)
    // - result.Content[0].Transcript — alias for Text (OpenAI Realtime compatibility)
    // - result.IsFinal — true for final results, false for interim hypotheses
    // - result.StartTime / EndTime — segment timing in seconds
    Console.Write(result.Content?[0]?.Text);
}

await session.StopAsync();
```

#### Output Type

| Field | Type | Description |
|-------|------|-------------|
| `Content` | `List<TranscriptionContentPart>` | Content parts. Access text via `Content[0].Text` or `Content[0].Transcript`. |
| `IsFinal` | `bool` | Whether this is a final or interim result. Nemotron always returns `true`. |
| `StartTime` | `double?` | Start time offset in the audio stream (seconds). |
| `EndTime` | `double?` | End time offset in the audio stream (seconds). |
| `Id` | `string?` | Unique identifier for this result (if available). |

#### Session Lifecycle

| Method | Description |
|--------|-------------|
| `StartAsync()` | Initialize the streaming session. Settings are frozen after this call. |
| `AppendAsync(pcmData)` | Push a chunk of raw PCM audio. Thread-safe (bounded internal queue). |
| `GetTranscriptionStream()` | Async enumerable of transcription results. |
| `StopAsync()` | Signal end-of-audio, flush remaining audio, and clean up. |
| `DisposeAsync()` | Calls `StopAsync` if needed. Use `await using` for automatic cleanup. |
````

The next unchanged section follows:
### Web Service

Start an OpenAI-compatible REST endpoint for use by external tools or processes:
`@@ -297,6 +354,8 @@` — two rows added to the Key types table:

```markdown
| [`ModelVariant`](./docs/api/microsoft.ai.foundry.local.modelvariant.md) | Specific model variant (hardware/quantization) |
| [`OpenAIChatClient`](./docs/api/microsoft.ai.foundry.local.openaichatclient.md) | Chat completions (sync + streaming) |
| [`OpenAIAudioClient`](./docs/api/microsoft.ai.foundry.local.openaiaudioclient.md) | Audio transcription (sync + streaming) |
| [`LiveAudioTranscriptionSession`](./docs/api/microsoft.ai.foundry.local.openai.liveaudiotranscriptionsession.md) | Real-time audio streaming session |
| [`LiveAudioTranscriptionResponse`](./docs/api/microsoft.ai.foundry.local.openai.liveaudiotranscriptionresponse.md) | Streaming transcription result (ConversationItem-shaped) |
| [`ModelInfo`](./docs/api/microsoft.ai.foundry.local.modelinfo.md) | Full model metadata record |
```
## Tests

sdk/cs/src/Detail/CoreInterop.cs

Lines changed: 115 additions & 0 deletions
`@@ -158,6 +158,31 @@` — new P/Invoke imports, added between the existing `CoreExecuteCommandWithCallback` import and the `CallbackHelper` class:

```csharp
[LibraryImport(LibraryName, EntryPoint = "execute_command_with_binary")]
[UnmanagedCallConv(CallConvs = new[] { typeof(System.Runtime.CompilerServices.CallConvCdecl) })]
private static unsafe partial void CoreExecuteCommandWithBinary(StreamingRequestBuffer* nativeRequest,
    ResponseBuffer* nativeResponse);

// --- Audio streaming P/Invoke imports (kept for future dedicated entry points) ---

[LibraryImport(LibraryName, EntryPoint = "audio_stream_start")]
[UnmanagedCallConv(CallConvs = new[] { typeof(System.Runtime.CompilerServices.CallConvCdecl) })]
private static unsafe partial void CoreAudioStreamStart(
    RequestBuffer* request,
    ResponseBuffer* response);

[LibraryImport(LibraryName, EntryPoint = "audio_stream_push")]
[UnmanagedCallConv(CallConvs = new[] { typeof(System.Runtime.CompilerServices.CallConvCdecl) })]
private static unsafe partial void CoreAudioStreamPush(
    StreamingRequestBuffer* request,
    ResponseBuffer* response);

[LibraryImport(LibraryName, EntryPoint = "audio_stream_stop")]
[UnmanagedCallConv(CallConvs = new[] { typeof(System.Runtime.CompilerServices.CallConvCdecl) })]
private static unsafe partial void CoreAudioStreamStop(
    RequestBuffer* request,
    ResponseBuffer* response);
```

`@@ -331,4 +356,94 @@` — new members appended after `ExecuteCommandWithCallbackAsync`, before the class's closing brace:

```csharp
/// <summary>
/// Marshal a ResponseBuffer from unmanaged memory into a managed Response and free the unmanaged memory.
/// </summary>
private Response MarshalResponse(ResponseBuffer response)
{
    Response result = new();

    if (response.Data != IntPtr.Zero && response.DataLength > 0)
    {
        byte[] managedResponse = new byte[response.DataLength];
        Marshal.Copy(response.Data, managedResponse, 0, response.DataLength);
        result.Data = System.Text.Encoding.UTF8.GetString(managedResponse);
    }

    if (response.Error != IntPtr.Zero && response.ErrorLength > 0)
    {
        result.Error = Marshal.PtrToStringUTF8(response.Error, response.ErrorLength)!;
    }

    Marshal.FreeHGlobal(response.Data);
    Marshal.FreeHGlobal(response.Error);

    return result;
}

// --- Audio streaming managed implementations ---
// Route through the existing execute_command / execute_command_with_binary entry points.
// The Core handles audio_stream_start / audio_stream_stop as command cases in ExecuteCommandManaged,
// and audio_stream_push as a command case in ExecuteCommandWithBinaryManaged.

public Response StartAudioStream(CoreInteropRequest request)
{
    return ExecuteCommand("audio_stream_start", request);
}

public Response PushAudioData(CoreInteropRequest request, ReadOnlyMemory<byte> audioData)
{
    try
    {
        var commandInputJson = request.ToJson();
        byte[] commandBytes = System.Text.Encoding.UTF8.GetBytes("audio_stream_push");
        byte[] inputBytes = System.Text.Encoding.UTF8.GetBytes(commandInputJson);

        IntPtr commandPtr = Marshal.AllocHGlobal(commandBytes.Length);
        Marshal.Copy(commandBytes, 0, commandPtr, commandBytes.Length);

        IntPtr inputPtr = Marshal.AllocHGlobal(inputBytes.Length);
        Marshal.Copy(inputBytes, 0, inputPtr, inputBytes.Length);

        // Pin the managed audio data so GC won't move it during the native call
        using var audioHandle = audioData.Pin();

        unsafe
        {
            var reqBuf = new StreamingRequestBuffer
            {
                Command = commandPtr,
                CommandLength = commandBytes.Length,
                Data = inputPtr,
                DataLength = inputBytes.Length,
                BinaryData = (nint)audioHandle.Pointer,
                BinaryDataLength = audioData.Length
            };

            ResponseBuffer response = default;

            try
            {
                CoreExecuteCommandWithBinary(&reqBuf, &response);
            }
            finally
            {
                Marshal.FreeHGlobal(commandPtr);
                Marshal.FreeHGlobal(inputPtr);
            }

            return MarshalResponse(response);
        }
    }
    catch (Exception ex) when (ex is not OperationCanceledException)
    {
        throw new FoundryLocalException("Error executing audio_stream_push", ex, _logger);
    }
}

public Response StopAudioStream(CoreInteropRequest request)
{
    return ExecuteCommand("audio_stream_stop", request);
}
```
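The "cancellation-safe stop" highlight — native stop always runs, even when the caller's token has fired — boils down to a try/finally around the drain. A minimal, self-contained sketch of the pattern, where `StopNative` is a hypothetical stand-in for `StopAudioStream(...)`:

```csharp
// Self-contained sketch: the finally block guarantees the native stop call runs
// even when the caller's CancellationToken has already been cancelled,
// so a cancelled StopAsync cannot leak a native streaming session.
bool nativeStopped = false;
void StopNative() => nativeStopped = true; // stand-in for interop.StopAudioStream(...)

async Task StopAsync(CancellationToken ct)
{
    try
    {
        ct.ThrowIfCancellationRequested(); // caller already cancelled
        await Task.Delay(10, ct);          // drain remaining audio (skipped here)
    }
    finally
    {
        StopNative();                      // always release the native session
    }
}

using var cts = new CancellationTokenSource();
cts.Cancel();
try { await StopAsync(cts.Token); }
catch (OperationCanceledException) { }

Console.WriteLine($"nativeStopped = {nativeStopped}"); // prints: nativeStopped = True
```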
