PocketTTS 2.0 #390
Merged
pocket-tts 2.0 switches to the language= / config= API and ships
HF-hosted weights per model. torchao is required for int8 quantization;
without it, the torch.ao fallback wraps attention in_proj with a bound
method instead of a tensor and breaks voice cloning
('function' object has no attribute 'device').
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- PocketTTSSettings: replace custom_model_path with model (language id
or custom .yaml) and add quantize toggle.
- SettingsConfig: add spoken_language (default 'multilingual').
- defaults.yaml: replace 'Mirror the user's language' with the
{language_instruction} template placeholder; switch default
tts_provider to pocket_tts.
- settings.yaml: default model 'english_2026-04', quantize true.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- load_model() switches to TTSModel.load_model(language=/config=),
with quantize=True honored in both paths. Legacy English IDs
('english', 'english_2026-01') are silently aliased to
'english_2026-04' at runtime.
- BUILTIN_MODELS list with label==id so the model string is visible
in the UI and matches the generated safetensors cache filename.
- get_state_for_audio_prompt calls now pass truncate=True — v2
destabilizes on long prompts and produces near-instant EOS.
- Voice cache: LRU-bounded OrderedDict (32 entries) so long-running
sessions don't leak model-state tensors.
- Per-model safetensors tagging: cloned voice states are persisted
as <stem>.<model_id>.safetensors next to the source audio, so
switching model families doesn't destroy the previous cache.
Listing dedupes by stem so the same voice isn't shown twice.
- Pre-splash aware: defer_load flag + deferred_init() to let
wingman_core drive load during the splash-screen flow.
- Thread-safety: two-layer locking. _async_gen_lock serializes
coroutines; _model_swap_lock blocks the settings-reload daemon
thread while any executor or audio-callback thread is mid-
generation (TTSModel.generate_audio is not thread-safe per v2 docs).
- frames_after_eos defaults to None so pocket-tts auto-picks
1-3 trailing frames based on text length instead of our old 3.
- preload_voice_states() warms the cache for a set of voice IDs
with a progress callback; skips builtins (pocket-tts resolves
them from HF cache lazily anyway).
- on_model_reloaded callback fires after successful load so
wingman_core can schedule a preload for the new model.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
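A minimal sketch of the LRU-bounded voice cache described in the commit above, assuming the cache maps a voice ID to an opaque state object. The class name `VoiceStateCache` and its methods are hypothetical; only the `OrderedDict` backing and the 32-entry bound come from the commit message.

```python
from collections import OrderedDict


class VoiceStateCache:
    """LRU-bounded mapping from voice ID to a cached voice state (hypothetical name)."""

    def __init__(self, max_entries: int = 32):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, voice_id):
        """Return the cached state and mark it most recently used, or None on a miss."""
        if voice_id not in self._entries:
            return None
        self._entries.move_to_end(voice_id)
        return self._entries[voice_id]

    def put(self, voice_id, state):
        """Store a state, evicting the least recently used entries beyond the bound."""
        self._entries[voice_id] = state
        self._entries.move_to_end(voice_id)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)

    def __len__(self):
        return len(self._entries)
```

Evicting on `put` keeps memory bounded regardless of how many voices a long-running session touches; a later lookup of an evicted voice simply recomputes or reloads its state.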
services/voice_preprocessing.py: pure-Python pipeline (soundfile + scipy,
no ffmpeg) that takes arbitrary wav/mp3/flac/ogg/aiff uploads and
produces the format PocketTTS clones best from:
- mono downmix
- resample to 24 kHz (model sample rate)
- RMS-gated silence trim on both ends
- peak-normalize to -1 dBFS
- cap to 20 s (v2 destabilizes on long prompts)
- write as int16 mono WAV
Raises ValueError with a user-friendly message when the clip is too
short after trimming. supported_input_extensions() exposes the decoder
set for the endpoint layer.
services/file.py: add get_pocket_tts_models_dir() helper. Custom YAML
model configs live there and persist across app updates, matching the
pattern used by custom_voices.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
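The trim, cap, and normalize steps of the pipeline above can be sketched as follows. This is not the module's code: it works on plain Python lists of mono float samples instead of soundfile/scipy arrays, skips decoding, resampling, and the int16 write, and the gate threshold, frame size, and minimum length are assumptions that the commit message does not state.

```python
import math

SAMPLE_RATE = 24_000   # model sample rate stated in the commit
MAX_SECONDS = 20       # cap stated in the commit
PEAK_DBFS = -1.0       # normalization target stated in the commit
MIN_SECONDS = 1.0      # assumption: the real minimum length is not given
RMS_GATE = 0.01        # assumption: the real gate threshold is not given
FRAME = 240            # assumption: 10 ms analysis frames at 24 kHz


def _rms(frame):
    """Root-mean-square level of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def trim_silence(samples):
    """Drop leading and trailing frames whose RMS is below the gate."""
    frames = [samples[i:i + FRAME] for i in range(0, len(samples), FRAME)]
    loud = [i for i, frame in enumerate(frames) if _rms(frame) >= RMS_GATE]
    if not loud:
        return []
    return samples[loud[0] * FRAME:(loud[-1] + 1) * FRAME]


def clean_clip(samples):
    """Trim, cap, and peak-normalize mono float samples in [-1, 1]."""
    trimmed = trim_silence(samples)
    if len(trimmed) < MIN_SECONDS * SAMPLE_RATE:
        raise ValueError("Clip is too short after trimming silence.")
    capped = trimmed[:MAX_SECONDS * SAMPLE_RATE]
    peak = max(abs(s) for s in capped)
    gain = 10 ** (PEAK_DBFS / 20) / peak
    return [s * gain for s in capped]
```

Trimming before the length check means a long upload that is mostly silence is rejected instead of being cloned from a fraction of a second of speech.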
Migration changes:
- pocket_tts: translate language+high_quality to the new model string,
upgrade legacy English IDs ('english', 'english_2026-01') to the
pinned 'english_2026-04', drop custom_model_path, default
quantize=true.
- settings: add spoken_language='multilingual' for upgrading users.
- features.tts_provider: wingman_pro → pocket_tts for default and
per-wingman configs. Existing Wingman Pro voice settings are left
in place so users can flip back.
- prompts.system_prompt: reset to the shipped default so everyone
picks up the new {language_instruction} placeholder. Only applied
when the wingman/defaults config actually has a system_prompt set
(no new override injected).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
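A rough sketch of the pocket_tts part of the migration above, assuming the settings are a plain dict. The function name is hypothetical, and the language+high_quality translation is left out because the commit message does not spell out that mapping; the sketch covers only the legacy-ID upgrade, the dropped key, and the quantize default.

```python
LEGACY_ENGLISH_IDS = {"english", "english_2026-01"}
PINNED_ENGLISH_ID = "english_2026-04"


def migrate_pocket_tts(old: dict) -> dict:
    """Return a migrated copy of a pocket_tts settings dict (hypothetical helper)."""
    new = dict(old)
    # custom_model_path is replaced by the model field in 2.0.
    new.pop("custom_model_path", None)
    # Assumption: a missing model falls back to the pinned English ID.
    model = new.get("model", PINNED_ENGLISH_ID)
    if model in LEGACY_ENGLISH_IDS:
        model = PINNED_ENGLISH_ID
    new["model"] = model
    # Default quantize to true without overriding an explicit choice.
    new.setdefault("quantize", True)
    return new
```

Working on a copy keeps the original settings available if the migration has to be aborted partway through.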
services/settings_service.py:
- SPOKEN_TO_POCKET_TTS maps each supported language to its canonical
pocket-tts model ID (en→english_2026-04, de→german, fr→french_24l
because French has no 6L variant, etc.).
- On spoken_language change in save_settings(), cascade to:
- pocket_tts.model (triggers PocketTTS reload)
- voice_activation.fasterwhisper_config.language (None for
'multilingual' to let FasterWhisper auto-detect)
Parakeet has no language parameter and auto-detects.
services/context_builder.py:
Resolve the {language_instruction} placeholder just before formatting
the system prompt. Multilingual → 'Respond in whatever language the
user speaks to you'. Specific language → 'Always respond in <name>'.
prompts/tts-test-praise.md: deleted — obsolete TTS test prompt that's
no longer referenced after the OpenAI-compatible TTS test flow was
reworked.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
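The cascade and the placeholder resolution described in this commit can be sketched as below. Only the three map entries and the two instruction strings come from the commit message; the function names, the `LANGUAGE_NAMES` table, and the use of `str.replace` for substitution are assumptions made for illustration.

```python
# Entries named in the commit message; the real map covers more languages.
SPOKEN_TO_POCKET_TTS = {
    "en": "english_2026-04",
    "de": "german",
    "fr": "french_24l",  # French has no 6L variant
}

# Hypothetical display names used to fill 'Always respond in <name>'.
LANGUAGE_NAMES = {"en": "English", "de": "German", "fr": "French"}


def language_instruction(spoken_language: str) -> str:
    """Text substituted for the {language_instruction} placeholder."""
    if spoken_language == "multilingual":
        return "Respond in whatever language the user speaks to you"
    return f"Always respond in {LANGUAGE_NAMES[spoken_language]}"


def fasterwhisper_language(spoken_language: str):
    """None lets FasterWhisper auto-detect; otherwise pass the language through."""
    return None if spoken_language == "multilingual" else spoken_language


def resolve_system_prompt(template: str, spoken_language: str) -> str:
    """Fill the placeholder just before the prompt is formatted."""
    return template.replace(
        "{language_instruction}", language_instruction(spoken_language)
    )
```

For example, `resolve_system_prompt("Be brief. {language_instruction}.", "de")` yields a prompt ending in "Always respond in German."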
- Capture main event loop at startup() so background threads (model
reload) can schedule coroutines.
- Step 4 of startup() now drives pocket_tts.deferred_init() via
run_in_executor and broadcasts 'Downloading TTS model...' during the
load. PocketTTS is instantiated with defer_load=True.
- on_model_reloaded callback: after a settings-triggered model reload,
schedule _preload_pocket_tts_voices() on the main loop via
run_coroutine_threadsafe and log any background failure via
add_done_callback so errors are surfaced.
- _preload_pocket_tts_voices(): iterates unique voices in the current
tower (or a caller-supplied list), broadcasts LOADING_CONFIG with
per-voice progress so the existing header bar indicator shows the work.
Optional restore_ready_state=True flips back to READY when done; used
for per-voice preloads triggered from the config UI.
- Called from initialize_tower() so switching configs frontloads the
first-clone latency instead of paying it on the first TTS utterance.
New endpoints:
- POST /pocket_tts/preload_voice?voice=<id>: preload a single voice on
demand (client calls this when user picks a voice); header bar reflects
progress and restores READY on completion.
- POST /voices/preprocess: multipart upload, cleans the audio with
services.voice_preprocessing, writes the result to custom_voices/ as
<name>.wav, wipes any stale per-model safetensors cache for that stem,
and invalidates just the matching voice_cache entries.
Voice-name validation (_validate_voice_stem): rejects empty /
path-separator / dot-traversal / control char / Windows reserved (CON,
PRN, AUX, NUL, COM1-9, LPT1-9) / >80 char names, and stems that shadow a
known model tag (would collide with <voice>.<model_id>.safetensors).
Post-join commonpath assertion guarantees dst_path lives inside
custom_voices_dir.
Upload uses tempfile.mkstemp for a unique temp path (PID-based naming
collided across concurrent requests).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
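The voice-name checks and the post-join containment check listed above could look roughly like this. It is a sketch, not the PR's _validate_voice_stem: the function names and error messages are hypothetical, and treating "shadows a known model tag" as "equals the tag or ends in .<tag>" is one reading of the commit message.

```python
import os

WINDOWS_RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)
MAX_STEM_LENGTH = 80


def validate_voice_stem(name, known_model_tags=()):
    """Return the cleaned stem or raise ValueError (hypothetical helper)."""
    stem = name.strip()
    if not stem:
        raise ValueError("Voice name must not be empty.")
    if len(stem) > MAX_STEM_LENGTH:
        raise ValueError("Voice name is longer than 80 characters.")
    if "/" in stem or "\\" in stem:
        raise ValueError("Voice name must not contain path separators.")
    if ".." in stem or stem == ".":
        raise ValueError("Voice name must not contain dot traversal.")
    if any(ord(ch) < 32 or ord(ch) == 127 for ch in stem):
        raise ValueError("Voice name must not contain control characters.")
    if stem.upper() in WINDOWS_RESERVED:
        raise ValueError("Voice name is a reserved Windows device name.")
    # Assumption: a stem shadows a tag if it equals it or ends in ".<tag>".
    for tag in known_model_tags:
        if stem == tag or stem.endswith("." + tag):
            raise ValueError("Voice name collides with a model tag.")
    return stem


def safe_destination(custom_voices_dir, stem):
    """Join the stem onto the voices dir and confirm the result stays inside it."""
    base = os.path.abspath(custom_voices_dir)
    dst = os.path.abspath(os.path.join(base, stem + ".wav"))
    if os.path.commonpath([base, dst]) != base:
        raise ValueError("Destination escapes the custom voices directory.")
    return dst
```

The commonpath check is deliberately redundant with the stem validation, so a gap in one layer does not by itself allow a write outside the voices directory.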
v2 resolves built-in voices ('alba', 'cosette', ...) from HuggingFace
on first use, so the old bundled pocket-tts-models/embeddings/ and
pocket-tts-voices/ directories are obsolete. Removed their lookup code:
- POCKET_TTS_VOICES_DIR / INCLUDED_VOICES_DIR constants
- _get_pocket_tts_included_voices_dir(), _get_wingman_included_voices_dir()
- _get_app_dir() (no remaining callers)
- self.wingman_included_voices_dir attribute + its directory scan
in get_available_voices
- sys import (was only used by the _MEIPASS lookup in _get_app_dir)
_resolve_voice_path now only checks the custom_voices directory; bare
predefined names pass straight through to pocket-tts / HF.
WingmanAiCore.spec: collect_all('torchao') so PyInstaller picks up
torchao.dtypes / torchao.kernel / torchao.quantization submodules
that load dynamically. Without this the bundled binary can import
torchao but fail to initialize the modern int8 backend, falling back
to the broken torch.ao path at runtime.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously _initialize_fasterwhisper just called WhisperModel(model_size),
which silently downloaded the ~1.5 GB CTranslate2 model via faster_whisper's
internal tqdm — no feedback to the UI. Users would sit on
'Initializing speech-to-text (FasterWhisper)...' for minutes with no sign
of progress.
Now mirrors the Parakeet flow:
- resolve_faster_whisper_repo() maps a model_size ('distil-large-v3',
'large-v3', etc.) to its HuggingFace repo ID via FASTER_WHISPER_REPO_MAP
(a static mirror of faster_whisper.utils._MODELS). Custom repo paths
like 'Systran/faster-whisper-large-v3' are passed through.
- _initialize_fasterwhisper() downloads via ModelDownloader.download_huggingface
into <models>/faster-whisper/<model_size>/ with the standard
whisper allow_patterns before handing off to WhisperModel. The
existing FasterWhisper.load() lookup picks up the local dir.
- on_status callbacks broadcast 'Downloading STT model (FasterWhisper: ...)'
and 'Initializing speech-to-text...' just like Parakeet.
Also plumbs on_status through switch_provider so runtime settings changes
(model_size / device / compute_type tweaks, provider flips) surface status
too, not just initial startup. WingmanCore injects two callbacks into
SettingsService — stt_status_callback to emit LOADING_CONFIG updates and
stt_done_callback to flip back to READY in a finally — so save_settings
can drive the shared header-bar indicator without needing a direct
reference to set_core_state.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
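A small sketch of the repo resolution described in this commit. The map below lists only two entries, which should be checked against faster_whisper.utils._MODELS before being relied on, and the rule "anything containing a slash is a custom repo path" is an assumption about how pass-through is detected.

```python
# Partial, illustrative mirror of faster_whisper.utils._MODELS.
FASTER_WHISPER_REPO_MAP = {
    "large-v3": "Systran/faster-whisper-large-v3",
    "distil-large-v3": "Systran/faster-distil-whisper-large-v3",
}


def resolve_faster_whisper_repo(model_size: str) -> str:
    """Map a model size to a HuggingFace repo ID, passing custom repos through."""
    # Assumption: a slash marks an explicit '<org>/<repo>' path.
    if "/" in model_size:
        return model_size
    try:
        return FASTER_WHISPER_REPO_MAP[model_size]
    except KeyError:
        raise ValueError(f"Unknown FasterWhisper model size: {model_size!r}")
```

Resolving the repo ID up front is what lets the download run through the app's own downloader, with status callbacks, before WhisperModel is constructed.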
Adds PocketTTSPreloadResult Pydantic model so the OpenAPI client
generates a typed method instead of an untyped dict return.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Upgrades the local TTS stack to PocketTTS 2.0 (multilingual + model selection/quantization) and adds supporting runtime APIs (model status, voice preloading/precompute, voice preprocessing), while also improving STT model download/init UX and introducing a global spoken_language setting that feeds into prompt formatting and STT/TTS defaults.
Changes:
- Migrate PocketTTS integration to v2.0 with model IDs, quantization (torchao), LRU voice-state caching, and background (pre)compute/preload flows.
- Add new core API routes for PocketTTS status/models, voice preloading/precompute, and an upload-based voice preprocessing pipeline.
- Add spoken_language setting + migration and inject {language_instruction} into the default system prompt; set PocketTTS as the default TTS provider.
Reviewed changes
Copilot reviewed 15 out of 16 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| WingmanAiCore.spec | Includes torchao artifacts in PyInstaller build to support PocketTTS int8 quantization. |
| wingman_core.py | Adds PocketTTS status/models/preload/precompute/preprocess endpoints; defers PocketTTS init; warms voice cache on tower init. |
| templates/configs/settings.yaml | Updates PocketTTS defaults (model + quantize) and adds spoken_language. |
| templates/configs/defaults.yaml | Injects {language_instruction} and switches default tts_provider to pocket_tts. |
| services/voice_preprocessing.py | New audio preprocessing utility used for voice cloning uploads. |
| services/stt_provider_manager.py | Downloads FasterWhisper models via ModelDownloader with UI status callbacks; status propagation on provider switches. |
| services/settings_service.py | Adds spoken_language cascade logic and status callbacks for runtime STT provider switching. |
| services/migrations/migration_311_to_312.py | Migrates PocketTTS settings to new model/quantize fields, adds spoken_language, resets system prompt, switches default TTS provider. |
| services/file.py | Adds get_pocket_tts_models_dir() helper. |
| services/context_builder.py | Provides {language_instruction} substitution based on spoken_language. |
| services/audio_player.py | Best-effort saving of generated preview audio to disk for later use. |
| requirements.txt | Bumps pocket-tts to 2.0.0 and adds torchao. |
| providers/pocket_tts.py | Major refactor for PocketTTS v2: model IDs, custom model discovery, per-model cache tagging, precompute/preload/status, and concurrency locks. |
| providers/parakeet.py | Adds language cascade behavior (wingman override → global setting). |
| api/interface.py | Updates PocketTTS settings schema and adds spoken_language + preload response model. |
Comments suppressed due to low confidence (1)
providers/pocket_tts.py:286
unload_model() mutates self.model and clears voice_cache without acquiring _model_swap_lock. Because synthesis runs in a worker thread under _model_swap_lock, a concurrent call to unload_model() (e.g., via the stop endpoint or a settings change) can delete/NULL the model mid-generation and crash. Make load_model()/unload_model() take _model_swap_lock internally (or provide a single public ‘swap’ method that always holds the lock) so all entrypoints are safe.
    def unload_model(self):
        """Unload the model to free resources."""
        if self.model:
            del self.model
            self.model = None
            # Explicitly clear CUDA cache if using GPU to free GPU memory
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
        self.voice_cache.clear()
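One way to act on the suppressed comment is to have unload take the same lock that generation holds. The sketch below is a stand-alone illustration, not the provider's code: the class name is hypothetical, the model is any callable, and the CUDA cache handling is omitted so the example runs without torch.

```python
import threading


class LockedModelHolder:
    """Toy holder showing unload and generate sharing one swap lock."""

    def __init__(self, model=None):
        self.model = model
        self.voice_cache = {}
        self._model_swap_lock = threading.Lock()

    def generate(self, text):
        # Generation holds the lock for its whole duration.
        with self._model_swap_lock:
            if self.model is None:
                raise RuntimeError("Model is not loaded.")
            return self.model(text)

    def unload_model(self):
        # Unload waits for any in-flight generation before dropping the model.
        with self._model_swap_lock:
            self.model = None
            self.voice_cache.clear()
```

With a plain `threading.Lock`, `unload_model()` must never be called from code that already holds the lock; if that can happen, a re-entrant lock or a single public swap method that owns the lock is the safer shape.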
Shackless added a commit that referenced this pull request Apr 23, 2026
fixes #377 #389