PocketTTS 2.0 by Shackless · Pull Request #390 · ShipBit/wingman-ai

Shackless · 2026-04-23T19:39:51Z

pocket-tts 2.0 switches to the language= / config= API and ships HF-hosted weights per model. torchao is required for int8 quantization; without it, the torch.ao fallback wraps attention in_proj with a bound method instead of a tensor and breaks voice cloning ('function' object has no attribute 'device').

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…older
- PocketTTSSettings: replace custom_model_path with model (language id or custom .yaml) and add quantize toggle.
- SettingsConfig: add spoken_language (default 'multilingual').
- defaults.yaml: replace 'Mirror the user's language' with the {language_instruction} template placeholder; switch default tts_provider to pocket_tts.
- settings.yaml: default model 'english_2026-04', quantize true.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- load_model() switches to TTSModel.load_model(language=/config=), with quantize=True honored in both paths. Legacy English IDs ('english', 'english_2026-01') are silently aliased to 'english_2026-04' at runtime.
- BUILTIN_MODELS list with label==id so the model string is visible in the UI and matches the generated safetensors cache filename.
- get_state_for_audio_prompt calls now pass truncate=True, because v2 destabilizes on long prompts and produces near-instant EOS.
- Voice cache: LRU-bounded OrderedDict (32 entries) so long-running sessions don't leak model-state tensors.
- Per-model safetensors tagging: cloned voice states are persisted as <stem>.<model_id>.safetensors next to the source audio, so switching model families doesn't destroy the previous cache. Listing dedupes by stem so the same voice isn't shown twice.
- Pre-splash aware: defer_load flag + deferred_init() to let wingman_core drive load during the splash-screen flow.
- Thread-safety: two-layer locking. _async_gen_lock serializes coroutines; _model_swap_lock blocks the settings-reload daemon thread while any executor or audio-callback thread is mid-generation (TTSModel.generate_audio is not thread-safe per v2 docs).
- frames_after_eos defaults to None so pocket-tts auto-picks 1-3 trailing frames based on text length instead of our old 3.
- preload_voice_states() warms the cache for a set of voice IDs with a progress callback; skips builtins (pocket-tts resolves them from HF cache lazily anyway).
- on_model_reloaded callback fires after successful load so wingman_core can schedule a preload for the new model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
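The LRU-bounded voice cache described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the class name `LRUVoiceCache`, the `get_or_compute` method, and the `(voice_id, model_id)` key shape are assumptions made for the example; only the `OrderedDict` approach and the 32-entry bound come from the commit message.

```python
from collections import OrderedDict


class LRUVoiceCache:
    """Bounded cache that evicts the least recently used voice state.

    Hypothetical helper: the names here are illustrative and do not come
    from providers/pocket_tts.py.
    """

    def __init__(self, max_entries=32):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def get_or_compute(self, key, compute):
        """Return the cached state for `key`, computing it on a miss."""
        if key in self._entries:
            # A hit makes the entry the most recently used one.
            self._entries.move_to_end(key)
            return self._entries[key]
        state = compute()
        self._entries[key] = state
        if len(self._entries) > self.max_entries:
            # popitem(last=False) removes the oldest (least recently used) entry.
            self._entries.popitem(last=False)
        return state

    def __contains__(self, key):
        return key in self._entries

    def __len__(self):
        return len(self._entries)
```

Keying on something like `("alba", "english_2026-04")` would keep states from different model families apart, in the same spirit as the per-model safetensors tagging.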

services/voice_preprocessing.py: pure-Python pipeline (soundfile + scipy, no ffmpeg) that takes arbitrary wav/mp3/flac/ogg/aiff uploads and produces the format PocketTTS clones best from:
- mono downmix
- resample to 24 kHz (model sample rate)
- RMS-gated silence trim on both ends
- peak-normalize to -1 dBFS
- cap to 20 s (v2 destabilizes on long prompts)
- write as int16 mono WAV

Raises ValueError with a user-friendly message when the clip is too short after trimming. supported_input_extensions() exposes the decoder set for the endpoint layer.

services/file.py: add get_pocket_tts_models_dir() helper. Custom YAML model configs live there and persist across app updates, matching the pattern used by custom_voices.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
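To make the trim, normalize, and cap steps concrete, here is a dependency-free sketch that operates on a list of mono float samples already at 24 kHz. It deliberately leaves out decoding, downmixing, and resampling (which the actual module delegates to soundfile and scipy). The frame size, RMS threshold, and minimum clip length are assumptions chosen for the example and are not stated in the PR.

```python
import math

SAMPLE_RATE = 24_000        # model sample rate, per the commit message
MAX_SECONDS = 20.0          # cap, per the commit message
TARGET_PEAK_DBFS = -1.0     # peak target, per the commit message


def trim_silence(samples, frame=480, threshold=0.01):
    """Drop leading and trailing frames whose RMS is below `threshold`.

    `frame` and `threshold` are assumed values for illustration.
    """
    def rms(chunk):
        return math.sqrt(sum(s * s for s in chunk) / len(chunk))

    frames = [samples[i:i + frame] for i in range(0, len(samples), frame)]
    loud = [i for i, chunk in enumerate(frames) if rms(chunk) >= threshold]
    if not loud:
        return []
    start = loud[0] * frame
    end = min(len(samples), (loud[-1] + 1) * frame)
    return samples[start:end]


def peak_normalize(samples, peak_dbfs=TARGET_PEAK_DBFS):
    """Scale so the largest absolute sample sits at `peak_dbfs`."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    gain = 10 ** (peak_dbfs / 20) / peak
    return [s * gain for s in samples]


def prepare_clip(samples, min_seconds=1.0):
    """Trim, cap, normalize, and convert to int16 PCM values.

    `min_seconds` is a hypothetical threshold; the PR only says that clips
    that are too short after trimming raise ValueError.
    """
    trimmed = trim_silence(samples)
    if len(trimmed) < min_seconds * SAMPLE_RATE:
        raise ValueError("The clip is too short after trimming silence.")
    capped = trimmed[: int(MAX_SECONDS * SAMPLE_RATE)]
    normalized = peak_normalize(capped)
    return [max(-32768, min(32767, round(s * 32767))) for s in normalized]
```

Trimming before capping means the 20 s budget is spent on speech rather than on leading silence.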

Migration changes:
- pocket_tts: translate language+high_quality to the new model string, upgrade legacy English IDs ('english', 'english_2026-01') to the pinned 'english_2026-04', drop custom_model_path, default quantize=true.
- settings: add spoken_language='multilingual' for upgrading users.
- features.tts_provider: wingman_pro → pocket_tts for default and per-wingman configs. Existing Wingman Pro voice settings are left in place so users can flip back.
- prompts.system_prompt: reset to the shipped default so everyone picks up the new {language_instruction} placeholder. Only applied when the wingman/defaults config actually has a system_prompt set (no new override injected).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
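A rough sketch of the dictionary-level part of such a migration is shown below. It assumes a single flat settings dict, which is a simplification: the real migration touches separate settings and wingman/defaults configs. The function name `migrate_settings` is hypothetical, and the language+high_quality translation and the system prompt reset are left out because the PR text does not spell out their mapping.

```python
import copy

LEGACY_ENGLISH_IDS = {"english", "english_2026-01"}
PINNED_ENGLISH_ID = "english_2026-04"


def migrate_settings(settings):
    """Return a migrated copy of `settings` without mutating the input.

    Assumed layout: {"pocket_tts": {...}, "features": {...}, ...}.
    """
    migrated = copy.deepcopy(settings)

    pocket = migrated.setdefault("pocket_tts", {})
    if pocket.get("model") in LEGACY_ENGLISH_IDS:
        pocket["model"] = PINNED_ENGLISH_ID
    pocket.pop("custom_model_path", None)   # field no longer exists in v2
    pocket.setdefault("quantize", True)     # new toggle defaults to true

    migrated.setdefault("spoken_language", "multilingual")

    features = migrated.get("features", {})
    if features.get("tts_provider") == "wingman_pro":
        # Only the provider switch; existing Wingman Pro voice settings stay.
        features["tts_provider"] = "pocket_tts"

    return migrated
```

Using `setdefault` for the new fields keeps the step idempotent, so running it twice yields the same result.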

services/settings_service.py:
- SPOKEN_TO_POCKET_TTS maps each supported language to its canonical pocket-tts model ID (en→english_2026-04, de→german, fr→french_24l because French has no 6L variant, etc.).
- On spoken_language change in save_settings(), cascade to:
  - pocket_tts.model (triggers PocketTTS reload)
  - voice_activation.fasterwhisper_config.language (None for 'multilingual' to let FasterWhisper auto-detect)

  Parakeet has no language parameter and auto-detects.

services/context_builder.py: Resolve the {language_instruction} placeholder just before formatting the system prompt. Multilingual → 'Respond in whatever language the user speaks to you'. Specific language → 'Always respond in <name>'.

prompts/tts-test-praise.md: deleted. It was an obsolete TTS test prompt that is no longer referenced after the OpenAI-compatible TTS test flow was reworked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
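The cascade and the placeholder resolution might look roughly like the sketch below. The three map entries and the two instruction strings come from the commit message; everything else is assumed for illustration, including the function names, the `LANGUAGE_NAMES` table, the use of `str.replace`, and returning `None` for the model when no mapping is known (the PR does not say which model 'multilingual' selects).

```python
# Entries taken from the commit message; the real map covers more languages.
SPOKEN_TO_POCKET_TTS = {
    "en": "english_2026-04",
    "de": "german",
    "fr": "french_24l",
}

# Hypothetical display names used to build the instruction text.
LANGUAGE_NAMES = {"en": "English", "de": "German", "fr": "French"}


def cascade_spoken_language(spoken_language):
    """Derive the dependent TTS and STT values for a spoken language."""
    return {
        # None here means "leave the current model alone" (an assumption).
        "pocket_tts_model": SPOKEN_TO_POCKET_TTS.get(spoken_language),
        # None lets FasterWhisper auto-detect the language.
        "fasterwhisper_language": (
            None if spoken_language == "multilingual" else spoken_language
        ),
    }


def build_language_instruction(spoken_language):
    """Return the sentence substituted for {language_instruction}."""
    if spoken_language == "multilingual":
        return "Respond in whatever language the user speaks to you"
    return f"Always respond in {LANGUAGE_NAMES[spoken_language]}"


def resolve_prompt(template, spoken_language):
    """Fill only the {language_instruction} placeholder in a prompt."""
    return template.replace(
        "{language_instruction}", build_language_instruction(spoken_language)
    )
```

`str.replace` is used here so that other brace-delimited placeholders in the prompt survive untouched until the later formatting step.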

- Capture main event loop at startup() so background threads (model reload) can schedule coroutines.
- Step 4 of startup() now drives pocket_tts.deferred_init() via run_in_executor and broadcasts 'Downloading TTS model...' during the load. PocketTTS is instantiated with defer_load=True.
- on_model_reloaded callback: after a settings-triggered model reload, schedule _preload_pocket_tts_voices() on the main loop via run_coroutine_threadsafe and log any background failure via add_done_callback so errors are surfaced.
- _preload_pocket_tts_voices(): iterates unique voices in the current tower (or a caller-supplied list), broadcasts LOADING_CONFIG with per-voice progress so the existing header bar indicator shows the work. Optional restore_ready_state=True flips back to READY when done; it is used for per-voice preloads triggered from the config UI.
- Called from initialize_tower() so switching configs frontloads the first-clone latency instead of paying it on the first TTS utterance.

New endpoints:
- POST /pocket_tts/preload_voice?voice=<id>: preload a single voice on demand (client calls this when user picks a voice); header bar reflects progress and restores READY on completion.
- POST /voices/preprocess: multipart upload, cleans the audio with services.voice_preprocessing, writes the result to custom_voices/ as <name>.wav, wipes any stale per-model safetensors cache for that stem, and invalidates just the matching voice_cache entries.

Voice-name validation (_validate_voice_stem): rejects empty / path-separator / dot-traversal / control char / Windows reserved (CON, PRN, AUX, NUL, COM1-9, LPT1-9) / >80 char names, and stems that shadow a known model tag (would collide with <voice>.<model_id>.safetensors). Post-join commonpath assertion guarantees dst_path lives inside custom_voices_dir.

Upload uses tempfile.mkstemp for a unique temp path (PID-based naming collided across concurrent requests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

v2 resolves built-in voices ('alba', 'cosette', ...) from HuggingFace on first use, so the old bundled pocket-tts-models/embeddings/ and pocket-tts-voices/ directories are obsolete. Removed their lookup code:
- POCKET_TTS_VOICES_DIR / INCLUDED_VOICES_DIR constants
- _get_pocket_tts_included_voices_dir(), _get_wingman_included_voices_dir()
- _get_app_dir() (no remaining callers)
- self.wingman_included_voices_dir attribute + its directory scan in get_available_voices
- sys import (was only used by the _MEIPASS lookup in _get_app_dir)

_resolve_voice_path now only checks the custom_voices directory; bare predefined names pass straight through to pocket-tts / HF.

WingmanAiCore.spec: collect_all('torchao') so PyInstaller picks up torchao.dtypes / torchao.kernel / torchao.quantization submodules that load dynamically. Without this the bundled binary can import torchao but fail to initialize the modern int8 backend, falling back to the broken torch.ao path at runtime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previously _initialize_fasterwhisper just called WhisperModel(model_size), which silently downloaded the ~1.5 GB CTranslate2 model via faster_whisper's internal tqdm, with no feedback to the UI. Users would sit on 'Initializing speech-to-text (FasterWhisper)...' for minutes with no sign of progress.

Now mirrors the Parakeet flow:
- resolve_faster_whisper_repo() maps a model_size ('distil-large-v3', 'large-v3', etc.) to its HuggingFace repo ID via FASTER_WHISPER_REPO_MAP (a static mirror of faster_whisper.utils._MODELS). Custom repo paths like 'Systran/faster-whisper-large-v3' are passed through.
- _initialize_fasterwhisper() downloads via ModelDownloader.download_huggingface into <models>/faster-whisper/<model_size>/ with the standard whisper allow_patterns before handing off to WhisperModel. The existing FasterWhisper.load() lookup picks up the local dir.
- on_status callbacks broadcast 'Downloading STT model (FasterWhisper: ...)' and 'Initializing speech-to-text...' just like Parakeet.

Also plumbs on_status through switch_provider so runtime settings changes (model_size / device / compute_type tweaks, provider flips) surface status too, not just initial startup. WingmanCore injects two callbacks into SettingsService (stt_status_callback to emit LOADING_CONFIG updates and stt_done_callback to flip back to READY in a finally) so save_settings can drive the shared header-bar indicator without needing a direct reference to set_core_state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
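A resolver along the lines described here might look like the sketch below. It is not the PR's implementation. Only two map entries are shown, and they should be checked against faster_whisper.utils._MODELS in the installed version; treating any value containing "/" as a custom repo path and raising ValueError for unknown sizes are assumptions made for the example.

```python
# Partial, illustrative mirror of faster_whisper.utils._MODELS.
FASTER_WHISPER_REPO_MAP = {
    "large-v3": "Systran/faster-whisper-large-v3",
    "distil-large-v3": "Systran/faster-distil-whisper-large-v3",
}


def resolve_faster_whisper_repo(model_size):
    """Map a model size to a HuggingFace repo ID.

    Values that already look like 'owner/name' are passed through.
    """
    if "/" in model_size:
        return model_size
    try:
        return FASTER_WHISPER_REPO_MAP[model_size]
    except KeyError:
        raise ValueError(f"Unknown FasterWhisper model size: {model_size!r}")
```

Resolving the repo ID up front is what allows the download to go through a downloader that reports progress, instead of happening implicitly inside the model constructor.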

Adds PocketTTSPreloadResult Pydantic model so the OpenAPI client generates a typed method instead of an untyped dict return.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Upgrades the local TTS stack to PocketTTS 2.0 (multilingual + model selection/quantization) and adds supporting runtime APIs (model status, voice preloading/precompute, voice preprocessing), while also improving STT model download/init UX and introducing a global spoken_language setting that feeds into prompt formatting and STT/TTS defaults.

Changes:

Migrate PocketTTS integration to v2.0 with model IDs, quantization (torchao), LRU voice-state caching, and background (pre)compute/preload flows.
Add new core API routes for PocketTTS status/models, voice preloading/precompute, and an upload-based voice preprocessing pipeline.
Add spoken_language setting + migration and inject {language_instruction} into the default system prompt; set PocketTTS as the default TTS provider.

Reviewed changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 4 comments.


File	Description
WingmanAiCore.spec	Includes `torchao` artifacts in PyInstaller build to support PocketTTS int8 quantization.
wingman_core.py	Adds PocketTTS status/models/preload/precompute/preprocess endpoints; defers PocketTTS init; warms voice cache on tower init.
templates/configs/settings.yaml	Updates PocketTTS defaults (model + quantize) and adds `spoken_language`.
templates/configs/defaults.yaml	Injects `{language_instruction}` and switches default `tts_provider` to `pocket_tts`.
services/voice_preprocessing.py	New audio preprocessing utility used for voice cloning uploads.
services/stt_provider_manager.py	Downloads FasterWhisper models via `ModelDownloader` with UI status callbacks; status propagation on provider switches.
services/settings_service.py	Adds `spoken_language` cascade logic and status callbacks for runtime STT provider switching.
services/migrations/migration_311_to_312.py	Migrates PocketTTS settings to new model/quantize fields, adds `spoken_language`, resets system prompt, switches default TTS provider.
services/file.py	Adds `get_pocket_tts_models_dir()` helper.
services/context_builder.py	Provides `{language_instruction}` substitution based on `spoken_language`.
services/audio_player.py	Best-effort saving of generated preview audio to disk for later use.
requirements.txt	Bumps `pocket-tts` to 2.0.0 and adds `torchao`.
providers/pocket_tts.py	Major refactor for PocketTTS v2: model IDs, custom model discovery, per-model cache tagging, precompute/preload/status, and concurrency locks.
providers/parakeet.py	Adds language cascade behavior (wingman override → global setting).
api/interface.py	Updates PocketTTS settings schema and adds `spoken_language` + preload response model.

Comments suppressed due to low confidence (1)

providers/pocket_tts.py:286

unload_model() mutates self.model and clears voice_cache without acquiring _model_swap_lock. Because synthesis runs in a worker thread under _model_swap_lock, a concurrent call to unload_model() (e.g., via the stop endpoint or a settings change) can delete/NULL the model mid-generation and crash. Make load_model()/unload_model() take _model_swap_lock internally (or provide a single public ‘swap’ method that always holds the lock) so all entrypoints are safe.

    def unload_model(self):
        """Unload the model to free resources."""
        if self.model:
            del self.model
            self.model = None

        # Explicitly clear CUDA cache if using GPU to free GPU memory
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

        self.voice_cache.clear()
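One way to apply the suggestion above is to have every entrypoint that touches the model take the same lock. The sketch below uses a stand-in class, not the real provider: `TTSProviderSketch`, its `generate` method, and the choice of `threading.RLock` (so a combined swap method could call unload and load while already holding the lock) are assumptions, and the CUDA cache clearing from the original is omitted to keep the example dependency-free.

```python
import threading


class TTSProviderSketch:
    """Stand-in provider showing lock-protected load, unload, and generate."""

    def __init__(self):
        # Reentrant so one thread can nest unload/load inside a swap.
        self._model_swap_lock = threading.RLock()
        self.model = None
        self.voice_cache = {}

    def load_model(self, model):
        with self._model_swap_lock:
            self.model = model
            self.voice_cache.clear()

    def unload_model(self):
        """Unload the model; waits for any in-flight generation to finish."""
        with self._model_swap_lock:
            self.model = None
            self.voice_cache.clear()

    def generate(self, text):
        with self._model_swap_lock:
            if self.model is None:
                raise RuntimeError("Model is not loaded.")
            # The model cannot be swapped out while this call is running.
            return self.model(text)
```

With this shape, a stop request or settings change that calls `unload_model()` from another thread blocks until the current generation returns, rather than clearing the model underneath it.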


fixes #377 #389

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Shackless and others added 11 commits April 22, 2026 23:16

feat(pocket_tts): typed response model for /preload_voice

c42d611

Adds PocketTTSPreloadResult Pydantic model so the OpenAPI client generates a typed method instead of an untyped dict return. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

parakeet fallback language in settings and some fixes

de597e7

Copilot AI review requested due to automatic review settings April 23, 2026 19:39

Copilot started reviewing on behalf of Shackless April 23, 2026 19:40 View session

Copilot AI reviewed Apr 23, 2026


Comment thread providers/pocket_tts.py

Comment thread providers/pocket_tts.py

Comment thread providers/pocket_tts.py

Comment thread wingman_core.py

Shackless merged commit af43500 into develop Apr 23, 2026
5 of 6 checks passed

Shackless deleted the feature/pockettts-v2 branch April 23, 2026 20:07

Shackless added a commit that referenced this pull request Apr 23, 2026

PocketTTS 2.0 (#390)

69bb745

fixes #377 #389

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Shackless merged 11 commits into develop from feature/pockettts-v2