
PocketTTS 2.0 #390

Merged: Shackless merged 11 commits into develop from feature/pockettts-v2 on Apr 23, 2026

Shackless commented Apr 23, 2026

fixes #377 #389

Shackless and others added 11 commits April 22, 2026 23:16
pocket-tts 2.0 switches to the language= / config= API and ships
HF-hosted weights per model. torchao is required for int8 quantization;
without it, the torch.ao fallback wraps attention in_proj with a bound
method instead of a tensor and breaks voice cloning
('function' object has no attribute 'device').

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…older

- PocketTTSSettings: replace custom_model_path with model (language id
  or custom .yaml) and add quantize toggle.
- SettingsConfig: add spoken_language (default 'multilingual').
- defaults.yaml: replace 'Mirror the user's language' with the
  {language_instruction} template placeholder; switch default
  tts_provider to pocket_tts.
- settings.yaml: default model 'english_2026-04', quantize true.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- load_model() switches to TTSModel.load_model(language=/config=),
  with quantize=True honored in both paths. Legacy English IDs
  ('english', 'english_2026-01') are silently aliased to
  'english_2026-04' at runtime.
- BUILTIN_MODELS list with label==id so the model string is visible
  in the UI and matches the generated safetensors cache filename.
- get_state_for_audio_prompt calls now pass truncate=True — v2
  destabilizes on long prompts and produces near-instant EOS.
- Voice cache: LRU-bounded OrderedDict (32 entries) so long-running
  sessions don't leak model-state tensors.
- Per-model safetensors tagging: cloned voice states are persisted
  as <stem>.<model_id>.safetensors next to the source audio, so
  switching model families doesn't destroy the previous cache.
  Listing dedupes by stem so the same voice isn't shown twice.
- Pre-splash aware: defer_load flag + deferred_init() to let
  wingman_core drive load during the splash-screen flow.
- Thread-safety: two-layer locking. _async_gen_lock serializes
  coroutines; _model_swap_lock blocks the settings-reload daemon
  thread while any executor or audio-callback thread is mid-
  generation (TTSModel.generate_audio is not thread-safe per v2 docs).
- frames_after_eos defaults to None so pocket-tts auto-picks
  1-3 trailing frames based on text length instead of our old 3.
- preload_voice_states() warms the cache for a set of voice IDs
  with a progress callback; skips builtins (pocket-tts resolves
  them from HF cache lazily anyway).
- on_model_reloaded callback fires after successful load so
  wingman_core can schedule a preload for the new model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
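
The LRU-bounded voice cache described in the commit above could look roughly like the following. This is a minimal sketch, not the code from providers/pocket_tts.py: the class name `VoiceStateCache` and its method names are hypothetical, and only the 32-entry bound and the use of `OrderedDict` come from the commit message.

```python
from collections import OrderedDict
from typing import Any, Optional


class VoiceStateCache:
    """Hypothetical LRU cache for cloned voice states (name is an assumption)."""

    def __init__(self, max_entries: int = 32) -> None:
        # 32 mirrors the bound named in the commit message.
        self._entries: "OrderedDict[str, Any]" = OrderedDict()
        self._max_entries = max_entries

    def get(self, voice_id: str) -> Optional[Any]:
        if voice_id not in self._entries:
            return None
        # A hit makes the entry the most recently used one.
        self._entries.move_to_end(voice_id)
        return self._entries[voice_id]

    def put(self, voice_id: str, state: Any) -> None:
        self._entries[voice_id] = state
        self._entries.move_to_end(voice_id)
        # Evict from the least recently used end until the bound holds.
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)

    def __contains__(self, voice_id: str) -> bool:
        return voice_id in self._entries

    def __len__(self) -> int:
        return len(self._entries)
```

Evicting on insert keeps memory bounded without a background sweeper, which is presumably why a plain `OrderedDict` was enough here.
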
services/voice_preprocessing.py: pure-Python pipeline (soundfile +
scipy, no ffmpeg) that takes arbitrary wav/mp3/flac/ogg/aiff uploads
and produces the format PocketTTS clones best from:

  - mono downmix
  - resample to 24 kHz (model sample rate)
  - RMS-gated silence trim on both ends
  - peak-normalize to -1 dBFS
  - cap to 20 s (v2 destabilizes on long prompts)
  - write as int16 mono WAV

Raises ValueError with a user-friendly message when the clip is too
short after trimming. supported_input_extensions() exposes the decoder
set for the endpoint layer.
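
The trim, cap, and normalize steps listed above can be sketched in plain Python as follows. This is an illustration only: the real module reportedly uses soundfile and scipy, whereas this sketch works on a list of mono float samples already at 24 kHz and skips decoding, downmixing, and resampling. The function names, the 10 ms frame size, the RMS threshold, and the one-second minimum length are all assumptions.

```python
import math
from typing import List

SAMPLE_RATE = 24_000                 # model sample rate from the commit message
MAX_SECONDS = 20                     # cap from the commit message
TARGET_PEAK = 10 ** (-1.0 / 20.0)    # -1 dBFS as a linear amplitude (~0.891)


def trim_silence(samples: List[float], frame: int = 240,
                 threshold: float = 0.01) -> List[float]:
    """Drop leading and trailing frames whose RMS is below the threshold."""
    def rms(chunk: List[float]) -> float:
        return math.sqrt(sum(x * x for x in chunk) / len(chunk))

    frames = [samples[i:i + frame] for i in range(0, len(samples), frame)]
    loud = [i for i, chunk in enumerate(frames) if rms(chunk) >= threshold]
    if not loud:
        return []
    start = loud[0] * frame
    end = min(len(samples), (loud[-1] + 1) * frame)
    return samples[start:end]


def clean_clip(samples: List[float], min_seconds: float = 1.0) -> List[int]:
    """Trim silence, cap the length, peak-normalize, and convert to int16."""
    trimmed = trim_silence(samples)
    if len(trimmed) < min_seconds * SAMPLE_RATE:
        raise ValueError("The clip is too short after trimming silence.")
    capped = trimmed[:MAX_SECONDS * SAMPLE_RATE]
    peak = max(abs(x) for x in capped)   # > 0 because the first frame is loud
    gain = TARGET_PEAK / peak
    return [int(round(x * gain * 32767)) for x in capped]
```

Trimming before capping matters: capping first could spend part of the 20 s budget on leading silence.
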

services/file.py: add get_pocket_tts_models_dir() helper. Custom YAML
model configs live there and persist across app updates, matching the
pattern used by custom_voices.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Migration changes:

- pocket_tts: translate language+high_quality to the new model string,
  upgrade legacy English IDs ('english', 'english_2026-01') to the
  pinned 'english_2026-04', drop custom_model_path, default
  quantize=true.
- settings: add spoken_language='multilingual' for upgrading users.
- features.tts_provider: wingman_pro → pocket_tts for default and
  per-wingman configs. Existing Wingman Pro voice settings are left
  in place so users can flip back.
- prompts.system_prompt: reset to the shipped default so everyone
  picks up the new {language_instruction} placeholder. Only applied
  when the wingman/defaults config actually has a system_prompt set
  (no new override injected).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
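
A minimal sketch of the pocket_tts part of this migration, assuming the settings are a plain dict. The function name `migrate_pocket_tts` is hypothetical, and the language+high_quality translation is left out because its mapping is not spelled out in the commit message; falling back to the pinned English ID when no model is set is also an assumption.

```python
from typing import Any, Dict

LEGACY_ENGLISH_IDS = {"english", "english_2026-01"}
PINNED_ENGLISH_ID = "english_2026-04"


def migrate_pocket_tts(old: Dict[str, Any]) -> Dict[str, Any]:
    """Return a migrated copy of the pocket_tts settings block."""
    new = dict(old)
    # custom_model_path is replaced by the model field in v2.
    new.pop("custom_model_path", None)
    model = new.get("model", PINNED_ENGLISH_ID)  # assumed fallback
    if model in LEGACY_ENGLISH_IDS:
        model = PINNED_ENGLISH_ID
    new["model"] = model
    # Existing explicit choices are kept; only a missing key gets the default.
    new.setdefault("quantize", True)
    return new
```
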
services/settings_service.py:
  - SPOKEN_TO_POCKET_TTS maps each supported language to its canonical
    pocket-tts model ID (en→english_2026-04, de→german, fr→french_24l
    because French has no 6L variant, etc.).
  - On spoken_language change in save_settings(), cascade to:
      - pocket_tts.model (triggers PocketTTS reload)
      - voice_activation.fasterwhisper_config.language (None for
        'multilingual' to let FasterWhisper auto-detect)
    Parakeet has no language parameter and auto-detects.
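
The cascade described above might be sketched like this. Only the three mappings named in the commit message are included, and the function name `cascade_spoken_language` is hypothetical. Leaving pocket_tts.model untouched when a language has no mapping (including 'multilingual') is an assumption, since the commit message does not say what happens in that case.

```python
import copy
from typing import Any, Dict

# Subset taken from the commit message; the real table covers more languages.
SPOKEN_TO_POCKET_TTS = {
    "en": "english_2026-04",
    "de": "german",
    "fr": "french_24l",  # French has no 6L variant
}


def cascade_spoken_language(settings: Dict[str, Any],
                            spoken_language: str) -> Dict[str, Any]:
    """Return a copy of settings with TTS and STT language fields updated."""
    updated = copy.deepcopy(settings)
    updated["spoken_language"] = spoken_language

    model = SPOKEN_TO_POCKET_TTS.get(spoken_language)
    if model is not None:  # assumption: unmapped languages keep the old model
        updated["pocket_tts"]["model"] = model

    # None lets FasterWhisper auto-detect the language.
    whisper_language = None if spoken_language == "multilingual" else spoken_language
    updated["voice_activation"]["fasterwhisper_config"]["language"] = whisper_language
    return updated
```
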

services/context_builder.py:
  Resolve the {language_instruction} placeholder just before formatting
  the system prompt. Multilingual → 'Respond in whatever language the
  user speaks to you'. Specific language → 'Always respond in <name>'.
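
A small sketch of the placeholder resolution, using the two instruction strings quoted above. The `LANGUAGE_NAMES` table and both function names are illustrative assumptions, not the contents of services/context_builder.py.

```python
# Illustrative subset; the real code presumably covers every supported language.
LANGUAGE_NAMES = {"en": "English", "de": "German", "fr": "French"}


def language_instruction(spoken_language: str) -> str:
    """Build the sentence that replaces {language_instruction}."""
    if spoken_language == "multilingual":
        return "Respond in whatever language the user speaks to you"
    name = LANGUAGE_NAMES.get(spoken_language, spoken_language)
    return f"Always respond in {name}"


def resolve_language_placeholder(template: str, spoken_language: str) -> str:
    # str.replace leaves any other {placeholders} untouched for later formatting.
    return template.replace("{language_instruction}",
                            language_instruction(spoken_language))
```
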

prompts/tts-test-praise.md: deleted — obsolete TTS test prompt that's
no longer referenced after the OpenAI-compatible TTS test flow was
reworked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Capture main event loop at startup() so background threads (model
  reload) can schedule coroutines.
- Step 4 of startup() now drives pocket_tts.deferred_init() via
  run_in_executor and broadcasts 'Downloading TTS model...' during
  the load. PocketTTS is instantiated with defer_load=True.
- on_model_reloaded callback: after a settings-triggered model reload,
  schedule _preload_pocket_tts_voices() on the main loop via
  run_coroutine_threadsafe and log any background failure via
  add_done_callback so errors are surfaced.
- _preload_pocket_tts_voices(): iterates unique voices in the current
  tower (or a caller-supplied list), broadcasts LOADING_CONFIG with
  per-voice progress so the existing header bar indicator shows the
  work. Optional restore_ready_state=True flips back to READY when
  done — used for per-voice preloads triggered from the config UI.
- Called from initialize_tower() so switching configs frontloads the
  first-clone latency instead of paying it on the first TTS utterance.
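
The hand-off from a background thread to the main event loop could be sketched as below. This assumes only the standard asyncio API; `preload_voices` and `schedule_preload` are hypothetical stand-ins for _preload_pocket_tts_voices() and the on_model_reloaded callback, and the `errors` list stands in for logging.

```python
import asyncio
import concurrent.futures
from typing import List


async def preload_voices(voices: List[str]) -> List[str]:
    """Stand-in for the real preload coroutine; it just yields once per voice."""
    loaded = []
    for voice in voices:
        await asyncio.sleep(0)
        loaded.append(voice)
    return loaded


def schedule_preload(loop: asyncio.AbstractEventLoop, voices: List[str],
                     errors: List[BaseException]) -> concurrent.futures.Future:
    """Called from a non-loop thread: submit the coroutine to the main loop."""
    future = asyncio.run_coroutine_threadsafe(preload_voices(voices), loop)

    def _surface_failure(done: concurrent.futures.Future) -> None:
        if done.cancelled():
            return
        exc = done.exception()
        if exc is not None:
            errors.append(exc)  # the real code would log here

    future.add_done_callback(_surface_failure)
    return future


async def demo() -> List[str]:
    loop = asyncio.get_running_loop()  # captured on the loop thread, as in startup()
    errors: List[BaseException] = []
    # to_thread plays the role of the settings-reload thread.
    future = await asyncio.to_thread(
        schedule_preload, loop, ["alba", "cosette"], errors)
    result = await asyncio.wrap_future(future)
    assert not errors
    return result
```

Without the done callback, an exception raised inside the coroutine would stay stored on the future and go unnoticed unless something called `result()` on it.
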

New endpoints:

- POST /pocket_tts/preload_voice?voice=<id>: preload a single voice on
  demand (client calls this when user picks a voice); header bar
  reflects progress and restores READY on completion.
- POST /voices/preprocess: multipart upload, cleans the audio with
  services.voice_preprocessing, writes the result to custom_voices/
  as <name>.wav, wipes any stale per-model safetensors cache for that
  stem, and invalidates just the matching voice_cache entries.

Voice-name validation (_validate_voice_stem): rejects empty / path-
separator / dot-traversal / control char / Windows reserved (CON,
PRN, AUX, NUL, COM1-9, LPT1-9) / >80 char names, and stems that shadow
a known model tag (would collide with <voice>.<model_id>.safetensors).
Post-join commonpath assertion guarantees dst_path lives inside
custom_voices_dir. Upload uses tempfile.mkstemp for a unique temp path
(PID-based naming collided across concurrent requests).
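
The checks listed above might look roughly like this. It is a sketch under assumptions, not the actual _validate_voice_stem: the function names, the error messages, and the exact rule for "shadowing a model tag" (here, an exact match against a caller-supplied set) are all hypothetical.

```python
import os
from typing import FrozenSet

WINDOWS_RESERVED = {
    "CON", "PRN", "AUX", "NUL",
    *(f"COM{i}" for i in range(1, 10)),
    *(f"LPT{i}" for i in range(1, 10)),
}
MAX_STEM_LENGTH = 80


def validate_voice_stem(stem: str, model_tags: FrozenSet[str] = frozenset()) -> str:
    """Raise ValueError for names that are unsafe to use as a file stem."""
    if not stem or not stem.strip():
        raise ValueError("Voice name must not be empty.")
    if len(stem) > MAX_STEM_LENGTH:
        raise ValueError("Voice name is too long.")
    if "/" in stem or "\\" in stem:
        raise ValueError("Voice name must not contain path separators.")
    if ".." in stem or stem == ".":
        raise ValueError("Voice name must not contain dot traversal.")
    if any(ord(ch) < 32 or ord(ch) == 127 for ch in stem):
        raise ValueError("Voice name must not contain control characters.")
    if stem.upper() in WINDOWS_RESERVED:
        raise ValueError("Voice name is reserved on Windows.")
    if stem in model_tags:  # assumed meaning of "shadows a known model tag"
        raise ValueError("Voice name collides with a model tag.")
    return stem


def resolve_destination(custom_voices_dir: str, stem: str) -> str:
    """Join the path, then assert that it stayed inside custom_voices_dir."""
    base = os.path.abspath(custom_voices_dir)
    dst = os.path.abspath(os.path.join(base, f"{stem}.wav"))
    if os.path.commonpath([base, dst]) != base:
        raise ValueError("Destination escapes the custom voices directory.")
    return dst
```

The post-join `commonpath` check is a second line of defense: even if a name slipped past the stem checks, the write would still be refused.
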

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v2 resolves built-in voices ('alba', 'cosette', ...) from HuggingFace
on first use, so the old bundled pocket-tts-models/embeddings/ and
pocket-tts-voices/ directories are obsolete. Removed their lookup code:

  - POCKET_TTS_VOICES_DIR / INCLUDED_VOICES_DIR constants
  - _get_pocket_tts_included_voices_dir(), _get_wingman_included_voices_dir()
  - _get_app_dir() (no remaining callers)
  - self.wingman_included_voices_dir attribute + its directory scan
    in get_available_voices
  - sys import (was only used by the _MEIPASS lookup in _get_app_dir)

_resolve_voice_path now only checks the custom_voices directory; bare
predefined names pass straight through to pocket-tts / HF.

WingmanAiCore.spec: collect_all('torchao') so PyInstaller picks up
torchao.dtypes / torchao.kernel / torchao.quantization submodules
that load dynamically. Without this the bundled binary can import
torchao but fail to initialize the modern int8 backend, falling back
to the broken torch.ao path at runtime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously _initialize_fasterwhisper just called WhisperModel(model_size),
which silently downloaded the ~1.5 GB CTranslate2 model via faster_whisper's
internal tqdm — no feedback to the UI. Users would sit on
'Initializing speech-to-text (FasterWhisper)...' for minutes with no sign
of progress.

Now mirrors the Parakeet flow:

  - resolve_faster_whisper_repo() maps a model_size ('distil-large-v3',
    'large-v3', etc.) to its HuggingFace repo ID via FASTER_WHISPER_REPO_MAP
    (a static mirror of faster_whisper.utils._MODELS). Custom repo paths
    like 'Systran/faster-whisper-large-v3' are passed through.
  - _initialize_fasterwhisper() downloads via ModelDownloader.download_huggingface
    into <models>/faster-whisper/<model_size>/ with the standard
    whisper allow_patterns before handing off to WhisperModel. The
    existing FasterWhisper.load() lookup picks up the local dir.
  - on_status callbacks broadcast 'Downloading STT model (FasterWhisper: ...)'
    and 'Initializing speech-to-text...' just like Parakeet.
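
The repo resolution step could be sketched as follows. The two map entries are an illustrative subset and should be checked against faster_whisper.utils._MODELS; treating any value that contains a slash as a custom repo path is an assumption about how pass-through is detected.

```python
# Illustrative subset of the static mirror; verify against faster_whisper.
FASTER_WHISPER_REPO_MAP = {
    "large-v3": "Systran/faster-whisper-large-v3",
    "distil-large-v3": "Systran/faster-distil-whisper-large-v3",
}


def resolve_faster_whisper_repo(model_size: str) -> str:
    """Map a model_size to a HuggingFace repo ID, passing custom repos through."""
    if "/" in model_size:  # assumed marker for an explicit repo path
        return model_size
    try:
        return FASTER_WHISPER_REPO_MAP[model_size]
    except KeyError:
        raise ValueError(f"Unknown FasterWhisper model size: {model_size!r}") from None
```
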

Also plumbs on_status through switch_provider so runtime settings changes
(model_size / device / compute_type tweaks, provider flips) surface status
too, not just initial startup. WingmanCore injects two callbacks into
SettingsService — stt_status_callback to emit LOADING_CONFIG updates and
stt_done_callback to flip back to READY in a finally — so save_settings
can drive the shared header-bar indicator without needing a direct
reference to set_core_state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds PocketTTSPreloadResult Pydantic model so the OpenAPI client
generates a typed method instead of an untyped dict return.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 23, 2026 19:39

Copilot AI left a comment

Pull request overview

Upgrades the local TTS stack to PocketTTS 2.0 (multilingual + model selection/quantization) and adds supporting runtime APIs (model status, voice preloading/precompute, voice preprocessing), while also improving STT model download/init UX and introducing a global spoken_language setting that feeds into prompt formatting and STT/TTS defaults.

Changes:

  • Migrate PocketTTS integration to v2.0 with model IDs, quantization (torchao), LRU voice-state caching, and background (pre)compute/preload flows.
  • Add new core API routes for PocketTTS status/models, voice preloading/precompute, and an upload-based voice preprocessing pipeline.
  • Add spoken_language setting + migration and inject {language_instruction} into the default system prompt; set PocketTTS as the default TTS provider.

Reviewed changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| WingmanAiCore.spec | Includes torchao artifacts in PyInstaller build to support PocketTTS int8 quantization. |
| wingman_core.py | Adds PocketTTS status/models/preload/precompute/preprocess endpoints; defers PocketTTS init; warms voice cache on tower init. |
| templates/configs/settings.yaml | Updates PocketTTS defaults (model + quantize) and adds spoken_language. |
| templates/configs/defaults.yaml | Injects {language_instruction} and switches default tts_provider to pocket_tts. |
| services/voice_preprocessing.py | New audio preprocessing utility used for voice cloning uploads. |
| services/stt_provider_manager.py | Downloads FasterWhisper models via ModelDownloader with UI status callbacks; status propagation on provider switches. |
| services/settings_service.py | Adds spoken_language cascade logic and status callbacks for runtime STT provider switching. |
| services/migrations/migration_311_to_312.py | Migrates PocketTTS settings to new model/quantize fields, adds spoken_language, resets system prompt, switches default TTS provider. |
| services/file.py | Adds get_pocket_tts_models_dir() helper. |
| services/context_builder.py | Provides {language_instruction} substitution based on spoken_language. |
| services/audio_player.py | Best-effort saving of generated preview audio to disk for later use. |
| requirements.txt | Bumps pocket-tts to 2.0.0 and adds torchao. |
| providers/pocket_tts.py | Major refactor for PocketTTS v2: model IDs, custom model discovery, per-model cache tagging, precompute/preload/status, and concurrency locks. |
| providers/parakeet.py | Adds language cascade behavior (wingman override → global setting). |
| api/interface.py | Updates PocketTTS settings schema and adds spoken_language + preload response model. |
Comments suppressed due to low confidence (1)

providers/pocket_tts.py:286

  • unload_model() mutates self.model and clears voice_cache without acquiring _model_swap_lock. Because synthesis runs in a worker thread under _model_swap_lock, a concurrent call to unload_model() (e.g., via the stop endpoint or a settings change) can delete/NULL the model mid-generation and crash. Make load_model()/unload_model() take _model_swap_lock internally (or provide a single public ‘swap’ method that always holds the lock) so all entrypoints are safe.
    def unload_model(self):
        """Unload the model to free resources."""
        if self.model:
            del self.model
            self.model = None

        # Explicitly clear CUDA cache if using GPU to free GPU memory
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

        self.voice_cache.clear()
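
One way to apply the suggestion above is to have every entry point take the swap lock itself. The following is a sketch, not the merged fix: `ModelHolder`, `generate`, and the `factory` argument are hypothetical, the CUDA cache clearing is omitted to keep the example dependency-free, and the choice of a re-entrant lock is an assumption so that a method already holding the lock can call another locked method.

```python
import threading
from typing import Any, Callable, Dict, Optional


class ModelHolder:
    """Hypothetical owner of the model; all mutation goes through one lock."""

    def __init__(self) -> None:
        self._model_swap_lock = threading.RLock()
        self.model: Optional[Callable[[str], Any]] = None
        self.voice_cache: Dict[str, Any] = {}

    def load_model(self, factory: Callable[[], Callable[[str], Any]]) -> None:
        with self._model_swap_lock:
            self.unload_model()  # safe: the lock is re-entrant
            self.model = factory()

    def unload_model(self) -> None:
        with self._model_swap_lock:
            self.model = None
            self.voice_cache.clear()

    def generate(self, text: str) -> Any:
        # Holding the lock for the whole call blocks a concurrent unload.
        with self._model_swap_lock:
            if self.model is None:
                raise RuntimeError("Model is not loaded.")
            return self.model(text)
```

With this shape, a stop endpoint or settings reload that calls `unload_model()` waits for an in-flight `generate()` to finish instead of clearing the model underneath it.
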


Comment thread providers/pocket_tts.py
Comment thread providers/pocket_tts.py
Comment thread providers/pocket_tts.py
Comment thread wingman_core.py
Shackless merged commit af43500 into develop on Apr 23, 2026
5 of 6 checks passed
Shackless deleted the feature/pockettts-v2 branch on April 23, 2026 20:07
Shackless added a commit that referenced this pull request Apr 23, 2026
fixes #377 #389

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>