Model/Pipeline/Scheduler description
We are the JoyAI Team (JD.com), proposing the integration of JoyAI-Echo into Diffusers.
JoyAI-Echo is a unified framework for long-form audio-visual generation, designed to support minute-level video creation with synchronized audio, strong temporal consistency, and real-time interaction.
Key innovations:
- Cross-modal audio-visual memory bank: preserves character appearance and voice timbre across long sequences (up to minutes)
- DMD-distilled few-step inference: ~7.5× faster than baseline while improving alignment and visual quality
- Joint audio-video generation: a single pipeline produces synchronized video and audio
- Multi-shot story generation: generates coherent sequences of shots from prompt lists
The architecture builds on LTX-2 and adds the JoyAI-Echo DMD denoising schedule plus a paired audio-video memory bank for cross-shot consistency.
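To make the cross-shot consistency idea concrete, here is a minimal, framework-free sketch of how a paired memory bank could feed a multi-shot loop. It is an illustration only and not the JoyAI-Echo or Diffusers API: `PairedMemoryBank`, `generate_story`, `stub_shot`, and the `capacity` parameter are hypothetical names, and the "features" are plain lists of floats standing in for real latents.

```python
from collections import deque
from typing import Callable, List, Tuple

# Hypothetical stand-ins: a "feature" is just a list of floats in this sketch.
Feature = List[float]
Pair = Tuple[Feature, Feature]  # (video feature, audio feature)


class PairedMemoryBank:
    """Toy bank that keeps paired (video, audio) features from earlier shots."""

    def __init__(self, capacity: int = 4) -> None:
        self.capacity = capacity
        self._entries = deque()

    def add(self, video_feat: Feature, audio_feat: Feature) -> None:
        # Evict the oldest pair once the bank is full (assumed policy).
        if len(self._entries) == self.capacity:
            self._entries.popleft()
        self._entries.append((video_feat, audio_feat))

    def context(self) -> List[Pair]:
        # Snapshot of stored pairs, oldest first.
        return list(self._entries)


def generate_story(
    prompts: List[str],
    generate_shot: Callable[[str, List[Pair]], Pair],
    bank: PairedMemoryBank,
) -> List[Pair]:
    """Generate one shot per prompt, conditioning each on the bank's contents."""
    shots = []
    for prompt in prompts:
        video_feat, audio_feat = generate_shot(prompt, bank.context())
        bank.add(video_feat, audio_feat)
        shots.append((video_feat, audio_feat))
    return shots


def stub_shot(prompt: str, context: List[Pair]) -> Pair:
    # Stand-in for a real model call: records prompt length and context size.
    return [float(len(prompt))], [float(len(context))]
```

In this sketch the video and audio features are stored and evicted together, so a later shot always sees a matched appearance/voice pair. The real memory bank may select, weight, or compress entries differently.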
Open source status
Provide useful links for the implementation
Additional context
We (JoyAI Team) previously contributed JoyAI-Image-Edit to Diffusers (PR #13444, merged). This proposal follows the same pattern: the official team provides a complete, tested implementation.