[New Pipeline/Model] Add JoyAI-Echo multi-shot audio-video generation pipeline #13909

@sjq66

Model/Pipeline/Scheduler description

We are the JoyAI Team (JD.com), proposing the integration of JoyAI-Echo into Diffusers.

JoyAI-Echo is a unified framework for long-form audio-visual generation, designed to support minute-level video creation with synchronized audio, strong temporal consistency, and real-time interaction.

Key innovations:

  • Cross-modal audio-visual memory bank: preserves character appearance and voice timbre across long sequences (up to minutes)
  • DMD-distilled few-step inference: ~7.5× faster than the baseline while improving alignment and visual quality
  • Joint audio-video generation: a single pipeline produces synchronized video and audio
  • Multi-shot story generation: generates coherent sequences of shots from prompt lists

The architecture builds on LTX-2 and adds the JoyAI-Echo DMD denoising schedule plus a paired audio-video memory bank for cross-shot consistency.
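To make the multi-shot flow concrete, here is a minimal sketch of how a paired audio-video memory bank could carry context from one shot to the next. It is an illustration only, not the JoyAI-Echo or Diffusers API: `MemoryBank`, `generate_story`, `fake_generate_shot`, and the `max_entries` parameter are all hypothetical names, and the stand-in generator returns strings rather than real video or audio tensors.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Shot = Tuple[str, str]  # (video, audio) placeholders; a real model would return tensors


@dataclass
class MemoryBank:
    """Hypothetical paired audio-video memory: one entry per finished shot."""
    max_entries: int = 4
    entries: List[Shot] = field(default_factory=list)

    def update(self, video: str, audio: str) -> None:
        self.entries.append((video, audio))
        # Keep only the most recent entries so the conditioning context stays bounded.
        self.entries = self.entries[-self.max_entries:]


def generate_story(
    prompts: List[str],
    generate_shot: Callable[[str, List[Shot]], Shot],
    bank: Optional[MemoryBank] = None,
) -> List[Shot]:
    """Generate one shot per prompt, conditioning each shot on earlier ones."""
    bank = bank if bank is not None else MemoryBank()
    shots: List[Shot] = []
    for prompt in prompts:
        # Each shot sees its own prompt plus a snapshot of the memory so far.
        video, audio = generate_shot(prompt, list(bank.entries))
        bank.update(video, audio)
        shots.append((video, audio))
    return shots


def fake_generate_shot(prompt: str, memory: List[Shot]) -> Shot:
    # Stand-in for the real model (assumption): records how much memory it saw.
    tag = f"{prompt}|mem={len(memory)}"
    return f"video({tag})", f"audio({tag})"


story = generate_story(["intro", "chase", "ending"], fake_generate_shot)
for video, audio in story:
    print(video, audio)
```

The point of the sketch is the control flow: video and audio for a shot are produced together, and both are written to the same memory entry, so later shots can be conditioned on appearance and voice jointly. How the actual pipeline encodes, selects, and injects memory is not specified in this issue.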

Open source status

  • The model implementation is available.
  • The model weights are available (only relevant if the addition is not a scheduler).

Provide useful links for the implementation

Additional context

We (JoyAI Team) previously contributed JoyAI-Image-Edit to Diffusers (PR #13444, merged). This proposal follows the same pattern: the official team provides a complete, tested implementation.
