LTX-2: Efficient Joint Audio-Visual Foundation Model

Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, Eitan Richardson, Guy Shiran, Itay Chachy, Jonathan Chetboun, Michael Finkelson, Michael Kupchick, Nir Zabari, Nitzan Guetta, Noa Kotler, Ofir Bibi, Ori Gordon, Poriya Panet, Roi Benita, Shahar Armon, Victor Kulikov, Yaron Inger, Yonatan Shiftan, Zeev Melumian, Zeev Farbman

Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent -- missing the semantic, emotional, and atmospheric cues that audio provides. We introduce LTX-2, an open-source foundational model capable of generating high-quality, temporally synchronized audiovisual content in a unified manner. LTX-2 consists of an asymmetric dual-stream transformer with a 14B-parameter video stream and a 5B-parameter audio stream, coupled through bidirectional audio-video cross-attention layers with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning. This architecture enables efficient training and inference of a unified audiovisual model while allocating more capacity for video generation than audio generation. We employ a multilingual text encoder for broader prompt understanding and introduce a modality-aware classifier-free guidance (modality-CFG) mechanism for improved audiovisual alignment and controllability. Beyond generating speech, LTX-2 produces rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene -- complete with natural background and foley elements. In our evaluations, the model achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of their computational cost and inference time. All model weights and code are publicly released.
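To make the coupling between the two streams concrete, here is a minimal sketch of bidirectional cross-attention between a wide "video" stream and a narrower "audio" stream. It is illustrative only and is not the released LTX-2 code. The token counts, widths (`D_VIDEO`, `D_AUDIO`, `D_ATTN`), random weights, single attention head, and the helper names (`cross_attend`, `rand_matrix`) are all assumptions made for this example. The temporal positional embeddings and cross-modality AdaLN named in the abstract are omitted.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(scores):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def rand_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights or token features."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def cross_attend(query_stream, context_stream, w_q, w_k, w_v, w_o):
    """Single-head cross-attention: query_stream reads from context_stream.

    The two streams may have different widths; w_q, w_k, w_v project both
    into a shared attention width, and w_o projects back to the query width.
    Returns query_stream plus the attended update (a residual connection).
    """
    q = matmul(query_stream, w_q)      # (T_query, d_attn)
    k = matmul(context_stream, w_k)    # (T_context, d_attn)
    v = matmul(context_stream, w_v)    # (T_context, d_attn)
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[s * scale for s in row] for row in matmul(q, transpose(k))]
    weights = softmax_rows(scores)     # each row sums to 1
    update = matmul(matmul(weights, v), w_o)  # (T_query, d_query)
    return [[x + u for x, u in zip(x_row, u_row)]
            for x_row, u_row in zip(query_stream, update)]


# Assumed toy sizes: the video stream is wider than the audio stream,
# mirroring the asymmetric capacity split described in the abstract.
D_VIDEO, D_AUDIO, D_ATTN = 8, 4, 4
rng = random.Random(0)
video = rand_matrix(6, D_VIDEO, rng)   # 6 video tokens
audio = rand_matrix(4, D_AUDIO, rng)   # 4 audio tokens

# Direction 1: video tokens attend to audio tokens.
new_video = cross_attend(
    video, audio,
    rand_matrix(D_VIDEO, D_ATTN, rng), rand_matrix(D_AUDIO, D_ATTN, rng),
    rand_matrix(D_AUDIO, D_ATTN, rng), rand_matrix(D_ATTN, D_VIDEO, rng),
)

# Direction 2: audio tokens attend to video tokens. Both directions read the
# pre-update states, so neither stream sees the other's update in this step.
new_audio = cross_attend(
    audio, video,
    rand_matrix(D_AUDIO, D_ATTN, rng), rand_matrix(D_VIDEO, D_ATTN, rng),
    rand_matrix(D_VIDEO, D_ATTN, rng), rand_matrix(D_ATTN, D_AUDIO, rng),
)
```

Each stream keeps its own token count and width after the exchange. This is what lets the two streams be sized independently while still sharing information in both directions.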
