Omni2Sound: Towards Unified Video-Text-to-Audio Generation

Training a unified model integrating video-to-audio (V2A), text-to-audio (T2A), and joint video-text-to-audio (VT2A) generation offers significant application flexibility, yet faces two unexplored foundational challenges: (1) the scarcity of high-quality audio captions with tight V-A-T alignment, leading to severe semantic conflict between multimodal conditions, and (2) cross-task and intra-task competition, manifesting as an adverse V2A-T2A performance trade-off and modality bias in the VT2A task. First, to address data scarcity, we introduce SoundAtlas, a large-scale dataset (470k pairs) that significantly outperforms existing benchmarks and even human experts in quality. Powered by a novel agentic pipeline, it integrates Vision-to-Language Compression to mitigate visual bias of MLLMs, a Junior-Senior Agent Handoff for a 5$\times$ cost reduction, and rigorous Post-hoc Filtering to ensure fidelity. Consequently, SoundAtlas delivers semantically rich and temporally detailed captions with tight V-A-T alignment. Second, we propose Omni2Sound, a unified VT2A diffusion model supporting flexible input modalities. To resolve the inherent cross-task and intra-task competition, we design a three-stage multi-task progressive training schedule that converts cross-task competition into joint optimization and mitigates modality bias in the VT2A task, maintaining both audio-visual alignment and off-screen audio generation faithfulness. Finally, we construct VGGSound-Omni, a comprehensive benchmark for unified evaluation, including challenging off-screen tracks. With a standard DiT backbone, Omni2Sound achieves unified SOTA performance across all three tasks within a single model, demonstrating strong generalization across benchmarks with heterogeneous input conditions.
