Training a unified model that integrates video-to-audio (V2A), text-to-audio (T2A), and joint video-text-to-audio (VT2A) generation offers significant application flexibility, yet faces two largely unexplored foundational challenges: (1) the scarcity of high-quality audio captions with tight audio-visual-text (A-V-T) alignment, which leads to severe semantic conflict between multimodal conditions, and (2) cross-task and intra-task competition, which manifests as an adverse V2A-T2A performance trade-off and modality bias in the VT2A task. First, to address the data scarcity, we introduce SoundAtlas, a large-scale dataset (470k pairs) whose captions significantly surpass those of existing datasets, and even those written by human experts, in quality. It is built with a novel agentic pipeline that integrates Vision-to-Language Compression to mitigate the visual bias of MLLMs, a Junior-Senior Agent Handoff that yields a fivefold cost reduction, and rigorous Post-hoc Filtering to ensure fidelity. Consequently, SoundAtlas delivers semantically rich and temporally detailed captions with tight A-V-T alignment. Second, we propose Omni2Sound, a unified VT2A diffusion model that supports flexible input modalities. To resolve the inherent cross-task and intra-task competition, we design a three-stage multi-task progressive training schedule that converts cross-task competition into joint optimization and mitigates modality bias in the VT2A task, preserving both audio-visual alignment and faithful generation of off-screen sounds. Finally, we construct VGGSound-Omni, a comprehensive benchmark for unified evaluation that includes challenging off-screen tracks. With a standard DiT backbone, Omni2Sound achieves unified state-of-the-art (SOTA) performance across all three tasks within a single model, demonstrating strong generalization across benchmarks with heterogeneous input conditions. The project page is at https://swapforward.github.io/Omni2Sound.
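The abstract does not spell out how a single diffusion model can accept video only, text only, or both. Purely as an illustrative sketch, and not the authors' implementation, the snippet below shows one common way to support flexible input modalities: projecting each available condition into a shared space and substituting a learned null embedding when a modality is absent (which also enables modality dropout during training). All names (`FlexibleConditioner`, dimensions, token counts) are hypothetical assumptions.

```python
# Hypothetical sketch of flexible V2A / T2A / VT2A conditioning via learned
# null embeddings. Illustrative only; not the Omni2Sound implementation.
import torch
import torch.nn as nn

class FlexibleConditioner(nn.Module):
    """Builds a conditioning sequence from optional video and text features."""
    def __init__(self, video_dim=768, text_dim=1024, cond_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, cond_dim)
        self.text_proj = nn.Linear(text_dim, cond_dim)
        # Learned placeholders used when a modality is missing (or randomly
        # dropped during training, e.g. for classifier-free guidance).
        self.null_video = nn.Parameter(torch.zeros(1, 1, cond_dim))
        self.null_text = nn.Parameter(torch.zeros(1, 1, cond_dim))

    def forward(self, video_feats=None, text_feats=None, batch_size=1):
        if video_feats is not None:
            v = self.video_proj(video_feats)              # (B, Tv, D)
        else:
            v = self.null_video.expand(batch_size, 1, -1)  # (B, 1, D)
        if text_feats is not None:
            t = self.text_proj(text_feats)                # (B, Tt, D)
        else:
            t = self.null_text.expand(batch_size, 1, -1)   # (B, 1, D)
        # The concatenated sequence would be consumed by a DiT backbone,
        # e.g. through cross-attention.
        return torch.cat([v, t], dim=1)

# Usage: VT2A (both conditions), V2A (video only), T2A (text only).
cond = FlexibleConditioner()
video = torch.randn(2, 32, 768)   # e.g. 32 frame-level visual tokens
text = torch.randn(2, 20, 1024)   # e.g. 20 caption tokens
c_vt2a = cond(video, text, batch_size=2)
c_v2a = cond(video, None, batch_size=2)
c_t2a = cond(None, text, batch_size=2)
```

This masking-with-null-embeddings pattern is one standard way to train a single conditional model across tasks with heterogeneous inputs; the paper's actual conditioning and three-stage schedule may differ.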