In this paper, we propose Universal Holistic Audio Generation (UniHAGen), a task for synthesizing comprehensive auditory scenes that include both on-screen and off-screen sounds across diverse domains (e.g., ambient events, musical instruments, and human speech). Prior video-conditioned audio generation models typically focus on producing on-screen environmental sounds that correspond to visible sounding events, neglecting off-screen auditory events. Recent holistic joint text-video-to-audio generation models aim to produce auditory scenes with both on- and off-screen sounds, but they are limited to non-speech sounds and cannot generate or integrate human speech. To overcome these limitations, we introduce OmniSonic, a flow-matching-based diffusion framework jointly conditioned on video and text. It features a TriAttn-DiT architecture that performs three cross-attention operations to process on-screen environmental sound, off-screen environmental sound, and speech conditions simultaneously, with a Mixture-of-Experts (MoE) gating mechanism that adaptively balances their contributions during generation. Furthermore, we construct UniHAGen-Bench, a new benchmark with over one thousand samples covering three representative on/off-screen speech-environment scenarios. Extensive experiments show that OmniSonic consistently outperforms state-of-the-art approaches on both objective metrics and human evaluations, establishing a strong baseline for universal and holistic audio generation. Project page: https://weiguopian.github.io/OmniSonic_webpage/
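To make the conditioning idea concrete, the following is a minimal sketch of how three cross-attention branches might be fused by a soft gate. It is an illustration under stated assumptions, not the OmniSonic implementation: the abstract does not specify whether the three attentions run in parallel, whether the MoE gate is soft or top-k, or how the residual is applied. The function names (`cross_attention`, `tri_attn_fuse`), the per-token linear router `gate_weights`, and the omission of learned query/key/value projections are all hypothetical simplifications.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head scaled dot-product cross-attention.

    Assumption: no learned projections; context tokens serve as both keys
    and values. queries is a list of T vectors, context a list of S vectors.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)])
    return out


def tri_attn_fuse(latent, on_screen, off_screen, speech, gate_weights):
    """Fuse three cross-attention branches with a per-token soft gate.

    Assumption: gate_weights is a hypothetical 3 x dim linear router that maps
    each latent token to one logit per condition branch.
    """
    branches = [cross_attention(latent, c) for c in (on_screen, off_screen, speech)]
    fused, all_gates = [], []
    for t, token in enumerate(latent):
        logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
        gates = softmax(logits)  # three weights that sum to 1
        all_gates.append(gates)
        # Residual connection plus the gate-weighted sum of the three branches.
        fused.append([
            token[d] + sum(g * b[t][d] for g, b in zip(gates, branches))
            for d in range(len(token))
        ])
    return fused, all_gates


rng = random.Random(0)


def rand_tokens(n, dim):
    """Random stand-ins for latent or condition token sequences."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


latent = rand_tokens(4, 8)        # audio latent tokens
on_screen = rand_tokens(5, 8)     # e.g. video-derived condition tokens
off_screen = rand_tokens(3, 8)    # e.g. text-derived environmental tokens
speech = rand_tokens(6, 8)        # e.g. speech-content tokens
gate_weights = rand_tokens(3, 8)  # hypothetical router parameters

fused, gates = tri_attn_fuse(latent, on_screen, off_screen, speech, gate_weights)
```

In this sketch each latent token receives its own three gate values, so the balance between on-screen, off-screen, and speech conditions can vary over time; a real model would learn the router and the attention projections jointly with the rest of the network.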