Recent unified audio generation models support diverse tasks across speech, sound effects, and music, but most still focus on isolated task-level synthesis. Real video production, in contrast, often requires multiple components of a complete audio track to be generated jointly and consistently for the same video. We present Foley-Omni, a unified multimodal audio generation model that extends isolated task-level synthesis to complete video soundtrack generation by jointly modeling speech, sound effects, and music within a shared latent generation process. To support training and reproducible evaluation, we develop an audiovisual data curation pipeline and introduce V2ST-Bench, a benchmark for holistic evaluation of video soundtrack generation. Experiments show that Foley-Omni performs competitively with expert systems on individual synthesis tasks while improving speech intelligibility, audiovisual consistency, and perceptual quality in mixed soundtrack generation.