Recent advances in audio generation have focused on text-to-audio (T2A) and video-to-audio (V2A) tasks. However, neither T2A nor V2A methods can generate holistic sound (onscreen and offscreen): T2A cannot produce sounds aligned with onscreen objects, while V2A cannot produce semantically complete audio (offscreen sounds are missing). In this work, we address the task of holistic audio generation: given a video and a text prompt, we aim to generate both onscreen and offscreen sounds that are temporally synchronized with the video and semantically aligned with the text and video. Previous approaches to joint text- and video-to-audio generation often suffer from modality bias, favoring one modality over the other. To overcome this limitation, we introduce VinTAGe, a flow-based transformer model that jointly considers text and video to guide audio generation. Our framework comprises two key components: a Visual-Text Encoder and a Joint VT-SiT model. To reduce modality bias and improve generation quality, we employ pretrained unimodal text-to-audio and video-to-audio generation models for additional guidance. Due to the lack of appropriate benchmarks, we also introduce VinTAGe-Bench, a dataset of 636 video-text-audio pairs containing both onscreen and offscreen sounds. Our comprehensive experiments on VinTAGe-Bench demonstrate that joint text and visual interaction is necessary for holistic audio generation. Furthermore, VinTAGe achieves state-of-the-art results on the VGGSound benchmark. Our source code and pretrained models will be released. A demo is available at: https://www.youtube.com/watch?v=QmqWhUjPkJI.