Recent advances in audio generation have focused on text-to-audio (T2A) and video-to-audio (V2A) tasks. However, neither T2A nor V2A methods can generate holistic sound (onscreen and offscreen): T2A cannot produce sounds aligned with onscreen objects, while V2A cannot produce semantically complete audio (offscreen sounds are missing). In this work, we address the task of holistic audio generation: given a video and a text prompt, we aim to generate both onscreen and offscreen sounds that are temporally synchronized with the video and semantically aligned with the text and video. Previous approaches to joint text- and video-to-audio generation often suffer from modality bias, favoring one modality over the other. To overcome this limitation, we introduce VinTAGe, a flow-based transformer model that jointly considers text and video to guide audio generation. Our framework comprises two key components: a Visual-Text Encoder and a Joint VT-SiT model. To reduce modality bias and improve generation quality, we employ pretrained unimodal text-to-audio and video-to-audio generation models for additional guidance. Due to the lack of appropriate benchmarks, we also introduce VinTAGe-Bench, a dataset of 636 video-text-audio pairs containing both onscreen and offscreen sounds. Our comprehensive experiments on VinTAGe-Bench demonstrate that joint text and visual interaction is necessary for holistic audio generation. Furthermore, VinTAGe achieves state-of-the-art results on the VGGSound benchmark. Our source code and pretrained models will be released. A demo is available at: https://www.youtube.com/watch?v=QmqWhUjPkJI.