This work introduces a new task, text-conditioned selective video-to-audio (V2A) generation, which produces only the user-intended sound from a multi-object video. This capability is especially important in multimedia production, where audio tracks are handled separately for each sound source to allow precise editing, mixing, and creative control. We propose SELVA, a novel text-conditioned V2A model that treats the text prompt as an explicit selector, extracting from the video encoder only the visual features of the prompt-relevant sound source. To suppress text-irrelevant activations while keeping video-encoder finetuning efficient, we introduce supplementary tokens that promote cross-attention and yield robust semantic and temporal grounding. SELVA further employs an autonomous, self-supervised video-mixing scheme to overcome the lack of mono audio track supervision. We evaluate SELVA on VGG-MONOAUDIO, a curated benchmark of clean single-source videos for this task. Extensive experiments and ablations consistently verify its effectiveness in audio quality, semantic alignment, and temporal synchronization.
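The self-supervised video-mixing idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how videos are mixed, so everything here is an assumption made for illustration: the side-by-side spatial concatenation, the `Clip` and `SelectiveExample` containers, and the `mix_clips` helper are hypothetical and are not SELVA's actual implementation. The sketch only shows the general principle: combine two clips into one multi-object video, use the text label of one clip as the prompt, and keep that clip's own audio as the training target.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Clip:
    frames: np.ndarray  # (T, H, W, C) video frames
    audio: np.ndarray   # (N,) waveform aligned with the frames
    label: str          # text description of the sound source


@dataclass
class SelectiveExample:
    frames: np.ndarray        # mixed multi-object video
    prompt: str               # text selecting one sound source
    target_audio: np.ndarray  # audio of the selected source only


def mix_clips(selected: Clip, distractor: Clip) -> SelectiveExample:
    """Build one training example from two clips (hypothetical scheme).

    The two videos are placed side by side, and only the audio of the
    `selected` clip is kept as the target, so supervision comes from
    the data itself rather than from separately annotated tracks.
    """
    if selected.frames.shape[1] != distractor.frames.shape[1]:
        raise ValueError("clips must share the same frame height")

    # Trim both videos to the shorter duration, in frames.
    t = min(selected.frames.shape[0], distractor.frames.shape[0])

    # Assumption: mix spatially by concatenating along the width axis.
    mixed = np.concatenate(
        [selected.frames[:t], distractor.frames[:t]], axis=2
    )

    # Trim the target audio in proportion to the frames that were kept.
    n = len(selected.audio) * t // selected.frames.shape[0]

    return SelectiveExample(
        frames=mixed,
        prompt=selected.label,
        target_audio=selected.audio[:n],
    )
```

Calling `mix_clips(dog, siren)` and `mix_clips(siren, dog)` on the same pair of clips would give two examples with different prompts and targets; their frames contain the same two sources, placed in swapped left and right positions. A model trained on such pairs has to rely on the text prompt to decide which sound to generate.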