Existing video-to-audio (V2A) generation methods predominantly rely on text prompts alongside visual information to synthesize audio. However, two critical bottlenecks persist: semantic granularity gaps in training data, where acoustically distinct sounds are conflated under coarse labels, and the inherent ambiguity of text when describing micro-acoustic features. Together, these bottlenecks make fine-grained sound synthesis difficult under text-based control. To address these limitations, we propose AC-Foley, an audio-conditioned V2A model that conditions directly on reference audio, bypassing the semantic ambiguity of text descriptions and enabling precise manipulation of acoustic attributes. This design supports fine-grained sound synthesis, timbre transfer, and zero-shot sound generation, and improves overall audio quality. Empirically, AC-Foley achieves state-of-the-art performance on Foley generation when conditioned on reference audio, and remains competitive with state-of-the-art V2A methods even without audio conditioning.
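To make the idea of audio conditioning concrete, the sketch below shows one common way a generator can fuse video-derived latent tokens with a reference-audio embedding via cross-attention. This is a minimal illustrative example, not the AC-Foley architecture (which the abstract does not specify); the module, dimensions, and null-embedding fallback are all assumptions.

```python
# Illustrative sketch of audio-conditioned generation via cross-attention.
# NOT the actual AC-Foley model: all names and shapes are assumptions.
import torch
import torch.nn as nn


class AudioConditionedBlock(nn.Module):
    """A transformer block whose output is steered by reference-audio tokens."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor, audio_ctx: torch.Tensor) -> torch.Tensor:
        # x:         (batch, time, dim) latent audio tokens being generated
        # audio_ctx: (batch, ctx_len, dim) embedding of the reference audio;
        #            passing a learned null embedding instead would recover
        #            unconditional (video-only) generation.
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        # Cross-attention injects the acoustic attributes of the reference
        # audio (e.g., timbre) into the generated latents.
        x = x + self.cross_attn(h, audio_ctx, audio_ctx, need_weights=False)[0]
        return x + self.ff(self.norm3(x))


# Usage: condition 100 latent frames on a 32-token reference-audio embedding.
block = AudioConditionedBlock()
latents = torch.randn(2, 100, 512)
ref_audio = torch.randn(2, 32, 512)
out = block(latents, ref_audio)
print(out.shape)  # torch.Size([2, 100, 512])
```

Conditioning on a dense audio embedding rather than a text prompt is what lets such a model carry over micro-acoustic detail that coarse labels cannot express.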