Existing video-to-audio (V2A) generation methods predominantly rely on text prompts alongside visual information to synthesize audio. However, two critical bottlenecks persist: semantic granularity gaps in training data, such as the conflation of acoustically distinct sounds under coarse labels, and textual ambiguity in describing micro-acoustic features. These bottlenecks make fine-grained sound synthesis difficult under text-based control. To address these limitations, we propose AC-Foley, an audio-conditioned V2A model that directly leverages reference audio to achieve precise and fine-grained control over generated sounds. This approach enables fine-grained sound synthesis, timbre transfer, zero-shot sound generation, and improved audio quality. By conditioning directly on audio signals, our approach bypasses the semantic ambiguities of text descriptions while enabling precise manipulation of acoustic attributes. Empirically, AC-Foley achieves state-of-the-art performance for Foley generation when conditioned on reference audio, while remaining competitive with state-of-the-art video-to-audio methods even without audio conditioning. Code and demo are available at: https://ff2416.github.io/AC-Foley-Page