Generating sound effects for videos often requires both creating artistic sounds that diverge significantly from real-life sources and exercising flexible control over the sound design. To address this problem, we introduce MultiFoley, a model designed for video-guided sound generation that supports multimodal conditioning through text, audio, and video. Given a silent video and a text prompt, MultiFoley allows users to create clean sounds (e.g., skateboard wheels spinning without wind noise) or more whimsical sounds (e.g., making a lion's roar sound like a cat's meow). MultiFoley also allows users to choose reference audio from sound effects (SFX) libraries or partial videos for conditioning. A key novelty of our model lies in its joint training on both internet video datasets with low-quality audio and professional SFX recordings, enabling high-quality, full-bandwidth (48kHz) audio generation. Through automated evaluations and human studies, we demonstrate that MultiFoley successfully generates synchronized, high-quality sounds across varied conditional inputs and outperforms existing methods. Please see our project page for video results: https://ificl.github.io/MultiFoley/