Generating sound for silent videos has attracted growing interest, primarily because it can streamline video post-production. However, existing video-sound generation methods create sound directly from visual representations, which is challenging because visual and audio representations are difficult to align. In this paper, we present SonicVisionLM, a novel framework that generates a wide range of sound effects by leveraging vision language models (VLMs). Given a silent video, our approach first uses a VLM to identify events in the video and suggest sounds that match its content. This shift transforms the challenging task of aligning image and audio into two better-studied sub-problems: aligning image to text, and aligning text to audio with popular diffusion models. To improve the quality of audio recommendations with LLMs, we have collected an extensive dataset that maps text descriptions to specific sound effects and developed temporally controlled audio adapters. Our approach surpasses current state-of-the-art video-to-audio methods, achieving enhanced synchronization with the visuals and improved alignment between audio and video content. Project page: https://yusiissy.github.io/SonicVisionLM.github.io/
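The two-stage idea described above (video to text, then text to temporally controlled audio) can be sketched as follows. This is a minimal illustration under stated assumptions, not the SonicVisionLM implementation: `SoundEvent`, `events_to_frame_mask`, `generate_sound_effects`, and the callables `describe_events` and `synthesize` are hypothetical names, and the binary per-frame mask is only one assumed way to express temporal control.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class SoundEvent:
    """One suggested sound effect (hypothetical structure, not from the paper)."""
    description: str  # text prompt, e.g. produced by the VLM stage
    onset_s: float    # assumed start time of the sound, in seconds
    offset_s: float   # assumed end time of the sound, in seconds


def events_to_frame_mask(events: List[SoundEvent], duration_s: float,
                         frames_per_second: int = 10) -> List[int]:
    """Turn timed events into a binary per-frame mask (1 = sound active).

    Assumption: temporal control is expressed as an on/off signal per frame.
    """
    n_frames = math.ceil(duration_s * frames_per_second)
    mask = [0] * n_frames
    for ev in events:
        # Clamp each event to the clip boundaries before marking frames.
        start = max(0, math.floor(ev.onset_s * frames_per_second))
        end = min(n_frames, math.ceil(ev.offset_s * frames_per_second))
        for i in range(start, end):
            mask[i] = 1
    return mask


def generate_sound_effects(
    video_path: str,
    duration_s: float,
    describe_events: Callable[[str], List[SoundEvent]],
    synthesize: Callable[[str, List[int]], Any],
) -> List[Any]:
    """Stage 1: video -> timed text descriptions. Stage 2: text + mask -> audio.

    Both stages are passed in as callables, so any VLM or text-to-audio
    model can be plugged in; neither is implemented here.
    """
    clips = []
    for ev in describe_events(video_path):
        mask = events_to_frame_mask([ev], duration_s)
        clips.append(synthesize(ev.description, mask))
    return clips
```

Passing the two stages as callables keeps the sketch independent of any particular model: the text description is the only interface between the visual side and the audio side, which is the decomposition the abstract argues for.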