Interest in generating sound for silent videos has grown, primarily because it can streamline video post-production. However, existing video-sound generation methods create sound directly from visual representations, which is challenging because visual and audio representations are difficult to align. In this paper, we present SonicVisionLM, a novel framework that generates a wide range of sound effects by leveraging vision language models (VLMs). Instead of generating audio directly from video, our approach takes a silent video, uses a VLM to identify the events in it, and suggests possible sounds that match the video content. This shift turns the challenging task of aligning image and audio into two better-studied sub-problems: aligning image to text and aligning text to audio, using popular diffusion models. To improve the quality of audio recommendations with LLMs, we have collected an extensive dataset that maps text descriptions to specific sound effects, and we have developed temporally controlled audio adapters. Our approach outperforms current state-of-the-art video-to-audio methods, yielding better synchronization with the visuals and improved alignment between audio and video components. Project page: https://yusiissy.github.io/SonicVisionLM.github.io/
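The two-stage idea described above (video to text, then text to temporally controlled audio) can be illustrated with a minimal sketch. It is not the SonicVisionLM implementation: the `SoundEvent` type, the `timing_mask` helper, the frame rate, and the `suggest_events` and `text_to_audio` callables are all hypothetical names introduced here, and the 0/1 frame mask is only one assumed way to express temporal control.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass(frozen=True)
class SoundEvent:
    """A sound suggested for the video (hypothetical type), timed in seconds."""
    description: str
    onset: float
    offset: float


def timing_mask(event: SoundEvent, duration: float,
                frames_per_second: int = 10) -> List[int]:
    """Return a 0/1 mask per audio frame marking where the event is active.

    Assumption: temporal control is expressed as a frame-level on/off signal.
    """
    n_frames = round(duration * frames_per_second)
    mask = []
    for i in range(n_frames):
        t = i / frames_per_second  # start time of frame i
        mask.append(1 if event.onset <= t < event.offset else 0)
    return mask


def video_to_sound(
    video: Any,
    duration: float,
    suggest_events: Callable[[Any], Sequence[SoundEvent]],
    text_to_audio: Callable[[str, List[int]], Any],
) -> List[Any]:
    """Chain the two sub-problems: video -> text events -> conditioned audio.

    `suggest_events` stands in for the VLM stage and `text_to_audio` for a
    text-to-audio model with a timing condition; both are placeholders.
    """
    events = suggest_events(video)
    return [text_to_audio(e.description, timing_mask(e, duration))
            for e in events]
```

Passing both stages in as callables keeps the sketch independent of any particular VLM or diffusion model. Either stage can be replaced by a stub, for example a function that returns a fixed list of `SoundEvent` objects, to check the plumbing before real models are attached.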