Existing works have made strides in video generation, but the lack of sound effects (SFX) and background music (BGM) hinders a complete and immersive viewer experience. We introduce a novel semantically consistent video-to-audio generation framework, namely SVA, which automatically generates audio semantically consistent with the given video content. The framework harnesses a multimodal large language model (MLLM) to understand video semantics from a key frame and generate creative audio schemes, which are then used as prompts for text-to-audio models, resulting in video-to-audio generation with natural language as an interface. We show the satisfactory performance of SVA through a case study and discuss its limitations along with future research directions. The project page is available at https://huiz-a.github.io/audio4video.github.io/.
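The pipeline described in the abstract (key frame, then MLLM-generated audio schemes, then text-to-audio prompts) can be illustrated with a minimal sketch. This is not the authors' implementation: the `AudioScheme` structure, the split into one SFX prompt and one BGM prompt, and the `plan_audio` and `text_to_audio` callables are all hypothetical stand-ins for the MLLM and the text-to-audio model, which are replaced here by stubs so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class AudioScheme:
    """Natural-language audio plan for a video (hypothetical structure)."""
    sfx_prompt: str  # text prompt describing the sound effects
    bgm_prompt: str  # text prompt describing the background music


def video_to_audio(
    key_frame: bytes,
    plan_audio: Callable[[bytes], AudioScheme],
    text_to_audio: Callable[[str], bytes],
) -> Dict[str, bytes]:
    """Generate audio tracks for a video from a single key frame.

    plan_audio stands in for the MLLM step (frame -> audio scheme);
    text_to_audio stands in for a text-to-audio model (prompt -> audio).
    Both are assumptions, injected so any real model could be plugged in.
    """
    scheme = plan_audio(key_frame)
    # Natural language is the only interface between the two stages.
    return {
        "sfx": text_to_audio(scheme.sfx_prompt),
        "bgm": text_to_audio(scheme.bgm_prompt),
    }


# Stubs used only to make the sketch runnable without any model.
def fake_mllm(frame: bytes) -> AudioScheme:
    return AudioScheme(
        sfx_prompt="waves crashing on a beach",
        bgm_prompt="calm ambient guitar",
    )


def fake_text_to_audio(prompt: str) -> bytes:
    # A real model would return waveform data; the stub echoes the prompt.
    return prompt.encode("utf-8")


tracks = video_to_audio(b"<key frame bytes>", fake_mllm, fake_text_to_audio)
```

Passing the two models in as callables keeps the sketch independent of any particular MLLM or text-to-audio system, which mirrors the abstract's point that the stages communicate only through natural-language prompts.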