Verbal videos, featuring voice-overs or text overlays, provide valuable content but present significant challenges in composition, especially when incorporating editing effects to enhance clarity and visual appeal. In this paper, we introduce the novel task of verbal video composition with editing effects. This task aims to generate coherent and visually appealing verbal videos by integrating multimodal editing effects across textual, visual, and audio categories. To achieve this, we curate a large-scale dataset of video effect compositions from publicly available sources. We then formulate this task as a generative problem, involving the identification of appropriate positions in the verbal content and the recommendation of editing effects for those positions. To address this task, we propose VCoME, a general framework that employs a large multimodal model to generate editing effects for video composition. Specifically, VCoME takes the multimodal video context as input and autoregressively outputs where to apply effects within the verbal content and which effects are most appropriate for each position. VCoME also supports prompt-based control of composition density and style, providing substantial flexibility for diverse applications. Through extensive quantitative and qualitative evaluations, we demonstrate the effectiveness of VCoME. A comprehensive user study shows that our method produces videos of professional quality while being 85$\times$ more efficient than professional editors.