The rapid growth of online video content, especially on short video platforms, has increased the demand for efficient video editing techniques that can condense long-form videos into concise and engaging clips. Existing automatic editing methods rely predominantly on textual cues from automatic speech recognition (ASR) transcripts and on end-to-end segment selection. They often neglect the rich visual context, which leads to incoherent outputs. In this paper, we propose a human-inspired automatic video editing framework (HIVE) that leverages multimodal narrative understanding to address these limitations. Our approach uses multimodal large language models to perform character extraction, dialogue analysis, and narrative summarization, enabling a holistic understanding of the video content. To further enhance coherence, we apply scene-level segmentation and decompose the editing process into three subtasks: highlight detection, opening/ending selection, and pruning of irrelevant content. To facilitate research in this area, we introduce DramaAD, a novel benchmark dataset comprising over 800 short drama episodes and 500 professionally edited advertisement clips. Experimental results demonstrate that our framework consistently outperforms existing baselines on both general and advertisement-oriented editing tasks, significantly narrowing the quality gap between automatically edited and human-edited videos.