The proliferation of online short-video platforms has driven a surge in user demand for short-video editing. However, manually selecting, trimming, and assembling raw footage into a coherent, high-quality video remains laborious and time-consuming. To accelerate this process, we focus on a new, user-friendly task called Video Moment Montage (VMM), which aims to accurately locate the video segments that correspond to a pre-provided narration script and then arrange these clips into a complete video that matches the script's descriptions. The challenge lies in extracting precise temporal segments while maintaining both intra-sentence and inter-sentence context consistency, since a single script sentence may require trimming and assembling multiple video clips. To address this problem, we present a novel \textit{Text-Video Multi-Grained Integration} method (TV-MGI) that efficiently fuses text features from the script with both shot-level and frame-level video features, enabling global and fine-grained alignment between the video content and the corresponding textual descriptions in the script. To facilitate further research in this area, we introduce the Multiple Sentences with Shots Dataset (MSSD), a large-scale dataset designed explicitly for the VMM task. Extensive experiments on MSSD demonstrate the effectiveness of our framework compared to baseline methods.
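To make the multi-grained fusion idea concrete, the following minimal PyTorch sketch shows one way script sentence features could attend over shot-level (global) and frame-level (fine-grained) video features; the module structure, dimensions, and names are our illustrative assumptions, not the exact TV-MGI architecture.

\begin{verbatim}
import torch
import torch.nn as nn

class MultiGrainedFusion(nn.Module):
    # Illustrative sketch only: cross-attention fusion of text with
    # shot-level and frame-level video features. All dimensions and
    # the module layout are assumptions, not the paper's exact design.
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.shot_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.frame_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, text, shots, frames):
        # text:   (B, n_sent, dim)   script sentence embeddings
        # shots:  (B, n_shot, dim)   shot-level video features
        # frames: (B, n_frame, dim)  frame-level video features
        shot_ctx, _ = self.shot_attn(text, shots, shots)      # global alignment
        frame_ctx, _ = self.frame_attn(text, frames, frames)  # fine-grained alignment
        return self.fuse(torch.cat([shot_ctx, frame_ctx], dim=-1))
\end{verbatim}

Under these assumptions, the fused per-sentence representation (shape \texttt{(B, n\_sent, dim)}) could then drive downstream segment localization, e.g., predicting start/end boundaries within the retrieved shots.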