Keyframe extraction aims to summarize a video's semantics with the minimum number of frames. This paper proposes a Large Model based Sequential Keyframe Extraction method for video summarization, dubbed LMSKE, which comprises three stages. First, we use the large model "TransNetV2" to cut the video into consecutive shots, and employ the large model "CLIP" to generate a visual feature for each frame within each shot. Second, we develop an adaptive clustering algorithm to yield candidate keyframes for each shot, with each candidate keyframe being the frame located nearest to a cluster center. Third, we further reduce the candidate keyframes via redundancy elimination within each shot, and finally concatenate them in shot order to form the final sequential keyframes. To evaluate LMSKE, we curate a benchmark dataset and conduct extensive experiments, whose results show that LMSKE performs much better than several SOTA competitors, with an average F1 of 0.5311, an average fidelity of 0.8141, and an average compression ratio of 0.9922.
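The second and third stages can be illustrated with a minimal sketch. The abstract does not specify how the clustering adapts or how redundancy is measured, so everything below is an assumption made for illustration: the number of clusters is chosen by silhouette score, redundancy is judged by a cosine-similarity threshold, and the names `candidate_keyframes`, `remove_redundant`, `sequential_keyframes`, `max_k`, and `threshold` are hypothetical. Per-frame features (for example CLIP embeddings) and shot boundaries (for example from TransNetV2) are taken as given inputs.

```python
import numpy as np


def pairwise(a, b):
    """Euclidean distance between every row of a and every row of b."""
    return np.linalg.norm(a[:, None] - b[None], axis=2)


def kmeans(x, k, iters=20):
    """Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [x[0]]
    for _ in range(1, k):
        d = pairwise(x, np.array(centers)).min(axis=1)
        centers.append(x[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        labels = pairwise(x, centers).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return centers, pairwise(x, centers).argmin(axis=1)


def silhouette(x, labels):
    """Mean silhouette score; -1.0 if fewer than two clusters are present."""
    clusters = np.unique(labels)
    if len(clusters) < 2:
        return -1.0
    d = pairwise(x, x)
    scores = []
    for i in range(len(x)):
        same = labels == labels[i]
        if same.sum() < 2:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = d[i, same].sum() / (same.sum() - 1)
        b = min(d[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return float(np.mean(scores))


def candidate_keyframes(features, max_k=5):
    """Assumed 'adaptive' rule: pick k by silhouette score, then return the
    index of the frame nearest to each cluster center."""
    x = np.asarray(features, dtype=float)
    n = len(x)
    if n < 3:
        return [0]
    best_score, best_centers = -np.inf, None
    for k in range(2, min(max_k, n - 1) + 1):
        centers, labels = kmeans(x, k)
        score = silhouette(x, labels)
        if score > best_score:
            best_score, best_centers = score, centers
    nearest = pairwise(x, best_centers).argmin(axis=0)
    return sorted(set(int(i) for i in nearest))


def remove_redundant(features, candidates, threshold=0.9):
    """Assumed redundancy rule: drop a candidate whose cosine similarity to an
    already kept candidate exceeds the threshold."""
    x = np.asarray(features, dtype=float)
    unit = x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)
    kept = []
    for idx in sorted(candidates):
        if all(float(unit[idx] @ unit[j]) <= threshold for j in kept):
            kept.append(idx)
    return kept


def sequential_keyframes(shots, threshold=0.9):
    """shots: list of (start_frame, per-frame feature array), in shot order.
    Returns global frame indices of the keyframes, concatenated by shot."""
    out = []
    for start, feats in shots:
        cands = candidate_keyframes(feats)
        out.extend(start + i for i in remove_redundant(feats, cands, threshold))
    return out
```

In this sketch a nearly static shot still yields at least two candidates from the clustering step, because the search starts at two clusters; the redundancy step is what collapses them to a single keyframe, which is one way to see why the two stages complement each other.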