Video summarization aims to generate a concise representation of a video, capturing its essential content and key moments while reducing its overall length. Although several methods employ attention mechanisms to handle long-term dependencies, they often fail to capture the visual significance inherent in frames. To address this limitation, we propose a CNN-based SpatioTemporal Attention (CSTA) method that stacks the frame features of a single video into an image-like representation and applies a 2D CNN to it. CSTA relies on the CNN to capture inter- and intra-frame relations and to find crucial attributes in videos, exploiting the CNN's ability to learn absolute positions within images. Unlike previous work, which sacrifices efficiency by adding modules that focus on spatial importance, CSTA incurs minimal computational overhead because it uses the CNN as a sliding window. Extensive experiments on two benchmark datasets (SumMe and TVSum) demonstrate that the proposed approach achieves state-of-the-art performance with fewer multiply-accumulate operations (MACs) than previous methods. Code is available at https://github.com/thswodnjs3/CSTA.
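To make the core idea concrete, the following is a minimal toy sketch of stacking per-frame features into an image-like grid and sliding a 2D kernel over it to obtain attention weights. It is not the authors' implementation: the function names (`conv2d_same`, `softmax_over_frames`, `frame_scores`), the single hand-picked 3x3 kernel, the softmax over the frame axis, and the toy feature values are all assumptions made for illustration. A real model would use a learned, multi-layer CNN on much larger features.

```python
import math


def conv2d_same(image, kernel):
    """Slide a small 2D kernel over a 2D list, zero-padded so the output keeps the input size."""
    rows, cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    pad_r, pad_c = k_rows // 2, k_cols // 2
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for i in range(k_rows):
                for j in range(k_cols):
                    rr, cc = r + i - pad_r, c + j - pad_c
                    # Positions outside the grid count as zeros (padding).
                    if 0 <= rr < rows and 0 <= cc < cols:
                        acc += image[rr][cc] * kernel[i][j]
            out[r][c] = acc
    return out


def softmax_over_frames(attn):
    """Normalize every feature column across the frame (row) axis (an assumed choice)."""
    rows, cols = len(attn), len(attn[0])
    out = [[0.0] * cols for _ in range(rows)]
    for c in range(cols):
        column = [attn[r][c] for r in range(rows)]
        peak = max(column)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in column]
        total = sum(exps)
        for r in range(rows):
            out[r][c] = exps[r] / total
    return out


def frame_scores(features, kernel):
    """Weight the stacked features by the attention map and sum per frame."""
    attn = softmax_over_frames(conv2d_same(features, kernel))
    return [
        sum(a * f for a, f in zip(attn_row, feat_row))
        for attn_row, feat_row in zip(attn, features)
    ]


# Toy input: 3 frames (rows), each with a 4-dimensional feature (columns).
features = [
    [0.1, 0.2, 0.1, 0.0],
    [0.9, 0.8, 0.7, 0.9],
    [0.2, 0.1, 0.0, 0.1],
]
# Hypothetical fixed kernel; in practice the CNN weights would be learned.
kernel = [
    [0.0, 0.1, 0.0],
    [0.1, 0.6, 0.1],
    [0.0, 0.1, 0.0],
]
scores = frame_scores(features, kernel)
```

Because each kernel position covers neighboring frames (rows) and neighboring feature dimensions (columns) at once, a single sliding window mixes inter-frame and intra-frame information. In this toy input the second frame receives the highest score.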