Identifying highlight moments in raw video material is crucial for improving the efficiency of editing the videos that pervade internet platforms. However, the extensive effort required to label footage manually is an obstacle to applying supervised methods to videos of unseen categories. In addition, many videos lack an audio modality, which otherwise carries valuable cues for highlight detection, and this makes multimodal strategies difficult to apply. In this paper, we propose a novel model with cross-modal perception for unsupervised highlight detection. The proposed model learns representations with visual-audio level semantics from image-audio pair data via a self-reconstruction task. To achieve unsupervised highlight detection, we investigate the latent representations of the network and propose the representation activation sequence learning (RASL) module, which uses k-point contrastive learning to learn significant representation activations. To connect the visual and audio modalities, we use the symmetric contrastive learning (SCL) module to learn paired visual and audio representations. Furthermore, an auxiliary task of masked feature vector sequence (FVS) reconstruction is conducted simultaneously during pretraining to enhance the representations. During inference, the cross-modal pretrained model generates representations with paired visual-audio semantics given only the visual modality, and the RASL module outputs the highlight scores. Experimental results show that the proposed framework achieves superior performance compared to other state-of-the-art approaches.
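To make the idea of learning paired visual and audio representations more concrete, the following is a minimal sketch of a symmetric contrastive objective. The abstract does not give the SCL formulation, so this assumes a common choice: an InfoNCE-style cross-entropy over cosine similarities, averaged over the visual-to-audio and audio-to-visual directions. The function names, the `temperature` value, and the toy embeddings are all hypothetical and chosen only for illustration; the code is dependency-free rather than written for training.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(visual, audio, temperature=0.1):
    """Assumed InfoNCE-style loss; visual[i] and audio[i] form a positive pair."""
    v = [l2_normalize(x) for x in visual]
    a = [l2_normalize(x) for x in audio]
    # logits[i][j]: temperature-scaled cosine similarity of visual i and audio j
    logits = [
        [sum(p * q for p, q in zip(vi, aj)) / temperature for aj in a]
        for vi in v
    ]
    logits_t = [list(column) for column in zip(*logits)]  # audio-to-visual view
    # Average the two directions so neither modality is privileged.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy embeddings (hypothetical): two clips with 2-dimensional features.
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = symmetric_contrastive_loss(aligned, aligned)
loss_mismatched = symmetric_contrastive_loss(aligned, shuffled)
```

In this toy setup, `loss_matched` is lower than `loss_mismatched`, because the loss rewards each visual embedding for being most similar to its own audio partner. Averaging the two directions is what makes the objective symmetric in the two modalities.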