Identifying highlight moments in raw video footage is crucial for editing efficiently the videos that pervade internet platforms. However, the extensive manual labeling that supervised methods require is an obstacle to applying them to videos of unseen categories. Many videos also lack an audio modality, which carries valuable cues for highlight detection, and this makes multimodal strategies difficult to use. In this paper, we propose a novel model with cross-modal perception for unsupervised highlight detection. The proposed model learns representations with visual-audio level semantics from image-audio pair data via a self-reconstruction task. To achieve unsupervised highlight detection, we investigate the latent representations of the network and propose the representation activation sequence learning (RASL) module, which uses k-point contrastive learning to learn significant representation activations. To connect the visual modality with the audio modality, we use the symmetric contrastive learning (SCL) module to learn paired visual and audio representations. Furthermore, an auxiliary task of masked feature vector sequence (FVS) reconstruction is conducted simultaneously during pretraining to enhance the representations. During inference, the cross-modal pretrained model generates representations with paired visual-audio semantics given only the visual modality, and the RASL module outputs the highlight scores. Experimental results show that the proposed framework outperforms other state-of-the-art approaches.
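As a rough illustration of how a symmetric contrastive objective can tie paired visual and audio representations together, the sketch below computes an InfoNCE-style loss in both directions (visual to audio and audio to visual) and averages the two. The abstract does not specify the exact form of the SCL loss, so the function names, the cosine-similarity logits, and the temperature value are all assumptions made for illustration, not the authors' formulation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cross_entropy_on_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(visual, audio, temperature=0.1):
    """Hypothetical symmetric InfoNCE-style loss over paired embeddings.

    Assumption: visual[i] and audio[i] come from the same clip, and every
    other pairing in the batch is treated as a negative.
    """
    v = [l2_normalize(x) for x in visual]
    a = [l2_normalize(x) for x in audio]
    # Cosine-similarity logits: rows index visual items, columns audio items.
    logits = [
        [sum(p * q for p, q in zip(vi, aj)) / temperature for aj in a]
        for vi in v
    ]
    logits_t = [list(col) for col in zip(*logits)]  # audio-to-visual view
    loss_v2a = cross_entropy_on_diagonal(logits)
    loss_a2v = cross_entropy_on_diagonal(logits_t)
    return 0.5 * (loss_v2a + loss_a2v)


# Toy batch: two clips with 2-dimensional embeddings.
visual = [[1.0, 0.0], [0.0, 1.0]]
audio_aligned = [[1.0, 0.0], [0.0, 1.0]]
audio_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = symmetric_contrastive_loss(visual, audio_aligned)
loss_swapped = symmetric_contrastive_loss(visual, audio_swapped)
```

In this toy batch the aligned pairing yields a much smaller loss than the swapped one, which is the behavior a contrastive pairing objective is meant to reward. Averaging the two directions makes the loss invariant to which modality is treated as the query.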