With the explosive growth of video data in real-world applications, comprehensive video representations are becoming increasingly important. In this paper, we address video scene recognition, whose goal is to learn a high-level video representation for classifying the scenes in videos. The task remains challenging because video content in realistic scenarios is diverse and complex. Most existing works identify scenes from a temporal perspective, using only visual or textual information, and ignore the valuable information hidden in single frames; several earlier studies instead recognize scenes in separate images from a non-temporal perspective. We argue that the two perspectives are both meaningful for this task and complementary to each other, and that externally introduced knowledge can further improve video comprehension. We propose a novel two-stream framework that models video representations from both the temporal and the non-temporal perspective and integrates the two in an end-to-end manner through self-distillation. In addition, we design a knowledge-enhanced feature fusion and label prediction method that introduces knowledge naturally into video scene recognition. Experiments on a real-world dataset demonstrate the effectiveness of the proposed method.
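The abstract does not give the training objective, so the following is only a minimal sketch of how two streams might be coupled by self-distillation. It assumes a symmetric KL term between temperature-softened predictions of the two streams, added to a supervised cross-entropy term for each stream, and it assumes the non-temporal stream simply averages per-frame logits. The function names, `temperature`, and `alpha` are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of class logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Supervised loss for one example with an integer class label."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def two_stream_loss(temporal_logits, frame_logits, label,
                    temperature=2.0, alpha=0.5):
    """Hypothetical joint loss for one video.

    temporal_logits: class logits from the temporal stream.
    frame_logits: list of per-frame class logits from the non-temporal stream.
    """
    # Assumption: the non-temporal prediction is the mean of per-frame logits.
    num_frames = len(frame_logits)
    non_temporal_logits = [sum(col) / num_frames for col in zip(*frame_logits)]

    # Each stream is supervised by the ground-truth scene label.
    supervised = (cross_entropy(temporal_logits, label)
                  + cross_entropy(non_temporal_logits, label))

    # Assumption: symmetric KL between softened predictions, so each stream
    # acts as a teacher for the other within the same network.
    p_temporal = softmax(temporal_logits, temperature)
    p_frames = softmax(non_temporal_logits, temperature)
    distill = (kl_divergence(p_temporal, p_frames)
               + kl_divergence(p_frames, p_temporal))

    return supervised + alpha * distill


# Usage: one video, three scene classes, two sampled frames.
temporal = [2.0, 0.5, -1.0]
frames = [[1.5, 0.7, -0.5], [2.5, 0.1, -1.2]]
loss = two_stream_loss(temporal, frames, label=0)
```

In this sketch the distillation term vanishes when both streams agree, so `alpha` only controls how strongly disagreement between the perspectives is penalized. The paper's actual formulation may differ.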