Deep learning has made significant strides in video understanding tasks, but classifying long, large-scale videos with clip-level video classifiers remains prohibitively expensive in computation. To address this issue, we propose the Audio-Visual Glance Network (AVGN), which leverages the commonly available audio and visual modalities to efficiently process the spatio-temporally important parts of a video. AVGN first divides the video into snippets of image-audio clip pairs and employs lightweight unimodal encoders to extract global visual features and audio features. To identify the important temporal segments, we use an Audio-Visual Temporal Saliency Transformer (AV-TeST) that estimates the saliency score of each frame. To further increase efficiency in the spatial dimension, AVGN processes only the important patches instead of the whole images. We use an Audio-Enhanced Spatial Patch Attention (AESPA) module to produce a set of enhanced coarse visual features, which are fed to a policy network that produces the coordinates of the important patches. This approach enables us to focus only on the spatio-temporally most important parts of the video, leading to more efficient video recognition. Moreover, we incorporate various training techniques and multi-modal feature fusion to enhance the robustness and effectiveness of AVGN. By combining these strategies, AVGN sets a new state of the art on multiple video recognition benchmarks while achieving faster processing speed.
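The two selection steps described above, keeping only salient frames and then only an important patch within each kept frame, can be illustrated with a minimal sketch. It is not the AVGN implementation: the abstract does not say how saliency scores are turned into a frame selection or how patch coordinates are parameterized. The sketch assumes a top-k rule over per-frame scores and a normalized (y, x) coordinate in [0, 1] that positions a fixed-size patch inside the frame. The names `select_salient_frames` and `patch_box` are hypothetical.

```python
from typing import List, Sequence, Tuple


def select_salient_frames(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring frames, in temporal order.

    Assumption: a plain top-k rule; the abstract does not specify the rule.
    """
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of frames")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def patch_box(
    coord: Tuple[float, float],
    frame_hw: Tuple[int, int],
    patch_hw: Tuple[int, int],
) -> Tuple[int, int, int, int]:
    """Map a normalized (y, x) coordinate to a pixel box (top, left, bottom, right).

    Assumption: 0.0 places the patch at the top/left edge and 1.0 at the
    bottom/right edge; out-of-range coordinates are clamped into the frame.
    """
    (y, x), (height, width), (ph, pw) = coord, frame_hw, patch_hw
    if ph > height or pw > width:
        raise ValueError("patch must fit inside the frame")
    top = min(max(int(y * (height - ph)), 0), height - ph)
    left = min(max(int(x * (width - pw)), 0), width - pw)
    return top, left, top + ph, left + pw


# Hypothetical saliency scores for five frames; keep the two most salient.
kept = select_salient_frames([0.1, 0.9, 0.3, 0.7, 0.2], k=2)
# Hypothetical policy output: a centered 96x96 patch in a 224x224 frame.
box = patch_box((0.5, 0.5), frame_hw=(224, 224), patch_hw=(96, 96))
```

With these inputs, `kept` is `[1, 3]` and `box` is `(64, 64, 160, 160)`, so only a 96x96 crop from two of the five frames would reach the more expensive classifier.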