Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification

From arXiv. Accepted by WACV 2024; a well-formatted PDF is available at https://drive.google.com/file/d/1qvW52lamsvNGMCqPS7q8g8L4NaR_LlbR/view?usp=sharing. arXiv admin note: text overlap with arXiv:2401.04023

Audio and video are the two most common modalities on mainstream media platforms such as YouTube. To learn from multimodal videos effectively, we propose a novel audio-video recognition approach termed the audio video Transformer (AVT), which leverages the effective spatio-temporal representation of the video Transformer to improve action recognition accuracy. For multimodal fusion, simply concatenating multimodal tokens in a cross-modal Transformer requires large computational and memory resources; instead, we reduce the cross-modality complexity through an audio-video bottleneck Transformer. To improve the learning efficiency of the multimodal Transformer, we integrate self-supervised objectives, i.e., audio-video contrastive learning, audio-video matching, and masked audio and video learning, into AVT training, which maps diverse audio and video representations into a common multimodal representation space. We further propose a masked audio segment loss to learn semantic audio activities in AVT. Extensive experiments and ablation studies on three public datasets and two in-house datasets consistently demonstrate the effectiveness of the proposed AVT. Specifically, AVT outperforms its previous state-of-the-art counterparts on Kinetics-Sounds by 8%. By leveraging the audio signal, AVT also surpasses one of the previous state-of-the-art video Transformers [25] by 10% on VGGSound. Compared to one of the previous state-of-the-art multimodal methods, MBT [32], AVT is 1.3% more efficient in terms of FLOPs and improves the accuracy by 3.8% on Epic-Kitchens-100.
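
To make the cost argument for bottleneck fusion concrete, the following is a minimal sketch that counts query-key pairs per self-attention layer under two fusion schemes. It assumes an MBT-style layout in which each modality attends only to its own tokens plus a small set of shared bottleneck tokens; the abstract does not specify AVT's exact layer design, and the function names and token counts below are hypothetical, chosen only for illustration.

```python
def attention_pairs_concat(n_audio, n_video):
    """Query-key pairs per layer when audio and video tokens are
    concatenated into one sequence and attend to each other fully."""
    n_total = n_audio + n_video
    return n_total * n_total


def attention_pairs_bottleneck(n_audio, n_video, n_bottleneck):
    """Query-key pairs per layer when each modality attends only to its
    own tokens plus a shared set of bottleneck tokens (assumed layout)."""
    audio_pairs = (n_audio + n_bottleneck) ** 2
    video_pairs = (n_video + n_bottleneck) ** 2
    return audio_pairs + video_pairs


# Hypothetical token counts, not taken from the paper.
n_audio, n_video, n_bottleneck = 400, 1568, 4

concat = attention_pairs_concat(n_audio, n_video)
bottleneck = attention_pairs_bottleneck(n_audio, n_video, n_bottleneck)
ratio = bottleneck / concat

print(f"concatenation: {concat} pairs")
print(f"bottleneck:    {bottleneck} pairs")
print(f"bottleneck / concatenation = {ratio:.3f}")
```

The saving comes from dropping the direct audio-to-video cross terms: cross-modal information can flow only through the few bottleneck tokens, so the cost grows with the square of each modality's length separately rather than with the square of their sum. This counts attention pairs only and is not a FLOPs estimate for the full model.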
