In recent years, researchers have combined audio and video signals to handle cases in which actions are not well represented or captured by visual cues alone. However, how to leverage the two modalities effectively remains an open problem. In this work, we develop a multiscale multimodal Transformer (MMT) that leverages hierarchical representation learning. In particular, MMT is composed of a novel multiscale audio Transformer (MAT) and a multiscale video Transformer [43]. To learn a discriminative cross-modality fusion, we further design multimodal supervised contrastive objectives, called audio-video contrastive loss (AVC) and intra-modal contrastive loss (IMC), that robustly align the two modalities. Without external training data, MMT surpasses previous state-of-the-art approaches by 7.3% and 2.1% in top-1 accuracy on Kinetics-Sounds and VGGSound, respectively. Moreover, the proposed MAT significantly outperforms AST [28] by 22.2%, 4.4% and 4.7% on three public benchmark datasets, and is about 3% more efficient in terms of FLOPs and 9.8% more efficient in terms of GPU memory usage.
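The abstract names a supervised audio-video contrastive objective but does not give its formula. As a rough illustration only, the sketch below implements a generic supervised contrastive loss across modalities: each audio embedding is pulled toward video embeddings that share its class label and pushed away from the rest. The function names, the temperature value, and the symmetric averaging are assumptions made for this example, not the paper's definition of AVC.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_modal_supcon(anchors, candidates, labels, temperature=0.1):
    """Generic supervised contrastive loss (hypothetical stand-in for AVC).

    anchors[i] and candidates[i] are embeddings of the same clip in two
    modalities, and labels[i] is its class. For anchor i, every candidate j
    with labels[j] == labels[i] counts as a positive.
    """
    a = [l2_normalize(x) for x in anchors]
    c = [l2_normalize(x) for x in candidates]
    total = 0.0
    for i, a_i in enumerate(a):
        # Temperature-scaled cosine similarities to all candidates.
        logits = [dot(a_i, c_j) / temperature for c_j in c]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # The paired clip i is always a positive, so this list is never empty.
        positives = [j for j, y in enumerate(labels) if y == labels[i]]
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
    return total / len(a)


def symmetric_av_loss(audio, video, labels, temperature=0.1):
    """Average the audio-to-video and video-to-audio directions (an assumption)."""
    return 0.5 * (cross_modal_supcon(audio, video, labels, temperature)
                  + cross_modal_supcon(video, audio, labels, temperature))


if __name__ == "__main__":
    # Toy 2-D embeddings: clips 0 and 1 share a class, clip 2 is a different class.
    audio = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    video = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
    print(symmetric_av_loss(audio, video, labels=[0, 0, 1]))
```

An intra-modal variant could in principle reuse the same function with both arguments drawn from one modality, though the paper's IMC formulation may differ.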