Attention-based models are appealing for multimodal processing because inputs from multiple modalities can be concatenated and fed to a single backbone network, thus requiring very little fusion engineering. The resulting representations are, however, fully entangled throughout the network, which may not always be desirable: during learning, contrastive audio-visual self-supervised learning requires independent audio and visual features to operate, otherwise learning collapses; during inference, it should be possible to evaluate audio-visual models on benchmarks that have only audio or only video. In this paper, we introduce Zorro, a technique that uses masks to control how inputs from each modality are routed inside Transformers, keeping some parts of the representation modality-pure. We apply this technique to three popular Transformer-based architectures (ViT, Swin and HiP) and show that, with contrastive pre-training, Zorro achieves state-of-the-art results on most relevant benchmarks for multimodal tasks (AudioSet and VGGSound). Furthermore, the resulting models are able to perform unimodal inference on both video and audio benchmarks such as Kinetics-400 and ESC-50.
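The masking idea can be illustrated with a small sketch. The abstract does not specify the mask layout, so the following is an assumption made for illustration only: tokens are ordered as audio, video, then a shared "fusion" group; audio and video queries may attend only to keys of their own modality, while fusion queries may attend to everything. The names `build_modality_mask` and `masked_attention` are hypothetical and not taken from the paper, and plain Python lists stand in for tensors.

```python
import math


def build_modality_mask(n_audio, n_video, n_fusion):
    """Return mask[q][k] == True when query token q may attend to key token k.

    Assumed token order (illustrative, not from the paper): [audio | video | fusion].
    """
    groups = ["audio"] * n_audio + ["video"] * n_video + ["fusion"] * n_fusion
    # Fusion queries see every key; unimodal queries see only their own modality,
    # which keeps the audio and video token streams modality-pure.
    return [[q == "fusion" or q == k for k in groups] for q in groups]


def masked_attention(queries, keys, values, mask):
    """Single-head scaled dot-product attention restricted by a boolean mask."""
    dim = len(queries[0])
    outputs = []
    for qi, q in enumerate(queries):
        # Keep only the keys this query is allowed to attend to.
        allowed = [ki for ki in range(len(keys)) if mask[qi][ki]]
        scores = [
            sum(a * b for a, b in zip(q, keys[ki])) / math.sqrt(dim)
            for ki in allowed
        ]
        # Numerically stable softmax over the allowed keys only.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * values[ki][d] for w, ki in zip(weights, allowed))
            for d in range(len(values[0]))
        ])
    return outputs


if __name__ == "__main__":
    # Two audio tokens, one video token, one fusion token (toy 2-D features).
    mask = build_modality_mask(n_audio=2, n_video=1, n_fusion=1)
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
    for row in masked_attention(tokens, tokens, tokens, mask):
        print(row)
```

Under this assumed layout, changing the video token leaves the outputs at the audio positions untouched, so the audio stream could be read out on its own for unimodal inference or used as an independent branch in a contrastive loss, while the fusion position still mixes both modalities.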