Transformer-based models have substantially improved performance across a range of multimodal understanding tasks, such as visual question answering and action recognition. However, multimodal Transformers suffer from the quadratic complexity of multi-head attention in the input sequence length, a cost that grows as the number of modalities increases. To address this, we introduce the Low-Cost Multimodal Transformer (LoCoMT), a novel multimodal attention mechanism that aims to reduce computational cost during training and inference with minimal performance loss. Specifically, by assigning different multimodal attention patterns to the individual attention heads, LoCoMT can flexibly control multimodal signals and theoretically guarantees a lower computational cost than existing multimodal Transformer variants. Experimental results on two multimodal datasets, Audioset and MedVidCL, demonstrate that LoCoMT not only reduces GFLOPs but also matches or even outperforms established models.
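To make the cost argument concrete, the following is a minimal sketch of how per-head attention patterns can lower the attention cost of a layer. It is an illustration under stated assumptions, not the LoCoMT implementation: the three patterns (`joint`, `self_only`, `cross_only`), the helper names, the sequence lengths, and the FLOP approximation are all hypothetical choices that the abstract does not specify.

```python
from typing import Dict, List

# A pattern maps each query modality to the key modalities it may attend to.
# These pattern definitions are illustrative assumptions, not taken from the paper.
Pattern = Dict[str, List[str]]


def attended_pairs(pattern: Pattern, lengths: Dict[str, int]) -> int:
    """Count the query-key pairs that one head has to score under a pattern."""
    return sum(
        lengths[query] * sum(lengths[key] for key in keys)
        for query, keys in pattern.items()
    )


def head_flops(pattern: Pattern, lengths: Dict[str, int], d_head: int) -> int:
    """Rough FLOP estimate for one head.

    Assumes about 2 * d_head FLOPs per pair for the QK^T scores and another
    2 * d_head for the weighted sum over V; projections and softmax are ignored.
    """
    return 4 * d_head * attended_pairs(pattern, lengths)


def layer_flops(patterns: List[Pattern], lengths: Dict[str, int], d_head: int) -> int:
    """Sum the per-head estimates over all heads of one attention layer."""
    return sum(head_flops(p, lengths, d_head) for p in patterns)


# Hypothetical token counts per modality and head dimension.
lengths = {"audio": 100, "video": 200}
d_head = 64

# Every token attends to every token of both modalities (the usual baseline).
joint: Pattern = {"audio": ["audio", "video"], "video": ["audio", "video"]}
# Each modality attends only to itself.
self_only: Pattern = {"audio": ["audio"], "video": ["video"]}
# Each modality attends only to the other one.
cross_only: Pattern = {"audio": ["video"], "video": ["audio"]}

baseline_heads = [joint] * 4
mixed_heads = [joint, self_only, self_only, cross_only]

baseline = layer_flops(baseline_heads, lengths, d_head)
mixed = layer_flops(mixed_heads, lengths, d_head)
print(f"baseline: {baseline} FLOPs, mixed: {mixed} FLOPs, ratio: {mixed / baseline:.3f}")
```

Because every restricted pattern scores a subset of the query-key pairs that joint attention scores, a layer that mixes patterns can never cost more than the all-joint baseline under this estimate, which is the intuition behind the theoretical guarantee mentioned above. How much is saved depends on which patterns are assigned to which heads and on the relative sequence lengths of the modalities.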