Vision-Language Integration in Multimodal Video Transformers (Partially) Aligns with the Brain

Integrating information from multiple modalities is arguably one of the essential prerequisites for grounding artificial intelligence systems in an understanding of the real world. Recent advances in video transformers that jointly learn from vision, text, and sound over time have made some progress toward this goal, but the degree to which these models integrate information across modalities remains unclear. In this work, we present a promising approach for probing a pre-trained multimodal video transformer model by leveraging neuroscientific evidence of multimodal information processing in the brain. Using brain recordings of participants watching a popular TV show, we analyze how multimodal connections and interactions in a pre-trained multimodal video transformer affect its alignment with unimodal and multimodal brain regions. We find evidence that vision enhances masked prediction performance during language processing, which supports the view that cross-modal representations in models can benefit individual modalities. However, we do not find evidence that the joint multimodal transformer representations capture brain-relevant information beyond that captured by all of the individual modalities. Finally, we show that the brain alignment of the pre-trained joint representation can be improved by fine-tuning on a task that requires vision-language inferences. Overall, our results paint an optimistic picture of the ability of multimodal transformers to integrate vision and language in partially brain-relevant ways, but they also show that improving the brain alignment of these models may require new approaches.

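
The comparison between joint and individual-modality representations can be illustrated with a small encoding-model sketch. This is not the pipeline used in this work: it assumes a plain ridge regression from model features to brain responses, a single train/test split, and fully synthetic data. All names (`ridge_fit`, `brain_alignment`, `demo`) and all dimensions are hypothetical. The joint representation is constructed as a linear mixture of the unimodal features, so by design it adds no predictive information beyond them, mirroring the kind of null result described above.

```python
import numpy as np


def ridge_fit(X, Y, alpha=1.0):
    """Closed-form ridge weights mapping features X to responses Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)


def mean_voxel_correlation(Y_true, Y_pred):
    """Pearson correlation per response channel, averaged over channels."""
    a = Y_true - Y_true.mean(axis=0)
    b = Y_pred - Y_pred.mean(axis=0)
    r = (a * b).sum(axis=0) / np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0))
    return float(r.mean())


def brain_alignment(features, brain, n_train, alpha=1.0):
    """Fit on the first n_train samples, score predictions on the rest."""
    x_mu = features[:n_train].mean(axis=0)
    y_mu = brain[:n_train].mean(axis=0)
    W = ridge_fit(features[:n_train] - x_mu, brain[:n_train] - y_mu, alpha)
    pred = (features[n_train:] - x_mu) @ W + y_mu
    return mean_voxel_correlation(brain[n_train:], pred)


def demo(seed=0):
    """Synthetic example (assumed sizes): 400 time points, 20 response channels."""
    rng = np.random.default_rng(seed)
    n, n_train = 400, 300
    vision = rng.standard_normal((n, 8))
    language = rng.standard_normal((n, 8))
    unimodal = np.hstack([vision, language])
    # Hypothetical joint representation: a lossy linear mix of both modalities.
    joint = unimodal @ rng.standard_normal((16, 6))
    # Synthetic "brain" responses driven by the unimodal features plus noise.
    brain = unimodal @ rng.standard_normal((16, 20)) + 0.5 * rng.standard_normal((n, 20))

    scores = {
        "vision": brain_alignment(vision, brain, n_train),
        "language": brain_alignment(language, brain, n_train),
        "unimodal_concat": brain_alignment(unimodal, brain, n_train),
        "joint": brain_alignment(joint, brain, n_train),
        "joint_plus_unimodal": brain_alignment(np.hstack([unimodal, joint]), brain, n_train),
    }
    # Alignment gained by adding the joint features on top of the unimodal ones.
    scores["joint_gain"] = scores["joint_plus_unimodal"] - scores["unimodal_concat"]
    return scores


if __name__ == "__main__":
    for name, value in demo().items():
        print(f"{name}: {value:.3f}")
```

In this toy setting, `joint_gain` stays close to zero because the joint features are a linear function of the unimodal ones. With real recordings, an analysis of this kind would additionally need cross-validated regularization and a model of the temporal delay between stimulus and response, which the sketch omits.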