Multimodality and Attention Increase Alignment in Natural Language Prediction Between Humans and Computational Models

The potential of multimodal generative artificial intelligence (mAI) to replicate human grounded language understanding, including the pragmatic, context-rich aspects of communication, remains to be clarified. Humans are known to use salient multimodal features, such as visual cues, to facilitate the processing of upcoming words. Correspondingly, multimodal computational models can integrate visual and linguistic data using a visual attention mechanism to assign next-word probabilities. To test whether these processes align, we tasked both human participants (N = 200) and several state-of-the-art computational models with evaluating the predictability of forthcoming words after being presented with short audio-only or audio-visual clips containing speech. During the task, the models' attention weights were recorded and human attention was indexed via eye tracking. Results show that predictability estimates from humans aligned more closely with scores generated by multimodal models than with scores from their unimodal counterparts. Furthermore, including an attention mechanism doubled alignment with human judgments when visual and linguistic context facilitated predictions. In these cases, the models' attention patches overlapped significantly with human eye-tracking data. Our results indicate that improved modeling of naturalistic language processing in mAI does not merely depend on training diet but can be driven by multimodality in combination with attention-based architectures. Humans and computational models alike can leverage the predictive constraints of multimodal information by attending to relevant features in the input.
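To make the two comparisons above concrete, the following is a minimal sketch of how human-model alignment and attention-gaze overlap could be quantified. It rests on assumptions: the abstract does not state which statistics were used, so Spearman rank correlation and a "fraction of fixations inside the top-k attention patches" score are stand-ins chosen for illustration, and all function names (`average_ranks`, `spearman`, `gaze_in_top_patches`) are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation (assumed alignment measure, not the paper's)."""
    return pearson(average_ranks(x), average_ranks(y))


def gaze_in_top_patches(attention, fixations, k):
    """Fraction of fixations that land in the k highest-weight patches.

    attention: 2D list of attention weights, one per image patch.
    fixations: list of (row, col) patch indices, one per fixation.
    """
    cells = [(w, (r, c))
             for r, row in enumerate(attention)
             for c, w in enumerate(row)]
    ranked = sorted(cells, key=lambda t: t[0], reverse=True)
    top = {pos for _, pos in ranked[:k]}
    return sum(1 for f in fixations if f in top) / len(fixations)


if __name__ == "__main__":
    # Toy values for illustration only; these are not data from the study.
    human = [0.9, 0.2, 0.6, 0.4, 0.8]        # human predictability ratings
    multimodal = [0.7, 0.1, 0.5, 0.3, 0.9]   # multimodal model scores
    unimodal = [0.3, 0.4, 0.2, 0.5, 0.6]     # unimodal model scores
    print("multimodal alignment:", spearman(human, multimodal))
    print("unimodal alignment:", spearman(human, unimodal))

    attention = [[0.1, 0.6], [0.2, 0.1]]     # 2x2 grid of patch weights
    fixations = [(0, 1), (0, 1), (1, 0), (0, 0)]
    print("gaze overlap:", gaze_in_top_patches(attention, fixations, k=2))
```

Under this sketch, a higher rank correlation for the multimodal scores than for the unimodal scores would correspond to the closer alignment reported above, and a high overlap fraction would correspond to attention patches coinciding with where participants looked.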
