We use (multi)modal deep neural networks (DNNs) to probe for sites of multimodal integration in the human brain by predicting stereoencephalography (SEEG) recordings taken while human subjects watched movies. We operationalize sites of multimodal integration as regions where a multimodal vision-language model predicts recordings better than unimodal language, unimodal vision, or linearly integrated language-vision models. Our target DNN models span different architectures (e.g., convolutional networks and transformers) and multimodal training techniques (e.g., cross-attention and contrastive learning). As a key enabling step, we first demonstrate that trained vision and language models systematically outperform their randomly initialized counterparts in their ability to predict SEEG signals. We then compare unimodal and multimodal models against one another. Because our target DNN models often differ in architecture, number of parameters, and training set (possibly obscuring differences attributable to integration), we carry out a controlled comparison of two models (SLIP and SimCLR) that hold all of these attributes constant aside from input modality. Using this approach, we identify a sizable number of neural sites (on average 141 out of 1090 total sites, or 12.94%) and brain regions where multimodal integration appears to occur. Additionally, we find that, among the variants of multimodal training techniques we assess, CLIP-style training is best suited for downstream prediction of neural activity at these sites.
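For illustration, the following is a minimal sketch of this operationalization for a single neural site, not the paper's actual pipeline: fit a ridge regression from each model's features to the site's SEEG response and flag the site if the multimodal model's held-out score beats the unimodal and concatenated (linearly integrated) baselines. All function names, the choice of ridge regression, and the scoring setup are illustrative assumptions.

```python
# Hypothetical sketch of the multimodal-site criterion (assumed setup, not the paper's code).
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

def held_out_score(features, response, alphas=np.logspace(-2, 4, 7)):
    """Mean cross-validated R^2 of a ridge regression from DNN features
    (n_events x n_dims) to one electrode's response (n_events,)."""
    model = RidgeCV(alphas=alphas)
    return cross_val_score(model, features, response, cv=5, scoring="r2").mean()

def is_multimodal_site(X_multi, X_vision, X_language, response):
    """Flag a site if the multimodal model predicts its response better than
    unimodal vision, unimodal language, and their concatenation."""
    scores = {
        "multimodal": held_out_score(X_multi, response),
        "vision": held_out_score(X_vision, response),
        "language": held_out_score(X_language, response),
        "linear_integration": held_out_score(
            np.hstack([X_vision, X_language]), response),
    }
    best_baseline = max(v for k, v in scores.items() if k != "multimodal")
    return scores["multimodal"] > best_baseline, scores
```

In the study itself, such per-site comparisons would additionally require statistical testing rather than a raw score comparison; the sketch only conveys the structure of the contrast between multimodal and unimodal/linearly integrated predictors.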