Despite known differences between reading and listening in the brain, recent work has shown that text-based language models predict both text-evoked and speech-evoked brain activity to an impressive degree. This poses the question of what types of information language models truly predict in the brain. We investigate this question via a direct approach, in which we eliminate information related to specific low-level stimulus features (textual, speech, and visual) in the language model representations, and observe how this intervention affects the alignment with fMRI brain recordings acquired while participants read versus listened to the same naturalistic stories. We further contrast our findings with speech-based language models, which would be expected to predict speech-evoked brain activity better, provided they model language processing in the brain well. Using our direct approach, we find that both text-based and speech-based language models align well with early sensory regions due to shared low-level features. Text-based models continue to align well with later language regions even after removing these features, while, surprisingly, speech-based models lose most of their alignment. These findings suggest that speech-based models can be further improved to better reflect brain-like language processing.
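The feature-removal intervention described above can be illustrated with a minimal sketch on synthetic data. The abstract does not specify how features are removed or how alignment is measured, so the choices here are assumptions: a linear (ridge) regression is used to subtract the part of the model representations that is predictable from the low-level features, and alignment is scored as the per-voxel Pearson correlation of a ridge encoding model on held-out data. All function names, dimensions, and the simulated "early" and "late" voxels are hypothetical and serve only to show the logic.

```python
import numpy as np


def ridge_fit(X, Y, alpha=1.0):
    """Closed-form ridge weights W minimizing ||XW - Y||^2 + alpha * ||W||^2."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ Y)


def remove_features(X_train, X_test, F_train, F_test, alpha=1.0):
    """Subtract the part of X that is linearly predictable from features F.

    The mapping F -> X is fit on the training split only, then applied to both splits.
    """
    W = ridge_fit(F_train, X_train, alpha)
    return X_train - F_train @ W, X_test - F_test @ W


def pearson_per_column(A, B):
    """Pearson correlation between matching columns of A and B."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    return (A * B).sum(axis=0) / (np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=0))


def brain_alignment(X_train, X_test, Y_train, Y_test, alpha=1.0):
    """Fit a ridge encoding model X -> Y and score each voxel on held-out data."""
    W = ridge_fit(X_train, Y_train, alpha)
    return pearson_per_column(X_test @ W, Y_test)


def run_demo(n=600, n_train=450, seed=0):
    rng = np.random.default_rng(seed)
    # Assumption: synthetic zero-mean data, so no intercept term is fitted.
    F = rng.standard_normal((n, 5))  # low-level stimulus features
    S = rng.standard_normal((n, 5))  # higher-level content, independent of F
    # Model representations mix both sources of information.
    X = (F @ rng.standard_normal((5, 20))
         + S @ rng.standard_normal((5, 20))
         + 0.1 * rng.standard_normal((n, 20)))
    # Hypothetical "early sensory" voxels depend on F, "late language" voxels on S.
    Y_early = F @ rng.standard_normal((5, 10)) + 0.5 * rng.standard_normal((n, 10))
    Y_late = S @ rng.standard_normal((5, 10)) + 0.5 * rng.standard_normal((n, 10))
    Y = np.hstack([Y_early, Y_late])

    tr, te = slice(0, n_train), slice(n_train, n)
    before = brain_alignment(X[tr], X[te], Y[tr], Y[te])
    Xr_tr, Xr_te = remove_features(X[tr], X[te], F[tr], F[te])
    after = brain_alignment(Xr_tr, Xr_te, Y[tr], Y[te])
    return {
        "early_before": float(before[:10].mean()),
        "late_before": float(before[10:].mean()),
        "early_after": float(after[:10].mean()),
        "late_after": float(after[10:].mean()),
    }


if __name__ == "__main__":
    for name, value in run_demo().items():
        print(f"{name}: {value:.3f}")
```

In this toy setup, removing the low-level features should sharply reduce alignment with the simulated early voxels while leaving alignment with the simulated late voxels largely intact. This mirrors the pattern the abstract reports for text-based models, but it is a property of how the synthetic data were constructed, not evidence about real models or recordings.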