Speech and language models trained through self-supervised learning (SSL) demonstrate strong alignment with brain activity during speech and language perception. However, given their distinct training modalities, it remains unclear whether they capture the same aspects of neural activity. We directly address this question by evaluating the brain prediction performance of two representative SSL models, Wav2Vec2.0 and GPT-2, designed for speech and language tasks, respectively. Our findings reveal that both models accurately predict speech responses in the auditory cortex, and that their brain predictions are significantly correlated with each other. Notably, speech contextual information shared between Wav2Vec2.0 and GPT-2 accounts for the majority of explained variance in brain activity, surpassing static semantic and lower-level acoustic-phonetic information. These results underscore the convergence of speech contextual representations in SSL models and their alignment with the neural network underlying speech perception, offering valuable insights into both SSL models and the neural basis of speech and language processing.
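The distinction between variance explained by shared information and variance explained by model-specific information can be illustrated with variance partitioning over nested regression models. The following is a minimal sketch on synthetic data, not the authors' pipeline: the feature matrices `A` and `B` stand in for Wav2Vec2.0 and GPT-2 features, the response `y` stands in for a single recording site, and the ridge penalty, the train/test split, and all function names are assumptions made for illustration.

```python
import numpy as np


def ridge_fit_predict(X_train, y_train, X_test, alpha=1.0):
    """Closed-form ridge regression (no intercept; data assumed zero-mean)."""
    n_features = X_train.shape[1]
    gram = X_train.T @ X_train + alpha * np.eye(n_features)
    weights = np.linalg.solve(gram, X_train.T @ y_train)
    return X_test @ weights


def r2(y_true, y_pred):
    """Coefficient of determination on held-out data."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def partition_variance(A, B, y, n_train, alpha=1.0):
    """Split held-out R^2 into parts unique to A, unique to B, and shared."""

    def held_out_r2(X):
        pred = ridge_fit_predict(X[:n_train], y[:n_train], X[n_train:], alpha)
        return r2(y[n_train:], pred)

    r2_a = held_out_r2(A)
    r2_b = held_out_r2(B)
    r2_ab = held_out_r2(np.hstack([A, B]))  # joint model using both feature sets
    return {
        "unique_a": r2_ab - r2_b,       # what A adds beyond B
        "unique_b": r2_ab - r2_a,       # what B adds beyond A
        "shared": r2_a + r2_b - r2_ab,  # what either model alone can explain
        "joint": r2_ab,
    }


# Synthetic stand-ins (assumption): both feature sets contain the same latent
# "contextual" component plus model-specific dimensions.
rng = np.random.default_rng(0)
n_samples = 2000
latent = rng.standard_normal((n_samples, 5))
A = np.hstack([latent, rng.standard_normal((n_samples, 5))])
B = np.hstack([latent, rng.standard_normal((n_samples, 5))])

# The simulated response is driven mostly by the shared latent component.
y = (
    latent.sum(axis=1)
    + 0.5 * A[:, 5]
    + 0.5 * B[:, 5]
    + 0.5 * rng.standard_normal(n_samples)
)

parts = partition_variance(A, B, y, n_train=1500)
for name, value in parts.items():
    print(f"{name}: {value:.3f}")
```

In this toy setup the shared component dominates because the simulated response was constructed that way; with real recordings, the regularization strength would normally be chosen by cross-validation, and the partition would be computed separately for each electrode or voxel.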