Toward a realistic model of speech processing in the brain with self-supervised learning

Several deep neural networks have recently been shown to generate activations similar to those of the brain in response to the same input. These algorithms, however, remain largely implausible as models of the brain: they require (1) extraordinarily large amounts of data, (2) unobtainable supervised labels, (3) textual rather than raw sensory input, and/or (4) implausibly large memory (e.g., thousands of contextual words). These limitations highlight the need to identify algorithms that, under these constraints, would suffice to account for both behavioral and brain responses. Focusing on speech processing, we hypothesize that self-supervised algorithms trained on the raw waveform constitute a promising candidate. Specifically, we compare a recent self-supervised architecture, Wav2Vec 2.0, to the brain activity of 412 English, French, and Mandarin speakers recorded with functional Magnetic Resonance Imaging (fMRI) while they listened to approximately one hour of audiobooks. Our results are fourfold. First, we show that this algorithm learns brain-like representations with as little as 600 hours of unlabelled speech, a quantity comparable to what infants can be exposed to during language acquisition. Second, its functional hierarchy aligns with the cortical hierarchy of speech processing. Third, different training regimes reveal a functional specialization akin to that of the cortex: Wav2Vec 2.0 learns sound-generic, speech-specific, and language-specific representations similar to those of the prefrontal and temporal cortices. Fourth, we confirm the similarity of this specialization with the behavior of 386 additional participants. These results, drawn from the largest neuroimaging benchmark to date, show how self-supervised learning can account for a rich organization of speech processing in the brain, and thus delineate a path to identify the laws of language acquisition that shape the human brain.
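The abstract does not spell out how model activations are compared with fMRI responses. One common way to make such a comparison is a linear encoding analysis: fit a ridge regression from the activations to each voxel's time course, then correlate predicted and held-out responses. The sketch below illustrates that idea on synthetic data only. The function names (`ridge_fit`, `pearson_per_column`, `brain_score`, `demo`), the regularization strength, and the train/test split are all assumptions made for illustration, not the authors' pipeline, and the sketch ignores details such as the hemodynamic delay.

```python
import numpy as np


def ridge_fit(X, Y, alpha):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)


def pearson_per_column(A, B):
    """Pearson correlation between matching columns of A and B."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    denom = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=0)
    return (A * B).sum(axis=0) / denom


def brain_score(activations, voxels, alpha=1.0, train_fraction=0.8):
    """Fit on the first part of the recording, score on the held-out rest.

    activations: (n_samples, n_features) model activations (hypothetical input)
    voxels:      (n_samples, n_voxels) brain responses (hypothetical input)
    Returns one correlation per voxel.
    """
    n_train = int(train_fraction * len(activations))
    X_train, X_test = activations[:n_train], activations[n_train:]
    Y_train, Y_test = voxels[:n_train], voxels[n_train:]

    # Standardize features using training statistics only.
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0) + 1e-8
    X_train = (X_train - mu) / sd
    X_test = (X_test - mu) / sd

    # Center the targets so no explicit intercept column is needed.
    Y_mean = Y_train.mean(axis=0)
    W = ridge_fit(X_train, Y_train - Y_mean, alpha)
    Y_pred = X_test @ W + Y_mean
    return pearson_per_column(Y_pred, Y_test)


def demo(seed=0):
    """Synthetic check: one voxel driven by the activations, one pure noise."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = 400, 16
    X = rng.standard_normal((n_samples, n_features))
    w = rng.standard_normal(n_features)
    driven = X @ w + 0.5 * rng.standard_normal(n_samples)
    unrelated = rng.standard_normal(n_samples)
    Y = np.column_stack([driven, unrelated])
    return brain_score(X, Y)


if __name__ == "__main__":
    print(demo())
```

In this toy setting, the voxel that depends linearly on the activations should receive a high score and the unrelated voxel a score near zero. With real data, the rows of `activations` would have to be aligned to the fMRI sampling times before any such fit.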
