It has been shown in the literature that speech representations extracted by self-supervised pre-trained models exhibit similarities with human brain activations during speech perception, and that fine-tuning speech representation models on downstream tasks can further increase this similarity. However, it remains unclear whether this similarity can be used to optimize pre-trained speech models. In this work, we therefore propose to use brain activations recorded by fMRI to refine the widely used wav2vec2.0 model by aligning its representations with human neural responses. Experimental results on SUPERB reveal that this operation is beneficial for several downstream tasks, e.g., speaker verification, automatic speech recognition, and intent classification. The proposed method can thus be considered a new alternative for improving self-supervised speech models.
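The core idea of aligning model representations with neural responses can be sketched as a small auxiliary objective. The following is a minimal, hypothetical illustration (not the paper's actual implementation): a learned linear map projects wav2vec2.0-style hidden states into fMRI voxel space, and a cosine-based alignment loss pulls the projected features toward the recorded responses. The names `AlignmentHead` and `align_loss`, the dimensions, and the toy random data are all assumptions for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: align model hidden states with fMRI responses.
# Assumes hidden states and fMRI frames are already time-aligned;
# all names and dimensions here are illustrative, not from the paper.

class AlignmentHead(nn.Module):
    """Learned linear map from model hidden space to fMRI voxel space."""
    def __init__(self, hidden_dim: int, n_voxels: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, n_voxels)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden_states)

def align_loss(pred: torch.Tensor, fmri: torch.Tensor) -> torch.Tensor:
    # 1 - mean cosine similarity between predicted and recorded responses
    cos = nn.functional.cosine_similarity(pred, fmri, dim=-1)
    return (1.0 - cos).mean()

# Toy example with random tensors standing in for real data.
torch.manual_seed(0)
hidden = torch.randn(8, 768)   # 8 frames of wav2vec2.0-style features
fmri = torch.randn(8, 1000)    # matching fMRI responses over 1000 voxels

head = AlignmentHead(768, 1000)
opt = torch.optim.Adam(head.parameters(), lr=1e-2)

losses = []
for _ in range(50):
    opt.zero_grad()
    loss = align_loss(head(hidden), fmri)
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

In an actual fine-tuning setup, a loss of this kind would be combined with (or replace) the model's original objective and back-propagated into the pre-trained encoder, rather than training only the projection head as in this toy.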