Auditory attention decoding (AAD) identifies, and can then amplify, the talker a listener is attending to in a noisy environment. It works by comparing the listener's brain activity with a representation of each sound source and selecting the closest match. The representation is typically the waveform or spectrogram of the sounds, but how well these representations serve AAD is uncertain. In this study, we examined whether self-supervised speech representations improve the accuracy and speed of AAD. We recorded the brain activity of three subjects using invasive electrocorticography (ECoG) as they listened to two conversations and focused on one. We used WavLM to extract a latent representation of each talker and trained a spatiotemporal filter to map brain activity to intermediate representations of speech. During evaluation, the reconstructed representation was compared with each talker's representation to determine the target talker. Our results indicate that speech representations from WavLM provide better decoding accuracy and speed than the speech envelope and spectrogram. These findings demonstrate the advantages of self-supervised speech representations for auditory attention decoding and pave the way for brain-controlled hearable technologies.
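The decoding pipeline described above (fit a spatiotemporal filter from neural data to a speech representation, then pick the talker whose representation best matches the reconstruction) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the study: the filter is fit with ridge regression over time-lagged channels, similarity is the mean Pearson correlation across feature dimensions, and the "neural" and "speech" arrays are random synthetic stand-ins rather than ECoG recordings or WavLM features. All function names are hypothetical.

```python
import numpy as np


def lagged(neural, n_lags):
    """Stack time-lagged copies of neural data: (T, C) -> (T, C * n_lags)."""
    T, C = neural.shape
    X = np.zeros((T, C * n_lags))
    for k in range(n_lags):
        # Neural responses follow the stimulus, so row t uses neural[t + k].
        X[: T - k, k * C:(k + 1) * C] = neural[k:]
    return X


def fit_ridge(X, Y, alpha):
    """Closed-form ridge regression (an assumed choice of estimator)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)


def mean_corr(A, B):
    """Mean Pearson correlation across feature dimensions."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    num = (A * B).sum(axis=0)
    den = np.sqrt((A ** 2).sum(axis=0) * (B ** 2).sum(axis=0)) + 1e-12
    return float((num / den).mean())


def decode_attention(neural, feats_a, feats_b, W, n_lags):
    """Reconstruct speech features and return the better-matching talker."""
    recon = lagged(neural, n_lags) @ W
    score_a = mean_corr(recon, feats_a)
    score_b = mean_corr(recon, feats_b)
    return ("A" if score_a > score_b else "B"), score_a, score_b


def run_demo(seed=0):
    """Synthetic demo: talker A is attended, talker B is weakly represented."""
    rng = np.random.default_rng(seed)
    T, C, D, n_lags = 2000, 16, 8, 5
    feats_a = rng.standard_normal((T, D))  # stand-in for attended-talker features
    feats_b = rng.standard_normal((T, D))  # stand-in for ignored-talker features
    mix_a = rng.standard_normal((D, C))
    mix_b = rng.standard_normal((D, C))
    neural = feats_a @ mix_a + 0.3 * feats_b @ mix_b + rng.standard_normal((T, C))

    half = T // 2
    # Train the filter on the first half, using the attended talker as target.
    W = fit_ridge(lagged(neural[:half], n_lags), feats_a[:half], alpha=1.0)
    # Evaluate on held-out data.
    return decode_attention(neural[half:], feats_a[half:], feats_b[half:], W, n_lags)


if __name__ == "__main__":
    label, score_a, score_b = run_demo()
    print(label, round(score_a, 3), round(score_b, 3))
```

In a real setting the feature arrays would be replaced by time-aligned representations of each talker (envelope, spectrogram, or a learned representation), and decoding speed could be probed by shortening the evaluation window passed to `decode_attention`.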