Time-domain single-channel speech enhancement (SE) still struggles to extract the target speaker in multi-talker conditions when no prior information is available. Auditory attention decoding has shown that the listener's brain activity contains auditory information about the attended speaker. In this paper, we therefore propose a novel time-domain brain-assisted SE network (BASEN) that incorporates electroencephalography (EEG) signals recorded from the listener to extract the target speaker from monaural speech mixtures. BASEN is built on the fully-convolutional time-domain audio separation network. To fully leverage the complementary information contained in the EEG signals, we further propose a convolutional multi-layer cross-attention module that fuses the dual-branch features. Experimental results on a public dataset show that the proposed model outperforms the state-of-the-art method on several evaluation metrics. The reproducible code is available at https://github.com/jzhangU/Basen.git.
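To make the idea of cross-attention fusion between the two branches concrete, the following is a minimal sketch of a single scaled dot-product cross-attention step in which audio-branch frames attend to EEG-branch frames. It is not the BASEN module itself: the convolutional layers, learned projections, and multi-layer stacking are omitted, and the function names (`cross_attention`, `fuse_audio_with_eeg`), the residual fusion, and the toy feature values are assumptions made purely for illustration. Plain Python is used so the sketch runs without third-party libraries.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Scaled dot-product attention with queries from one branch and
    keys/values from the other (no learned projections in this sketch).

    queries: T_q x d matrix, context: T_c x d matrix.
    Returns the attended T_q x d features and the T_q x T_c weights.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    context_t = [list(col) for col in zip(*context)]  # d x T_c
    scores = matmul(queries, context_t)               # T_q x T_c
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, context), weights


def fuse_audio_with_eeg(audio_feats, eeg_feats):
    """Hypothetical fusion: audio frames attend to EEG frames, and the
    attended EEG information is added back to the audio features."""
    attended, weights = cross_attention(audio_feats, eeg_feats)
    fused = [[a + b for a, b in zip(a_row, e_row)]
             for a_row, e_row in zip(audio_feats, attended)]
    return fused, weights


# Toy features (made up): 3 audio frames and 2 EEG frames, 2 channels each.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
eeg = [[1.0, 0.0], [0.0, 1.0]]
fused, weights = fuse_audio_with_eeg(audio, eeg)
```

Here each row of `weights` is a distribution over EEG frames for one audio frame, so the fused output keeps the audio branch's time resolution even when the two branches have different numbers of frames.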