Auditory attention decoding (AAD) identifies the attended speech stream in multi-speaker environments by decoding brain signals such as electroencephalography (EEG). This technology is essential for realizing smart hearing aids that address the cocktail party problem and for enabling objective audiometry systems. Most existing AAD research relies on dichotic environments, in which different speech signals are presented to the left and right ears, so models classify the attended direction rather than the attended speech content. However, this spatial reliance limits applicability to real-world scenarios, such as the "cocktail party" situation, where speakers overlap or move dynamically. To address this challenge, we propose an AAD framework for diotic environments, in which an identical speech mixture is presented to both ears, eliminating spatial cues. Our approach maps EEG and speech signals into a shared latent space using independent encoders: speech features are extracted with wav2vec 2.0 and encoded by a two-layer 1D convolutional neural network (CNN), while the BrainNetwork architecture encodes the EEG. The model identifies the attended speech by computing the cosine similarity between the EEG and speech representations. Evaluated on a diotic EEG dataset, our method achieves 72.70% accuracy, 22.58 percentage points higher than the state-of-the-art direction-based AAD method.
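To make the matching pipeline concrete, the following is a minimal PyTorch sketch, not the authors' implementation: it assumes wav2vec 2.0 is loaded through HuggingFace's Wav2Vec2Model as a frozen feature extractor, uses the stated two-layer 1D CNN as the speech encoder, and substitutes a hypothetical SimpleEEGEncoder for the BrainNetwork architecture, whose details are not given in this abstract.

```python
# Minimal sketch of the EEG-speech matching pipeline described above.
# Assumptions: wav2vec 2.0 via HuggingFace transformers; embedding size,
# channel counts, and the EEG encoder are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import Wav2Vec2Model


class SpeechEncoder(nn.Module):
    """Two-layer 1D CNN over frozen wav2vec 2.0 features -> fixed-size embedding."""
    def __init__(self, feat_dim=768, hidden=256, embed_dim=128):
        super().__init__()
        self.wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.wav2vec.requires_grad_(False)  # wav2vec 2.0 used as a frozen feature extractor
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, embed_dim, kernel_size=3, padding=1),
        )

    def forward(self, wav):                          # wav: (batch, samples) at 16 kHz
        feats = self.wav2vec(wav).last_hidden_state  # (batch, frames, 768)
        x = self.conv(feats.transpose(1, 2))         # (batch, embed_dim, frames)
        return x.mean(dim=-1)                        # temporal pooling -> (batch, embed_dim)


class SimpleEEGEncoder(nn.Module):
    """Hypothetical stand-in for the BrainNetwork EEG encoder."""
    def __init__(self, n_channels=64, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, 128, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Conv1d(128, embed_dim, kernel_size=7, padding=3),
        )

    def forward(self, eeg):                    # eeg: (batch, channels, time)
        return self.conv(eeg).mean(dim=-1)     # -> (batch, embed_dim)


def decode_attention(eeg_emb, emb_a, emb_b):
    """Pick the candidate speech stream whose embedding is closer to the EEG embedding."""
    sim_a = F.cosine_similarity(eeg_emb, emb_a, dim=-1)
    sim_b = F.cosine_similarity(eeg_emb, emb_b, dim=-1)
    return (sim_b > sim_a).long()              # 0 -> stream A attended, 1 -> stream B
```

Under this scheme, the two encoders would be trained so that an EEG embedding lands close, in cosine similarity, to the embedding of the attended stream (for example, with a contrastive objective); at test time, the candidate stream with the higher similarity is declared attended.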