In complex auditory environments, the human auditory system possesses the remarkable ability to focus on a specific speaker while disregarding others. In this study, a new model named SWIM, a short-window convolutional neural network (CNN) integrated with Mamba, is proposed for identifying the locus of auditory attention (left or right) from electroencephalography (EEG) signals without relying on speech envelopes. SWIM consists of two parts. The first is a short-window CNN (SW$_\text{CNN}$), which acts as a short-term EEG feature extractor and achieves a final accuracy of 84.9% in the leave-one-speaker-out setup on the widely used KUL dataset. This improvement is due to an improved CNN structure, data augmentation, multitask training, and model combination. The second part, Mamba, is a sequence model applied here for the first time to auditory spatial attention decoding, leveraging long-term dependencies across previous SW$_\text{CNN}$ time steps. By jointly training SW$_\text{CNN}$ and Mamba, the proposed SWIM structure exploits both short-term and long-term information and achieves an accuracy of 86.2%, reducing classification errors by 31.0% relative to the previous state-of-the-art result. The source code is available at https://github.com/windowso/SWIM-ASAD.
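The two-stage pipeline described above can be sketched in a minimal, self-contained way. The sketch below is illustrative only: all shapes, kernel sizes, and weights are assumptions, the SW$_\text{CNN}$ stand-in is a single spatio-temporal convolution with pooling rather than the paper's actual architecture, and the Mamba stage is approximated by a simple diagonal linear state-space recurrence (the core primitive Mamba builds on), not the real selective-scan block.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 64 EEG channels, 10 s at 128 Hz, 1 s short windows.
C, T, W = 64, 1280, 128
eeg = rng.standard_normal((C, T))

def sw_cnn(window, kernels):
    """Stand-in for SW_CNN: spatio-temporal convolution over time,
    ReLU, then global average pooling -> one feature per kernel."""
    feats = []
    for k in kernels:                           # k: (C, L) kernel
        L = k.shape[1]
        # valid convolution across time, summed over all channels
        out = np.array([np.sum(window[:, t:t + L] * k)
                        for t in range(window.shape[1] - L + 1)])
        feats.append(np.maximum(out, 0.0).mean())   # ReLU + avg pool
    return np.array(feats)

F = 8                                           # number of kernels (assumed)
kernels = [rng.standard_normal((C, 9)) * 0.05 for _ in range(F)]

# Short-term features, one vector per 1 s window
feats = np.stack([sw_cnn(eeg[:, s:s + W], kernels)
                  for s in range(0, T, W)])     # (num_windows, F)

# Mamba stand-in: diagonal linear state-space recurrence
# h_t = a * h_{t-1} + b * x_t, accumulating long-term context.
a = rng.uniform(0.8, 0.99, F)                   # per-feature decay (assumed)
b = 1.0 - a
h = np.zeros(F)
for x in feats:
    h = a * h + b * x

# Linear read-out to left/right attention logits
W_out = rng.standard_normal((2, F)) * 0.1
logits = W_out @ h
pred = "left" if logits[0] > logits[1] else "right"
```

In the actual model both stages are trained jointly end-to-end, so the convolutional features are shaped by the long-term sequence objective rather than fixed as in this sketch.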