DARNet: Dual Attention Refinement Network with Spatiotemporal Construction for Auditory Attention Detection

At a cocktail party, humans exhibit an impressive ability to direct their auditory attention. Auditory attention detection (AAD) aims to identify the attended speaker by analyzing brain signals such as electroencephalography (EEG) signals. However, current AAD algorithms overlook the spatial distribution information within EEG signals and cannot capture long-range latent dependencies, which limits their ability to decode brain activity. To address these issues, this paper proposes DARNet, a dual attention refinement network with spatiotemporal construction for AAD. DARNet consists of a spatiotemporal construction module, a dual attention refinement module, and a feature fusion & classifier module. Specifically, the spatiotemporal construction module builds more expressive spatiotemporal feature representations by capturing the spatial distribution characteristics of EEG signals. The dual attention refinement module extracts temporal patterns at different levels of EEG signals and enhances the model's ability to capture long-range latent dependencies. The feature fusion & classifier module aggregates the temporal patterns and dependencies from these different levels and produces the final classification result. Experimental results on the DTU dataset show that, compared with state-of-the-art models, DARNet improves average classification accuracy by 5.9% for 0.1 s, 4.6% for 1 s, and 3.9% for 2 s time windows. While maintaining excellent classification performance, DARNet also requires 91% fewer parameters than the state-of-the-art models. Code is available at: https://github.com/fchest/DARNet.git.
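
The three-module pipeline described above can be illustrated with a toy forward pass. This is a minimal, dependency-free sketch under stated assumptions, not the DARNet implementation (the linked repository holds that). The module internals are hypothetical stand-ins chosen for illustration. The spatiotemporal construction step is a per-channel temporal convolution followed by a channel-mixing projection. The two attention levels are single-head self-attention layers, each followed by a max-pooling "refinement" that halves the time axis. Fusion concatenates the time-averaged features from both levels before a linear classifier. All sizes, the function names (`spatiotemporal_construction`, `refine`, `darnet_like_forward`, etc.), the random weights, the synthetic input, and the two-class output are likewise assumptions.

```python
import math
import random

# Hypothetical toy sizes (assumptions, not the paper's settings):
# EEG channels, samples per window, feature dimension, odd temporal kernel length.
C, T, D, K = 4, 16, 8, 5
rng = random.Random(0)  # fixed seed so the random stand-in weights are reproducible


def rand_matrix(rows, cols):
    # Random weights stand in for learned parameters.
    return [[rng.gauss(0.0, 0.3) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]


def spatiotemporal_construction(eeg, kernel, spatial_w):
    """Per-channel temporal convolution (zero 'same' padding), then a
    channel-mixing projection C -> D at every time step. Returns T x D."""
    pad = len(kernel) // 2
    filtered = []
    for ch in eeg:
        p = [0.0] * pad + ch + [0.0] * pad
        filtered.append([sum(k * p[t + i] for i, k in enumerate(kernel))
                         for t in range(len(ch))])
    time_major = [list(step) for step in zip(*filtered)]  # T x C
    return matmul(time_major, spatial_w)                  # T x D


def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over time, plus a residual."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        w = softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k])
        out.append([sum(wj * vj[d] for wj, vj in zip(w, v)) for d in range(len(v[0]))])
    return [[a + b for a, b in zip(xi, oi)] for xi, oi in zip(x, out)]


def refine(x):
    """Hypothetical refinement step: max-pool adjacent time steps (T -> T // 2)."""
    return [[max(a, b) for a, b in zip(x[t], x[t + 1])] for t in range(0, len(x) - 1, 2)]


def mean_over_time(x):
    return [sum(col) / len(x) for col in zip(*x)]


def darnet_like_forward(eeg, params):
    h0 = spatiotemporal_construction(eeg, params["kernel"], params["spatial"])
    h1 = refine(self_attention(h0, *params["attn1"]))  # level 1: T/2 steps
    h2 = refine(self_attention(h1, *params["attn2"]))  # level 2: T/4 steps
    fused = mean_over_time(h1) + mean_over_time(h2)    # concatenate levels: 2*D values
    logits = [sum(f * w for f, w in zip(fused, col)) for col in zip(*params["classifier"])]
    return softmax(logits)  # probabilities for two candidate speakers (assumed)


params = {
    "kernel": [rng.gauss(0.0, 0.3) for _ in range(K)],
    "spatial": rand_matrix(C, D),
    "attn1": [rand_matrix(D, D) for _ in range(3)],
    "attn2": [rand_matrix(D, D) for _ in range(3)],
    "classifier": rand_matrix(2 * D, 2),
}
# Synthetic C x T "EEG" window (not real data).
eeg = [[math.sin(0.3 * t * (c + 1)) for t in range(T)] for c in range(C)]
probs = darnet_like_forward(eeg, params)
print(probs)
```

Passing features from both attention levels to the classifier is one simple way to realize "aggregating temporal patterns and dependencies from different levels"; a trained model would learn these weights rather than sample them.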
