Auditory attention decoding (AAD) aims to infer the attended speaker from neural responses in multi-speaker acoustic environments and is a key problem for neuro-steered hearing systems. Although recent studies have made encouraging progress, existing AAD models still do not fully exploit the frequency-domain information in electroencephalography (EEG) signals. In particular, most approaches introduce multi-band information through handcrafted feature extraction or direct cross-band feature concatenation, which mainly exploits frequency information at a shallow level and may overlook band-specific patterns and cross-band interactions. To address these limitations, this paper proposes FAConformer, a frequency-aware CNN-Transformer framework for AAD that explicitly integrates band-specific encoding and adaptive cross-band interaction. Specifically, FAConformer first decomposes EEG signals into multiple frequency bands and assigns each band to an independent CNN-Transformer encoder for band-specific modeling. The resulting band-wise features are then adaptively fused by a frequency-aware attention (FAA) module, which treats the band-wise features as tokens and models the dependencies among them. In addition, band-wise auxiliary supervision (BAS) is introduced to prevent weakly contributing branches from being under-optimized during joint training. In this way, FAConformer performs frequency-aware modeling that exploits frequency-domain information more effectively. Extensive experiments on two public AAD datasets with three decision-window lengths show that FAConformer consistently outperforms 12 competitive baselines, surpassing the current state-of-the-art model by 4.9%. Further analyses of band importance, ablation, and parameter sensitivity verify the effectiveness, robustness, and interpretability of the proposed framework. Code is available at https://github.com/wzwvv/FAConformer.
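To make the two central ideas concrete, the following is a minimal sketch of attention over band-wise tokens and of a joint loss with band-wise auxiliary terms. It is not the released implementation: the function names (`band_token_attention`, `joint_loss`), the absence of learned query/key/value projections, the mean pooling of attended tokens, and the `aux_weight` coefficient are all assumptions made for illustration, and the actual FAA and BAS designs may differ.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def band_token_attention(tokens):
    """Single-head self-attention over band tokens (illustrative, no learned weights).

    tokens: list of B feature vectors, one per frequency band, each of length D.
    Returns (fused, weights), where weights is the B x B attention matrix
    (each row sums to 1) and fused is the mean of the attended tokens.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    weights, attended = [], []
    for q in tokens:
        # Scaled dot-product score between this band and every band.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in tokens]
        row = softmax(scores)
        weights.append(row)
        # Attention-weighted sum of all band tokens.
        attended.append([sum(w * v[j] for w, v in zip(row, tokens)) for j in range(d)])
    # Assumed pooling: average the attended tokens into one fused feature.
    fused = [sum(tok[j] for tok in attended) / len(attended) for j in range(d)]
    return fused, weights


def cross_entropy(logits, label):
    """Cross-entropy of one sample given raw class logits."""
    return -math.log(softmax(logits)[label])


def joint_loss(fused_logits, band_logits, label, aux_weight=0.5):
    """Main loss on the fused prediction plus averaged per-band auxiliary losses.

    aux_weight is a hypothetical coefficient, not a value taken from the paper.
    """
    main = cross_entropy(fused_logits, label)
    aux = sum(cross_entropy(l, label) for l in band_logits) / len(band_logits)
    return main + aux_weight * aux


# Toy example: three bands, two-dimensional features, two speakers.
band_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = band_token_attention(band_features)
loss = joint_loss([2.0, 0.0], [[1.0, 0.0], [0.0, 0.0], [0.5, 0.0]], label=0)
```

The auxiliary terms give each band branch its own gradient signal, so a branch that contributes little to the fused prediction is still trained against the label. That is the motivation stated for BAS, although the exact weighting used in the paper is not specified here.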