FAConformer: Frequency-Aware Convolutional Transformer for Auditory Attention Decoding

Auditory attention decoding (AAD) aims to infer the attended speaker from neural responses in multi-speaker acoustic environments and is a key problem for neuro-steered hearing systems. Although recent studies have made encouraging progress, existing AAD models still do not fully exploit frequency-domain electroencephalography (EEG) information. In particular, most approaches introduce multi-band information through handcrafted feature extraction or direct cross-band feature concatenation. Both strategies use frequency information only at a shallow level and may overlook band-specific patterns and cross-band interactions. To address these limitations, this paper proposes FAConformer, a frequency-aware CNN-Transformer framework for AAD that explicitly integrates band-specific encoding and adaptive cross-band interaction. Specifically, FAConformer first decomposes EEG signals into multiple frequency bands and assigns each band to an independent CNN-Transformer encoder for band-specific modeling. The resulting band-wise features are then adaptively fused by a frequency-aware attention (FAA) module, which treats the band-wise features as tokens and models the dependencies among them. In addition, band-wise auxiliary supervision (BAS) is introduced to prevent weakly contributing branches from being under-optimized during joint training. In this way, FAConformer performs frequency-aware modeling that exploits frequency-domain information more effectively. Extensive experiments on two public AAD datasets with three decision-window lengths demonstrate that FAConformer consistently outperforms 12 competitive baselines, surpassing the current state-of-the-art model by 4.9%. Further analyses of band importance, ablation, and parameter sensitivity verify the effectiveness, robustness, and interpretability of the proposed framework. Code is available at https://github.com/wzwvv/FAConformer.

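The three components named in the abstract (band decomposition, attention over band-wise tokens, and band-wise auxiliary supervision) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, which is available in the linked repository. The band edges, the FFT-mask decomposition, the log-power features with a linear map that stand in for each CNN-Transformer encoder, the single-head attention, the mean pooling, and the `aux_weight` value are all assumptions made for illustration; every function and parameter name here is hypothetical.

```python
import numpy as np

# Hypothetical band edges in Hz; the abstract does not list the bands used.
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}


def split_bands(eeg, fs, bands=BANDS):
    """Split EEG of shape (channels, time) into band-limited copies by FFT masking."""
    n = eeg.shape[-1]
    spectrum = np.fft.rfft(eeg, axis=-1)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    out = []
    for lo, hi in bands.values():
        mask = (freqs >= lo) & (freqs < hi)
        out.append(np.fft.irfft(spectrum * mask, n=n, axis=-1))
    return np.stack(out)  # (bands, channels, time)


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def band_attention(tokens, wq, wk, wv):
    """Single-head self-attention in which each token is one band's feature vector."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (bands, bands)
    return weights @ v, weights


def cross_entropy(logits, label):
    return -np.log(softmax(logits)[label])


def init_params(rng, n_bands, n_channels, d=8, n_classes=2):
    def w(*shape):
        return 0.1 * rng.standard_normal(shape)
    return {
        "enc": [w(n_channels, d) for _ in range(n_bands)],       # one encoder per band
        "aux_heads": [w(d, n_classes) for _ in range(n_bands)],  # one auxiliary head per band
        "wq": w(d, d), "wk": w(d, d), "wv": w(d, d),
        "head": w(d, n_classes),
    }


def forward(eeg, fs, params, label, aux_weight=0.5):
    bands = split_bands(eeg, fs)
    # Stand-in for the per-band CNN-Transformer encoders (assumption):
    # log band power per channel followed by a band-specific linear map.
    feats = np.log(bands.var(axis=-1) + 1e-8)  # (bands, channels)
    tokens = np.stack([np.tanh(f @ w) for f, w in zip(feats, params["enc"])])
    # Fuse the band tokens with attention, then classify the pooled result.
    fused, weights = band_attention(tokens, params["wq"], params["wk"], params["wv"])
    main_loss = cross_entropy(fused.mean(axis=0) @ params["head"], label)
    # Auxiliary supervision: every band branch also gets its own classification loss.
    aux_loss = np.mean([cross_entropy(t @ h, label)
                        for t, h in zip(tokens, params["aux_heads"])])
    return main_loss + aux_weight * aux_loss, weights


rng = np.random.default_rng(0)
eeg = rng.standard_normal((4, 256))  # synthetic stand-in: 4 channels, 2 s at 128 Hz
params = init_params(rng, n_bands=len(BANDS), n_channels=4)
loss, band_weights = forward(eeg, fs=128, params=params, label=1)
```

In this sketch, `band_weights` is a bands-by-bands matrix whose rows sum to one, so each row shows how strongly one band attends to the others. The auxiliary term gives every branch a direct gradient signal even when the fused prediction relies mostly on a few bands, which is the motivation the abstract gives for BAS.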