Reconstructing the speech audio envelope from scalp neural recordings (EEG) is a central task for decoding a listener's attentional focus in applications such as neuro-steered hearing aids. Existing reconstruction methods, however, struggle with limited fidelity and robustness to noise. Prevailing approaches treat the task as a static regression problem, processing each EEG window in isolation and ignoring the rich temporal structure inherent in continuous speech. This study introduces a dynamic framework for envelope reconstruction that exploits this structure as a predictive temporal prior. We propose a state-space fusion model that combines direct neural estimates from EEG with predictions from recent speech context, using a learned gating mechanism to adaptively balance the two cues. To validate this approach, we evaluate our model on the ICASSP 2023 Stimulus Reconstruction benchmark, demonstrating significant improvements over static, EEG-only baselines. Our analyses reveal a strong synergy between the neural and temporal information streams. Ultimately, this work reframes envelope reconstruction not as a simple input-output mapping but as a dynamic state-estimation problem, opening a new direction for developing more accurate and coherent neural decoding systems.
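As a rough illustration of the gated-fusion idea described above, the PyTorch sketch below combines a per-step EEG decoder with an autoregressive predictor over recent envelope context, blending the two through a learned sigmoid gate. The module name GatedEnvelopeFusion, the GRU-based context predictor, and all layer choices and dimensions are our own illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class GatedEnvelopeFusion(nn.Module):
    """Sketch of gated state-space-style fusion: a direct EEG-to-envelope
    regressor and a temporal prior over the recent envelope, combined per
    time step by a learned gate. Sizes and layers are illustrative only."""

    def __init__(self, n_eeg_channels: int = 64, ctx_hidden: int = 32):
        super().__init__()
        # Direct neural estimate: linear map from the EEG channels at each step.
        self.eeg_decoder = nn.Linear(n_eeg_channels, 1)
        # Predictive temporal prior: GRU over the recent envelope context.
        self.ctx_rnn = nn.GRU(input_size=1, hidden_size=ctx_hidden, batch_first=True)
        self.ctx_head = nn.Linear(ctx_hidden, 1)
        # Gate decides, per step, how much to trust EEG vs. the temporal prior.
        self.gate = nn.Linear(n_eeg_channels + ctx_hidden, 1)

    def forward(self, eeg: torch.Tensor, prev_env: torch.Tensor) -> torch.Tensor:
        # eeg:      (batch, time, channels)  scalp recordings
        # prev_env: (batch, time, 1)         recent envelope, shifted by one step
        e_hat = self.eeg_decoder(eeg)          # direct neural estimate
        ctx, _ = self.ctx_rnn(prev_env)        # features of the speech context
        p_hat = self.ctx_head(ctx)             # prediction from the temporal prior
        g = torch.sigmoid(self.gate(torch.cat([eeg, ctx], dim=-1)))
        return g * e_hat + (1.0 - g) * p_hat   # adaptively fused envelope

# Toy usage: 8 windows of 640 samples, 64 EEG channels, teacher-forced context.
model = GatedEnvelopeFusion()
eeg = torch.randn(8, 640, 64)
prev = torch.randn(8, 640, 1)
envelope = model(eeg, prev)  # (8, 640, 1)
```

A convex gate of this form lets the decoder fall back on the temporal prior when the instantaneous EEG evidence is noisy, while still tracking the neural estimate when it is reliable, which is one plausible reading of the adaptive balancing the abstract describes.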