Invasive brain-computer interfaces based on electrocorticography (ECoG) have shown promise for high-performance speech decoding in medical applications, but less damaging methods such as intracranial stereo-electroencephalography (sEEG) remain underexplored. With rapid advances in representation learning, leveraging abundant recordings to enhance speech decoding is increasingly attractive. However, popular methods often pre-train temporal models based on brain-level tokens, overlooking that brain activity in different regions is highly desynchronized during tasks. Alternatively, they pre-train spatial-temporal models based on channel-level tokens but do not evaluate them on challenging tasks like speech decoding, which requires intricate processing in specific language-related areas. To address these limitations, we collected a well-annotated Chinese word-reading sEEG dataset targeting language-related brain networks from 12 subjects. Using this benchmark, we developed the Du-IN model, which extracts contextual embeddings based on region-level tokens through discrete codex-guided mask modeling. Our model achieves state-of-the-art performance on the 61-word classification task, surpassing all baselines. Model comparisons and ablation studies reveal that our design choices significantly contribute to this performance: (i) temporal modeling based on region-level tokens, using 1D depthwise convolution to fuse channels in the ventral sensorimotor cortex (vSMC) and superior temporal gyrus (STG), and (ii) self-supervision through discrete codex-guided mask modeling. Overall, our approach, which is inspired by neuroscience findings and capitalizes on region-level representations from specific brain regions, is suitable for invasive brain modeling and represents a promising neuro-inspired AI approach in brain-computer interfaces.