Reconstructing continuous speech from non-invasive neural recordings is a fundamental problem for probing human auditory perception and for building safe, scalable speech brain-computer interfaces. Despite recent progress, intelligible reconstruction remains elusive because non-invasive recordings are inherently noisy, spatially blurred, and preserve only part of the information about perceived speech. Existing methods map neural activity directly to entangled speech representations and then synthesize waveforms with neural vocoders, which yields output that is spectrally similar to the target but unintelligible. To overcome these limitations, we introduce MindVoice, a neuro-to-speech reconstruction framework that uses pretrained models to compensate for the incomplete semantic and acoustic information in neural recordings. MindVoice disentangles reconstruction into two complementary pathways: one recovers high-level semantic content, while the other estimates fine-grained acoustic attributes. The inferred representations are then fused with powerful speech generation models and in-context voice cloning to synthesize natural, intelligible utterances. Extensive experiments on EEG and MEG demonstrate that MindVoice substantially outperforms existing methods across multiple metrics. These results show that pretrained priors provide a principled way to bridge the gap between noisy neural recordings and natural speech, and they point to a promising direction for auditory neuroscience research and non-invasive speech brain-computer interfaces.
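The two-pathway decomposition described in the abstract can be illustrated with a toy sketch. This is not the MindVoice implementation: the abstract does not specify the decoders, so closed-form ridge regression is swapped in as a stand-in, the data are synthetic, and all names (`fit_ridge`, `decode_two_pathways`, the feature dimensions) are hypothetical. The pretrained speech generation and voice cloning stage is deliberately left out; the sketch only shows neural features being mapped to separate semantic and acoustic estimates rather than to one entangled target.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^{-1} X^T Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)


def decode_two_pathways(X, W_sem, W_ac):
    """Map neural features to separate semantic and acoustic estimates."""
    return {"semantic": X @ W_sem, "acoustic": X @ W_ac}


# Synthetic stand-ins (assumptions, not real EEG/MEG data or real embeddings).
rng = np.random.default_rng(0)
n_trials, n_neural, n_sem, n_ac = 200, 32, 8, 4
X = rng.standard_normal((n_trials, n_neural))
Y_sem = X @ rng.standard_normal((n_neural, n_sem))
Y_sem += 0.1 * rng.standard_normal((n_trials, n_sem))
Y_ac = X @ rng.standard_normal((n_neural, n_ac))
Y_ac += 0.1 * rng.standard_normal((n_trials, n_ac))

# One linear decoder per pathway, fitted independently.
W_sem = fit_ridge(X, Y_sem)
W_ac = fit_ridge(X, Y_ac)

# Decode a few trials; in a full system these two estimates would
# condition a pretrained speech generator, which is omitted here.
out = decode_two_pathways(X[:5], W_sem, W_ac)
sem_mse = float(np.mean((X @ W_sem - Y_sem) ** 2))
```

Keeping the two decoders separate means each target can be evaluated and regularized on its own, which is the property the abstract attributes to disentangling semantic content from acoustic attributes.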