Speech brain-computer interfaces (BCIs) aim to restore communication for people with paralysis by translating neural activity into text. Most systems use cascaded frameworks that first decode phonemes and then assemble sentences with an n-gram language model (LM), which prevents joint optimization of all stages. Here, we introduce an end-to-end Brain-to-Text (BIT) framework that translates neural activity into coherent sentences using a single differentiable neural network. Central to our approach is a cross-task, cross-species pretrained neural encoder whose representations transfer to both attempted and imagined speech. In a cascaded setting with an n-gram LM, the pretrained encoder establishes a new state of the art (SOTA) on the Brain-to-Text '24 and '25 benchmarks. Integrated end-to-end with audio large language models (LLMs) and trained with contrastive learning for cross-modal alignment, BIT lowers the word error rate (WER) from the 24.69% of the prior end-to-end method to 10.22%. Notably, we find that small-scale audio LLMs markedly improve end-to-end decoding. Beyond record-setting performance, BIT aligns attempted and imagined speech embeddings to enable cross-task generalization. Altogether, our approach advances the integration of large, diverse neural datasets, paving the way for an end-to-end decoding framework that supports seamless, differentiable optimization.
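For readers unfamiliar with the headline metric, the following is a minimal sketch of how a word error rate is conventionally computed: the word-level Levenshtein distance (substitutions, insertions, deletions) divided by the number of reference words. It is not the benchmark's official scoring script; the function names `word_error_rate` and `relative_reduction` are hypothetical, and whitespace tokenization without text normalization is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace and compared exactly,
    with no case folding or punctuation stripping.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(old: float, new: float) -> float:
    """Fraction by which `new` improves on `old` (both in the same units)."""
    return (old - new) / old


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Relative WER reduction implied by the figures quoted in the abstract.
    print(relative_reduction(24.69, 10.22))
```

Under this definition, the reported drop from 24.69% to 10.22% corresponds to a relative WER reduction of about 58.6%.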