Reconstructing natural speech from neural activity is vital for enabling direct communication via brain-computer interfaces. Previous efforts have converted neural recordings into speech using complex deep neural network (DNN) models trained on extensive neural recording data, an approach that is resource-intensive under typical clinical constraints. Achieving satisfactory reconstruction from limited-scale neural recordings has remained challenging, mainly because speech representations are complex and neural data are scarce. To overcome these challenges, we propose Neural2Speech, a novel transfer learning framework for neural-driven speech reconstruction that consists of two distinct training phases. First, a speech autoencoder is pre-trained on readily available speech corpora to decode speech waveforms from encoded speech representations. Second, a lightweight adaptor is trained on small-scale neural recordings to align neural activity with the speech representation for decoding. Neural2Speech demonstrates that neural-driven speech reconstruction is feasible with only 20 minutes of intracranial data, and it significantly outperforms existing baseline methods in terms of speech fidelity and intelligibility.
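The two training phases can be illustrated with a deliberately tiny, purely linear sketch. This is not the Neural2Speech implementation: the data are synthetic, every name (`pretrain_autoencoder`, `train_adaptor`, `reconstruct`, `SPEECH_DIR`, `NEURAL_DIR`) is hypothetical, and we assume that the adaptor is trained to regress the latent codes of a frozen encoder, a detail the abstract does not specify.

```python
# Toy two-phase transfer learning sketch (hypothetical names, synthetic data).

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Assumption: one hidden factor t drives both the 2-D "speech" frames
# and the 3-D "neural" features, so the two modalities can be aligned.
SPEECH_DIR = [1.0, 2.0]
NEURAL_DIR = [0.5, -1.0, 2.0]

def speech_frame(t):
    return [t * u for u in SPEECH_DIR]

def neural_frame(t):
    return [t * w for w in NEURAL_DIR]

def pretrain_autoencoder(frames, lr=0.05, steps=2000):
    """Phase 1: fit a linear autoencoder (2-D -> 1-D -> 2-D) on speech only."""
    enc, dec = [0.1, 0.1], [0.1, 0.1]
    n = len(frames)
    for _ in range(steps):
        g_enc, g_dec = [0.0, 0.0], [0.0, 0.0]
        for s in frames:
            z = dot(enc, s)                           # latent code
            resid = [dec[i] * z - s[i] for i in range(2)]
            back = dot(dec, resid)                    # error sent back to z
            for i in range(2):
                g_dec[i] += 2 * resid[i] * z / n
                g_enc[i] += 2 * back * s[i] / n
        enc = [enc[i] - lr * g_enc[i] for i in range(2)]
        dec = [dec[i] - lr * g_dec[i] for i in range(2)]
    return enc, dec

def train_adaptor(pairs, enc, lr=0.05, steps=500):
    """Phase 2: map neural features to the frozen encoder's latent codes."""
    adaptor = [0.0, 0.0, 0.0]
    n = len(pairs)
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for neural, speech in pairs:
            target = dot(enc, speech)                 # encoder is not updated
            err = dot(adaptor, neural) - target
            for i in range(3):
                grad[i] += 2 * err * neural[i] / n
        adaptor = [adaptor[i] - lr * grad[i] for i in range(3)]
    return adaptor

def reconstruct(neural, adaptor, dec):
    """Inference: neural features -> adaptor -> frozen decoder -> speech."""
    z = dot(adaptor, neural)
    return [d * z for d in dec]

# Phase 1 uses a larger speech-only corpus; phase 2 uses a few paired samples.
corpus = [speech_frame(i / 50 - 1) for i in range(101)]
enc, dec = pretrain_autoencoder(corpus)
paired_t = [-0.9, -0.6, -0.3, -0.1, 0.2, 0.4, 0.7, 1.0]
pairs = [(neural_frame(t), speech_frame(t)) for t in paired_t]
adaptor = train_adaptor(pairs, enc)

# Held-out neural sample; the reconstruction should land near speech_frame(0.5).
recon = reconstruct(neural_frame(0.5), adaptor, dec)
print(recon)
```

The sketch mirrors the data asymmetry the framework exploits: the decoder is learned from the larger speech-only corpus, so the scarce paired recordings only need to fit the small adaptor.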