Reconstructing natural speech from neural activity is vital for enabling direct communication via brain-computer interfaces. Previous efforts have converted neural recordings into speech using complex deep neural network (DNN) models trained on extensive neural recording data, an approach that is resource-intensive under regular clinical constraints. Achieving satisfactory performance with limited-scale neural recordings, however, has remained challenging, mainly because of the complexity of speech representations and the constraints on neural data. To overcome these challenges, we propose Neural2Speech, a novel transfer learning framework for neural-driven speech reconstruction that consists of two distinct training phases. First, a speech autoencoder is pre-trained on readily available speech corpora to decode speech waveforms from encoded speech representations. Second, a lightweight adaptor is trained on the small-scale neural recordings to align neural activity with the speech representation for decoding. Remarkably, Neural2Speech demonstrates the feasibility of neural-driven speech reconstruction with only 20 minutes of intracranial data, and it significantly outperforms existing baseline methods in speech fidelity and intelligibility.
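The two training phases can be illustrated with a deliberately simplified sketch. It is not the Neural2Speech architecture: a linear autoencoder fitted by PCA stands in for the speech autoencoder, ridge regression stands in for the lightweight adaptor, and all data are synthetic. Every name, dimension, and hyperparameter below is an assumption made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a large unpaired speech corpus, a small paired set.
n_speech, n_paired = 2000, 200
d_speech, d_latent, d_neural = 40, 8, 64

# Synthetic "speech features" that lie near a low-dimensional subspace.
basis = rng.normal(size=(d_latent, d_speech))
speech_corpus = rng.normal(size=(n_speech, d_latent)) @ basis
speech_corpus += 0.01 * rng.normal(size=speech_corpus.shape)

# Phase 1: pre-train the autoencoder on speech only (here: PCA via SVD).
speech_mean = speech_corpus.mean(axis=0)
_, _, vt = np.linalg.svd(speech_corpus - speech_mean, full_matrices=False)
components = vt[:d_latent]  # frozen after this point


def encode(x):
    """Map speech features to the latent speech representation."""
    return (x - speech_mean) @ components.T


def decode(z):
    """Map latent speech representations back to speech features."""
    return z @ components + speech_mean


# Small paired dataset: synthetic "neural" features driven by the speech.
paired_speech = rng.normal(size=(n_paired, d_latent)) @ basis
mixing = rng.normal(size=(d_speech, d_neural))
neural = paired_speech @ mixing + 0.01 * rng.normal(size=(n_paired, d_neural))

train, test = slice(0, 150), slice(150, n_paired)

# Phase 2: fit only the adaptor (ridge regression from neural features to
# the frozen encoder's latents); the autoencoder itself is not updated.
x_mean = neural[train].mean(axis=0)
z_train = encode(paired_speech[train])
z_mean = z_train.mean(axis=0)
xc, zc = neural[train] - x_mean, z_train - z_mean
ridge_lambda = 1.0  # assumed regularisation strength
adaptor = np.linalg.solve(xc.T @ xc + ridge_lambda * np.eye(d_neural), xc.T @ zc)

# Inference: neural activity -> adaptor -> frozen decoder -> speech features.
z_pred = (neural[test] - x_mean) @ adaptor + z_mean
reconstruction = decode(z_pred)
rel_err = np.linalg.norm(reconstruction - paired_speech[test]) / np.linalg.norm(
    paired_speech[test]
)
print(f"relative reconstruction error on held-out pairs: {rel_err:.4f}")
```

The point of the split is that only the adaptor's parameters depend on the scarce paired recordings, while the decoder is learned entirely from the larger speech-only corpus.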