Improving Speech Decoding from ECoG with Self-Supervised Pretraining

Recent work on intracranial brain-machine interfaces has demonstrated that spoken speech can be decoded with high accuracy, essentially by treating the problem as an instance of supervised learning and training deep neural networks to map from neural activity to text. However, such networks pay for their expressiveness with a need for very large amounts of labeled data, a requirement that is particularly burdensome for invasive neural recordings acquired from human patients. On the other hand, these patients typically produce speech outside of the experimental blocks used for training decoders. Making use of such data, and of data from other patients, to improve decoding would ease the burden of data collection -- a burden that is especially onerous for dys- and anarthric patients. Here we demonstrate that this is possible, by reengineering wav2vec -- a simple, self-supervised, fully convolutional model that learns latent representations of audio using a noise-contrastive loss -- for electrocorticographic (ECoG) data. We train this model on unlabeled ECoG recordings, then use it to transform ECoG from labeled speech sessions into wav2vec's representation space, and finally train a supervised encoder-decoder to map these representations to text. We experiment with various numbers of labeled blocks; for almost all choices, the new representations yield better decoding performance than the original ECoG data, and in no case do they yield worse performance. Performance can also be improved in some cases by pretraining wav2vec on another patient's data. In the best cases, wav2vec's representations decrease word error rates relative to the original data by upwards of 50%.
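
For readers unfamiliar with the metric, the following is a minimal sketch of how a word error rate and a percentage decrease in it can be computed. It uses the standard definition (word-level edit distance divided by the number of reference words) and reads the reported decrease as a relative reduction; both are assumptions, since the abstract does not spell out either. The function names and the example sentences are hypothetical and are not taken from the study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer (assumed relative)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Made-up sentences, for illustration only.
    wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
    print(f"WER: {wer:.3f}")
    print(f"Relative reduction: {relative_reduction(0.4, 0.2):.0%}")
```

Under this reading, a decoder whose word error rate falls from 0.4 on the original ECoG data to 0.2 on the pretrained representations would show a 50% decrease.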
