Silent Speech Interfaces (SSIs) offer a noninvasive alternative to brain-computer interfaces for soundless verbal communication. We introduce Multimodal Orofacial Neural Audio (MONA), a system that trains a multimodal model with a shared latent representation by aligning modalities through two novel loss functions: cross-contrast (crossCon) and supervised temporal contrast (supTcon). This architecture enables the use of audio-only datasets such as LibriSpeech to improve silent speech recognition. In addition, we introduce Large Language Model (LLM) Integrated Scoring Adjustment (LISA), which significantly improves recognition accuracy. Together, MONA LISA reduces the state-of-the-art word error rate (WER) from 28.8% to 12.2% on the Gaddy (2020) benchmark dataset for open-vocabulary silent speech. For vocal EMG recordings, our method improves the state of the art from 23.3% to 3.7% WER. In the Brain-to-Text 2024 competition, LISA performs best, improving the top WER from 9.8% to 8.9%. To the best of our knowledge, this work is the first in which noninvasive silent speech recognition on an open vocabulary has cleared the threshold of 15% WER, showing that SSIs can be a viable alternative to automatic speech recognition (ASR). Our work not only narrows the performance gap between silent and vocalized speech but also opens new possibilities in human-computer interaction, demonstrating the potential of cross-modal approaches in noisy and data-limited regimes.
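Because every result above is reported as a word error rate, a minimal sketch of how WER is conventionally computed may help in reading the numbers. It assumes the standard definition (word-level edit distance divided by the number of reference words) and plain whitespace tokenization. The function names `word_error_rate` and `relative_reduction` are illustrative and are not taken from the MONA LISA code, and the text normalization used in the benchmarks may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are obtained by whitespace splitting; no casing or
    punctuation normalization is applied here.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(old_wer: float, new_wer: float) -> float:
    """Fraction by which the error rate dropped, relative to the old value."""
    return (old_wer - new_wer) / old_wer


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    # Relative reduction for the silent speech result quoted above.
    print(relative_reduction(28.8, 12.2))
```

Under this definition, the drop from 28.8% to 12.2% WER on silent speech corresponds to a relative error reduction of roughly 58%.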