Speech restoration through silent speech interfaces (SSIs) has emerged as a promising assistive technology for individuals with impaired or absent laryngeal voice production. Among non-invasive SSI modalities, surface electromyography (sEMG) and video-based lipreading provide complementary articulatory information, yet their integration for continuous speech synthesis remains underexplored. Moreover, existing multimodal approaches rarely address robustness to modality degradation or temporary sensor failure, which limits their applicability in realistic scenarios. In this work, we propose a masked multimodal speech synthesis framework that jointly leverages sEMG and lipreading signals by applying modality masking during training. In multispeaker settings, the proposed approach reduces word error rate by up to 14 absolute percentage points relative to the strongest unimodal baseline. Experimental results show that masking strategies are critical both for these performance gains and for robustness under low-bitrate conditions, and that they generalize better than degradation-specific data augmentations when a modality is absent. Phone-level analyses further reveal complementary contributions across modalities, with particularly strong benefits for vowels and for specific consonant groups. Overall, these findings demonstrate the effectiveness and robustness of masked multimodal integration for silent speech synthesis, although adaptation to laryngectomized speakers remains an open research challenge.
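To make the idea of modality masking during training concrete, the following is a minimal sketch only. The abstract does not specify the masking scheme, so everything here is an assumption made for illustration: the function name `mask_modalities`, the masking probabilities, the choice to zero-fill a masked stream, and the rule that at most one modality is masked per training example.

```python
import random


def mask_modalities(emg, lip, p_mask_emg=0.25, p_mask_lip=0.25, rng=random):
    """Randomly zero out at most one modality of a training example.

    emg, lip: lists of feature frames (each frame is a list of floats).
    p_mask_emg, p_mask_lip: hypothetical masking probabilities (assumed values).
    Returns (emg_out, lip_out, dropped), where dropped is "emg", "lip", or None.
    """
    if p_mask_emg < 0 or p_mask_lip < 0 or p_mask_emg + p_mask_lip > 1.0:
        raise ValueError("masking probabilities must be >= 0 and sum to <= 1")

    # A single draw decides which modality (if any) is masked, so both
    # streams can never be removed from the same example.
    u = rng.random()
    if u < p_mask_emg:
        dropped = "emg"
    elif u < p_mask_emg + p_mask_lip:
        dropped = "lip"
    else:
        dropped = None

    def zeros_like(frames):
        # Zero-filling is an assumed stand-in for an absent sensor stream.
        return [[0.0] * len(frame) for frame in frames]

    def copy_of(frames):
        return [list(frame) for frame in frames]

    emg_out = zeros_like(emg) if dropped == "emg" else copy_of(emg)
    lip_out = zeros_like(lip) if dropped == "lip" else copy_of(lip)
    return emg_out, lip_out, dropped


# Usage: apply the mask to each example before it is passed to the model.
example_emg = [[0.1, 0.2], [0.3, 0.4]]
example_lip = [[1.0, 2.0, 3.0]]
masked_emg, masked_lip, dropped = mask_modalities(
    example_emg, example_lip, rng=random.Random(0)
)
```

Under these assumptions the model sees sEMG-only, lip-only, and joint inputs during training, which is one plausible way to encourage robustness when a sensor drops out at inference time.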