Speech enhancement is crucial in human-computer interaction, especially for ubiquitous devices. Ultrasound-based speech enhancement has emerged as an attractive choice because of its ubiquity and strong performance. However, audio-ultrasound data acquisition inevitably suffers from interference from unintended sources, so existing solutions rely heavily on human effort for data collection and processing. This results in severe data scarcity that limits the full potential of ultrasound-based speech enhancement. To address this, we propose USpeech, a cross-modal ultrasound synthesis framework for speech enhancement that requires minimal human effort. At its core is a two-stage framework that establishes correspondence between the visual and ultrasonic modalities by using audible audio as a bridge, overcoming both the lack of paired video-ultrasound datasets and the inherent heterogeneity between video and ultrasound data. Our framework incorporates contrastive video-audio pre-training to project the modalities into a shared semantic space and employs an audio-ultrasound encoder-decoder for ultrasound synthesis. We then present a speech enhancement network that enhances speech in the time-frequency domain and recovers the clean speech waveform via a neural vocoder. Comprehensive experiments show that USpeech with synthetic ultrasound data performs comparably to USpeech with physical ultrasound data, and that both significantly outperform state-of-the-art ultrasound-based speech enhancement baselines. USpeech is open-sourced at https://github.com/aiot-lab/USpeech/.
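To make the contrastive video-audio pre-training step concrete, below is a minimal sketch of a symmetric InfoNCE objective that pulls paired video and audio embeddings together in a shared semantic space. This is not the authors' implementation; the encoders, batch size, and embedding dimension are illustrative assumptions.

```python
# Minimal sketch (assumptions, not USpeech's actual code): symmetric
# InfoNCE contrastive loss over a batch of paired video/audio embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb, audio_emb, temperature=0.07):
    """Matching pairs lie on the diagonal; all other pairs act as negatives."""
    v = F.normalize(video_emb, dim=-1)           # (B, D) unit vectors
    a = F.normalize(audio_emb, dim=-1)           # (B, D)
    logits = v @ a.t() / temperature             # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2a = F.cross_entropy(logits, targets)  # video -> audio retrieval
    loss_a2v = F.cross_entropy(logits.t(), targets)  # audio -> video retrieval
    return 0.5 * (loss_v2a + loss_a2v)

# Example: a batch of 8 clips embedded into a 512-d shared space by
# hypothetical video/audio encoders (random tensors stand in here).
video_emb = torch.randn(8, 512)
audio_emb = torch.randn(8, 512)
print(contrastive_loss(video_emb, audio_emb).item())
```

Similarly, the enhancement stage operates on a time-frequency representation before a vocoder recovers the waveform. The sketch below illustrates the general masking-then-reconstruction pattern only: the constant mask stands in for a learned enhancement network, and Griffin-Lim stands in for the neural vocoder used in the paper.

```python
# Minimal sketch (assumptions): time-frequency masking followed by
# waveform recovery. Griffin-Lim is a stand-in for the neural vocoder.
import torch
import torchaudio

n_fft, hop = 512, 128
spec = torchaudio.transforms.Spectrogram(n_fft=n_fft, hop_length=hop, power=1.0)
gl = torchaudio.transforms.GriffinLim(n_fft=n_fft, hop_length=hop, power=1.0)

noisy = torch.randn(1, 16000)                # 1 s of placeholder audio, 16 kHz
mag = spec(noisy)                            # (1, n_fft//2 + 1, frames)
mask = torch.sigmoid(torch.zeros_like(mag))  # stand-in for a learned mask
enhanced = gl(mask * mag)                    # waveform from masked magnitude
```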