Single-channel speech enhancement models face significant performance degradation in extremely noisy environments. While prior work has shown that complementary bone-conducted speech can guide enhancement, effective integration of this noise-immune modality remains a challenge. This paper introduces a novel multimodal speech enhancement framework that integrates bone-conduction sensors with air-conducted microphones using a conditional diffusion model. Our proposed model significantly outperforms previously established multimodal techniques and a powerful diffusion-based single-modal baseline across a wide range of acoustic conditions.