Objective and subjective evaluation of speech enhancement methods in the UDASE task of the 7th CHiME challenge

Simon Leglaive, Matthieu Fraticelli, Hend ElGhazaly, Léonie Borne, Mostafa Sadeghi, Scott Wisdom, Manuel Pariente, John R. Hershey, Daniel Pressnitzer, Jon P. Barker

Supervised models for speech enhancement are trained using artificially generated mixtures of clean speech and noise signals. However, the synthetic training conditions may not accurately reflect real-world conditions encountered during testing. This discrepancy can result in poor performance when the test domain significantly differs from the synthetic training domain. To tackle this issue, the UDASE task of the 7th CHiME challenge aimed to leverage real-world noisy speech recordings from the test domain for unsupervised domain adaptation of speech enhancement models. Specifically, this test domain corresponds to the CHiME-5 dataset, characterized by real multi-speaker and conversational speech recordings made in noisy and reverberant domestic environments, for which ground-truth clean speech signals are not available. In this paper, we present the objective and subjective evaluations of the systems that were submitted to the CHiME-7 UDASE task, and we provide an analysis of the results. This analysis reveals a limited correlation between subjective ratings and several supervised nonintrusive performance metrics recently proposed for speech enhancement. Conversely, the results suggest that more traditional intrusive objective metrics can be used for in-domain performance evaluation using the reverberant LibriCHiME-5 dataset developed for the challenge. The subjective evaluation indicates that all systems successfully reduced the background noise, but always at the expense of increased distortion. Out of the four speech enhancement methods evaluated subjectively, only one demonstrated an improvement in overall quality compared to the unprocessed noisy speech, highlighting the difficulty of the task. The tools and audio material created for the CHiME-7 UDASE task are shared with the community.
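As an illustration of the kind of analysis mentioned above, the agreement between subjective ratings and an objective metric can be quantified with linear (Pearson) and rank (Spearman) correlation coefficients. The following is a minimal sketch under stated assumptions, not the analysis code used for the challenge: the function names are hypothetical, and the numbers are made-up placeholders rather than results from the CHiME-7 UDASE evaluation.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Placeholder values for illustration only (NOT challenge results):
    # one mean subjective rating and one objective score per condition.
    subjective_ratings = [3.1, 2.8, 3.4, 2.5, 3.0]
    objective_scores = [2.9, 3.2, 3.0, 2.4, 3.1]
    print("Pearson: ", round(pearson(subjective_ratings, objective_scores), 3))
    print("Spearman:", round(spearman(subjective_ratings, objective_scores), 3))
```

Reporting both coefficients is a common choice: Pearson measures how linearly a metric tracks the ratings, while Spearman only checks whether the metric ranks the conditions in the same order as the listeners.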
