Source separation is a crucial pre-processing step for various speech processing tasks, such as automatic speech recognition (ASR). Traditionally, evaluation metrics for speech separation rely on matched reference audio and corresponding transcriptions to assess audio quality and intelligibility. However, they cannot be used to evaluate real-world mixtures for which no reference exists. This paper introduces a text-free, reference-free evaluation framework based on self-supervised learning (SSL) representations. The proposed framework utilizes the mixture and separated tracks to jointly predict audio quality, through the Scale-Invariant Signal-to-Noise Ratio (SI-SNR) metric, and speech intelligibility, through the Word Error Rate (WER) metric. We conducted experiments on the WHAMR! dataset, which show a WER estimation with a mean absolute error (MAE) of 17\% and a Pearson correlation coefficient (PCC) of 0.77, and an SI-SNR estimation with an MAE of 1.38 and a PCC of 0.95. We further demonstrate the robustness of our estimator by using various SSL representations.
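As background for the SI-SNR target the estimator predicts, the metric projects the separated track onto the reference and compares the energies of the target and residual components; because of the projection, it is invariant to rescaling of the estimate. A minimal NumPy sketch (the function name and `eps` stabilizer are our own choices, not from the paper):

```python
import numpy as np

def si_snr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-Invariant SNR in dB between an estimated and a reference signal."""
    # Remove the mean so the measure is invariant to DC offsets.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference: the "target" component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    s_target = scale * reference
    # Everything orthogonal to the reference counts as distortion/noise.
    e_noise = estimate - s_target
    ratio = (np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps)
    return float(10.0 * np.log10(ratio))
```

Note that rescaling the estimate (e.g. `si_snr(3 * x, ref)`) leaves the score unchanged, since both the target and noise components scale by the same factor.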