Recent work in speech enhancement has explored the use of self-supervised speech representations to aid the training of neural speech enhancement models. However, much of this work focuses on the deepest or final outputs of self-supervised speech representation models rather than the earlier feature encodings. The use of self-supervised representations in this way is often not fully motivated. In this work, it is shown that the distance between the feature encodings of clean and noisy speech correlates strongly with psychoacoustically motivated measures of speech quality and intelligibility, as well as with human Mean Opinion Score (MOS) ratings. Experiments using this distance as a loss function are performed, and improved performance over an STFT spectrogram distance-based loss, as well as over other common loss functions from the speech enhancement literature, is demonstrated using objective measures such as perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI).
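The idea of a feature-encoding distance can be illustrated with a minimal sketch. It is not the implementation used in this work: the pretrained self-supervised feature encoder is replaced by a hypothetical toy encoder (a few fixed random kernels applied as a strided 1-D convolution with ReLU), and the names `toy_feature_encoder` and `feature_distance`, the kernel sizes, the stride, and the choice of a mean squared distance are all assumptions made for illustration.

```python
import math
import random


def toy_feature_encoder(signal, kernels, stride):
    """Hypothetical stand-in for a pretrained feature encoder.

    Applies each fixed kernel as a strided 1-D convolution followed by ReLU
    and returns a list of frames, each holding one value per kernel.
    """
    width = len(kernels[0])
    frames = []
    for start in range(0, len(signal) - width + 1, stride):
        window = signal[start:start + width]
        frame = [max(0.0, sum(w * x for w, x in zip(kernel, window)))
                 for kernel in kernels]
        frames.append(frame)
    return frames


def feature_distance(clean, degraded, kernels, stride=4):
    """Mean squared distance between the encodings of two equal-length signals."""
    if len(clean) != len(degraded):
        raise ValueError("signals must have the same length")
    enc_clean = toy_feature_encoder(clean, kernels, stride)
    enc_degraded = toy_feature_encoder(degraded, kernels, stride)
    total, count = 0.0, 0
    for frame_c, frame_d in zip(enc_clean, enc_degraded):
        for a, b in zip(frame_c, frame_d):
            total += (a - b) ** 2
            count += 1
    return total / count


# Assumed toy data: a sine wave as "clean speech", plus Gaussian noise.
rng = random.Random(0)
kernels = [[rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(4)]
clean = [math.sin(2 * math.pi * 5 * n / 200) for n in range(200)]
noisy = [x + rng.gauss(0.0, 0.3) for x in clean]

print(feature_distance(clean, clean, kernels))  # 0.0 for identical inputs
print(feature_distance(clean, noisy, kernels))  # positive for a degraded input
```

When such a distance is used as a training loss, the second argument would be the output of the enhancement model and the encoder would typically be kept frozen, with the computation done in an automatic differentiation framework so that gradients reach the enhancement model.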