Within speech enhancement, there is ongoing interest in neural systems that explicitly aim to improve the perceptual quality of the processed audio. Alongside this is non-intrusive (i.e., without a clean reference) speech quality prediction, in which neural networks are trained to predict human-assigned quality labels directly from distorted audio. Combined, these areas allow for the creation of powerful new speech enhancement systems that can leverage large real-world datasets of distorted audio, by using the output of a pre-trained speech quality predictor as the sole loss function of the enhancement system. This paper aims to identify a potential pitfall of this approach, namely hallucinations introduced when the enhancement system 'tricks' the speech quality predictor.