In audio generation, signal-to-noise ratio (SNR) has long served as an objective metric for evaluating audio quality. However, recent studies have shown that SNR and its variants do not always correlate well with human perception, which raises two questions: Why does SNR fail to measure audio quality, and how can its reliability as an objective metric be improved? In this paper, we identify the inadequate measurement of phase distance as a pivotal factor and propose reformulating SNR with specially designed phase-distance terms, yielding an improved metric named GOMPSNR. We further extend the proposed formulation to derive two novel categories of loss functions, corresponding to magnitude-guided phase refinement and joint magnitude-phase optimization, respectively. In addition, we conduct extensive experiments to find an optimal combination of different loss functions. Experimental results on advanced neural vocoders demonstrate that GOMPSNR measures error more reliably than SNR. Moreover, the proposed loss functions yield substantial improvements in model performance, and a well-chosen combination of loss functions further improves overall model capability.
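For orientation, the baseline metric discussed above can be sketched as follows. This is only the textbook waveform-domain SNR, not GOMPSNR, whose formulation is not given in this abstract. The function name `snr_db`, the 440 Hz test tone, and the 16 kHz sample rate are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def snr_db(reference, estimate):
    """Textbook SNR in dB: 10 * log10(signal energy / residual energy)."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    residual = reference - estimate
    signal_energy = np.sum(reference ** 2)
    residual_energy = np.sum(residual ** 2)
    return 10.0 * np.log10(signal_energy / residual_energy)


# Illustrative signal (an assumption): one second of a 440 Hz tone at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
reference = np.sin(2 * np.pi * 440 * t)

# Same amplitude, phase shifted by 90 degrees.
shifted = np.sin(2 * np.pi * 440 * t + np.pi / 2)

# Same phase, amplitude reduced by 10 percent.
scaled = 0.9 * reference

snr_shifted = snr_db(reference, shifted)
snr_scaled = snr_db(reference, scaled)
print(f"phase-shifted: {snr_shifted:.2f} dB, amplitude-scaled: {snr_scaled:.2f} dB")
```

In this toy case the amplitude-scaled copy scores 20 dB and the phase-shifted copy about -3 dB, because standard SNR folds magnitude and phase errors into a single waveform residual. The sketch only illustrates that behavior and does not reproduce the paper's analysis.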