This paper presents NOMAD (Non-Matching Audio Distance), a differentiable perceptual similarity metric that measures the distance of a degraded signal from non-matching references. The method learns deep feature embeddings with a triplet loss guided by the Neurogram Similarity Index Measure (NSIM) so that the embeddings capture degradation intensity. At inference, the similarity score between any two audio samples is computed as the Euclidean distance between their embeddings. NOMAD is fully unsupervised and can be used in general perceptual audio tasks, both for audio analysis (e.g., quality assessment) and for generative tasks such as speech enhancement and speech synthesis. The method is evaluated on three tasks: ranking degradation intensity, predicting speech quality, and serving as a loss function for speech enhancement. Results indicate that NOMAD outperforms other non-matching reference approaches in both ranking degradation intensity and quality assessment, and performs competitively with full-reference audio metrics. NOMAD is a promising technique that mimics the human ability to assess audio quality against non-matching references, learning perceptual embeddings without the need for human-generated labels.
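The two mechanisms named above, a triplet loss during training and a Euclidean distance between embeddings at inference, can be illustrated with a minimal sketch. This is not the authors' implementation: the embedding network and the NSIM-based choice of triplets are omitted, embeddings are plain lists of floats, and the function names, the margin value, and the averaging over several references are assumptions made here for illustration only.

```python
import math

def embedding_distance(x, y):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.dist(x, y)

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet margin loss (hypothetical margin value).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    d_pos = embedding_distance(anchor, positive)
    d_neg = embedding_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

def score_against_references(test_emb, reference_embs):
    """Assumed scoring rule: mean distance from a test embedding to a set
    of non-matching reference embeddings."""
    distances = [embedding_distance(test_emb, ref) for ref in reference_embs]
    return sum(distances) / len(distances)

# Toy 2-D embeddings standing in for the output of a trained network.
anchor = [0.0, 0.0]
closer = [3.0, 4.0]    # distance 5 from the anchor
farther = [6.0, 8.0]   # distance 10 from the anchor

satisfied = triplet_margin_loss(anchor, closer, farther)   # ordering respected
violated = triplet_margin_loss(anchor, farther, closer)    # ordering reversed
score = score_against_references(anchor, [closer, farther])
```

In this toy case the loss vanishes when the less similar sample is already farther from the anchor by more than the margin, and it is positive when the ordering is reversed, which is the property a triplet loss uses to order embeddings by degradation intensity.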