This paper presents NOMAD (Non-Matching Audio Distance), a differentiable perceptual similarity metric that measures the distance of a degraded signal against non-matching references. The method learns deep feature embeddings with a triplet loss guided by the Neurogram Similarity Index Measure (NSIM), so that the embeddings capture degradation intensity. At inference, the similarity score between any two audio samples is computed as the Euclidean distance between their embeddings. NOMAD is fully unsupervised and can be used in general perceptual audio tasks, both for audio analysis (e.g., quality assessment) and for generative tasks such as speech enhancement and speech synthesis. The method is evaluated on three tasks: ranking degradation intensity, predicting speech quality, and serving as a loss function for speech enhancement. Results indicate that NOMAD outperforms other non-matching reference approaches in both ranking degradation intensity and quality assessment, and that it is competitive with full-reference audio metrics. NOMAD is a promising technique that mimics the human ability to assess audio quality against non-matching references, learning perceptual embeddings without the need for human-generated labels.
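The two mechanisms named in the abstract, a triplet loss on embeddings and a Euclidean distance at inference, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the margin value, the toy embedding vectors, and the choice to average distances over several non-matching references are all assumptions made for illustration. In particular, the abstract does not say how NSIM selects triplets; the sketch assumes the "positive" is the sample whose degradation intensity is closer to the anchor's and the "negative" is the one further away.

```python
import math


def embedding_distance(emb_a, emb_b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.dist(emb_a, emb_b)


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on embeddings.

    Assumption: 'positive' is closer to the anchor in degradation intensity
    than 'negative' (in the paper this ordering is guided by NSIM).
    The margin value is hypothetical.
    """
    d_pos = embedding_distance(anchor, positive)
    d_neg = embedding_distance(anchor, negative)
    # Loss is zero once the negative is at least `margin` further than the positive.
    return max(0.0, d_pos - d_neg + margin)


def non_matching_score(test_emb, reference_embs):
    """Mean distance from a test embedding to non-matching reference embeddings.

    Assumption: averaging over references is one plausible aggregation;
    the abstract only states that pairwise Euclidean distance is used.
    """
    distances = [embedding_distance(test_emb, ref) for ref in reference_embs]
    return sum(distances) / len(distances)


# Toy 2-D embeddings (hypothetical values, not real model outputs).
clean_refs = [(0.0, 0.0), (0.0, 0.0)]
mild = (3.0, 4.0)    # distance 5 from the origin
heavy = (6.0, 8.0)   # distance 10 from the origin

loss_satisfied = triplet_margin_loss((0.0, 0.0), mild, heavy, margin=1.0)
loss_violated = triplet_margin_loss((0.0, 0.0), heavy, mild, margin=1.0)
score_mild = non_matching_score(mild, clean_refs)
score_heavy = non_matching_score(heavy, clean_refs)
```

In this toy setup the more heavily degraded sample sits further from the clean references, so it receives the larger score, which is the ranking behavior the abstract describes. In practice the embeddings would come from a trained network rather than hand-written vectors.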