Generative adversarial networks (GANs) and diffusion models have recently achieved state-of-the-art performance in audio super-resolution (ADSR), producing perceptually convincing wideband audio from narrowband inputs. However, existing evaluations rely primarily on signal-level or perceptual metrics, leaving open the question of how closely the distribution of super-resolved audio matches that of real wideband audio. Here we address this question by analyzing the separability of real and super-resolved audio in various embedding spaces. We consider both middle-band ($4\to 16$~kHz) and full-band ($16\to 48$~kHz) upsampling tasks for speech and music, training linear classifiers on multiple types of audio embeddings to distinguish real from synthetic samples. Comparing against objective metrics and subjective listening tests, we find that embedding-based classifiers achieve near-perfect separation even when the generated audio attains high perceptual quality and state-of-the-art metric scores. This behavior is consistent across datasets and models, including recent diffusion-based approaches, and highlights a persistent gap between perceptual quality and true distributional fidelity in ADSR models.
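To make the probing setup concrete, the sketch below shows one way such a separability test could be run. It assumes embeddings for real and super-resolved clips have already been extracted with some pretrained audio encoder and saved as NumPy arrays; the file paths are hypothetical, and logistic regression stands in for the linear classifiers mentioned above.

```python
# Minimal sketch of an embedding-space separability probe:
# a linear classifier trained to tell real wideband audio from
# super-resolved output, operating on precomputed embeddings.
# The .npy paths and the choice of logistic regression are
# illustrative assumptions, not details from the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical precomputed embeddings, one row per audio clip.
real_emb = np.load("embeddings/real_wideband.npy")    # shape (N, D)
synth_emb = np.load("embeddings/super_resolved.npy")  # shape (M, D)

X = np.concatenate([real_emb, synth_emb], axis=0)
y = np.concatenate([np.zeros(len(real_emb)), np.ones(len(synth_emb))])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# A plain linear probe: if even this separates the two sets,
# their distributions differ in the chosen embedding space.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
prob = clf.predict_proba(X_test)[:, 1]
print(f"accuracy: {accuracy_score(y_test, pred):.3f}")
print(f"ROC AUC:  {roc_auc_score(y_test, prob):.3f}")
```

Under this setup, held-out accuracy near $0.5$ would indicate that the two distributions overlap in the embedding space, while the near-perfect separation reported above corresponds to accuracy approaching $1.0$.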