Although widely adopted for evaluating generated audio signals, the Fréchet Audio Distance (FAD) suffers from significant limitations, including reliance on Gaussian assumptions, sensitivity to sample size, and high computational complexity. As an alternative, we introduce the Kernel Audio Distance (KAD), a novel, distribution-free, unbiased, and computationally efficient metric based on Maximum Mean Discrepancy (MMD). Through analysis and empirical validation, we demonstrate KAD's advantages: (1) faster convergence with smaller sample sizes, enabling reliable evaluation with limited data; (2) lower computational cost, with scalable GPU acceleration; and (3) stronger alignment with human perceptual judgments. By leveraging advanced embeddings and characteristic kernels, KAD captures nuanced differences between real and generated audio. Open-sourced in the kadtk toolkit, KAD provides an efficient, reliable, and perceptually aligned benchmark for evaluating generative audio models.
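To make the MMD foundation concrete, here is a minimal NumPy sketch of the unbiased squared-MMD estimator that a metric like KAD builds on, applied to two sets of embedding vectors. The Gaussian (RBF) kernel, the fixed bandwidth, and the function names are illustrative assumptions for this sketch, not the kadtk API; in practice the kernel and its bandwidth would be chosen to be characteristic for the embedding space.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    """Gaussian kernel matrix between rows of a (m, d) and b (n, d)."""
    # Pairwise squared Euclidean distances via the expansion |a-b|^2 = |a|^2 + |b|^2 - 2 a.b
    d2 = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-d2 / (2.0 * bandwidth**2))

def mmd2_unbiased(x, y, bandwidth=1.0):
    """Unbiased estimate of squared MMD between samples x (m, d) and y (n, d)."""
    m, n = len(x), len(y)
    kxx = rbf_kernel(x, x, bandwidth)
    kyy = rbf_kernel(y, y, bandwidth)
    kxy = rbf_kernel(x, y, bandwidth)
    # Excluding the diagonal (self-similarity) terms is what makes the estimator unbiased.
    term_x = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * kxy.sum() / (m * n)
```

Unlike FAD, this estimate uses no Gaussian assumption on the embedding distribution: matching samples drive it toward zero, and mismatched distributions yield a clearly positive value, even at modest sample sizes. The three kernel matrices also map directly onto batched GPU operations, which is what makes the computation scalable.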