This paper investigates whether computing the Fr\'echet Audio Distance (FAD) with alternative, domain-specific embeddings improves its correlation with perceptual ratings of environmental sounds. We used embeddings from VGGish, PANNs, MS-CLAP, L-CLAP, and MERT, which are tailored for either music or environmental sound evaluation. FAD scores were calculated for sounds from the DCASE 2023 Task 7 dataset. Using perceptual data from the same task, we find that PANNs-WGM-LogMel yields the best correlation between FAD scores and perceptual ratings of both audio quality and perceived fit, with a Spearman correlation higher than 0.5. We also find that music-specific embeddings yielded significantly lower correlations. Interestingly, VGGish, the embedding used in the original FAD calculation, yielded a correlation below 0.1. These results underscore the critical importance of the choice of embedding in the design of the FAD metric.
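
To illustrate the evaluation described above, the following is a minimal sketch of a Spearman rank correlation between per-system FAD scores and mean perceptual ratings. It is not the authors' code: the helper names `average_ranks` and `spearman` and all numbers are hypothetical, chosen only for illustration. Because a lower FAD indicates closer agreement with the reference set, a well-behaved embedding would be expected to give a strongly negative coefficient against ratings here; whether to report the sign-flipped value or its magnitude is a convention this sketch does not settle.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: one FAD score and one mean quality rating per system.
fad_scores = [2.1, 3.4, 5.0, 7.8, 9.3]
quality_ratings = [8.5, 7.9, 6.2, 6.8, 3.1]

rho = spearman(fad_scores, quality_ratings)
print(f"Spearman rho = {rho:.2f}")  # prints: Spearman rho = -0.90
```

Repeating this computation once per embedding, with FAD scores recomputed from that embedding, would give the kind of per-embedding comparison the abstract summarizes.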