Fréchet Audio Distance (FAD) is the de facto standard for evaluating text-to-audio generation, yet its scores depend on the underlying encoder's embedding space. An encoder's training task dictates which acoustic features are preserved or discarded, causing FAD to inherit systematic task-induced biases. We decompose evaluation into Recall, Precision, and Alignment (split into semantic and structural dimensions), using log-scale normalization for fair cross-encoder comparison. Controlled experiments on six encoders across two datasets reveal a four-axis trade-off: reconstruction-based AudioMAE leads precision sensitivity; ASR-trained Whisper dominates structural detection but is blind to signal degradation; classification-trained VGGish maximizes semantic detection but penalizes legitimate intra-class variation. Since no single encoder is a universal evaluator, future metrics must shift toward evaluation-native encoders intrinsically aligned with human perception.
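As a minimal sketch of why FAD scores depend on the encoder, the snippet below computes the standard Fréchet distance between Gaussians fitted to two embedding sets and applies it to one pair of toy "audio" sets under two stand-in encoders. The names `frechet_distance`, `encoder_a`, and `encoder_b` are hypothetical, and the encoders are plain column selections rather than any of the models named above. The log-scale normalization mentioned above is not specified here, so it is not reproduced.

```python
import numpy as np


def frechet_distance(ref_emb, gen_emb):
    """Fréchet distance between Gaussians fitted to two embedding sets.

    Each input has shape (n_clips, dim) with dim >= 2.
    """
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(ref_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g; clip tiny negative values from rounding.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)


# Toy data (assumption): 4 latent features per clip; the "generated" set
# differs from the reference only by a shift in feature 0.
rng = np.random.default_rng(0)
ref = rng.normal(size=(2000, 4))
gen = ref.copy()
gen[:, 0] += 1.0


def encoder_a(x):
    """Hypothetical encoder whose embedding keeps feature 0."""
    return x[:, :2]


def encoder_b(x):
    """Hypothetical encoder whose embedding discards feature 0."""
    return x[:, 2:]


fad_a = frechet_distance(encoder_a(ref), encoder_a(gen))
fad_b = frechet_distance(encoder_b(ref), encoder_b(gen))
print(f"encoder_a: {fad_a:.4f}  encoder_b: {fad_b:.4f}")
```

In this toy setup the encoder that retains the shifted feature reports a clearly nonzero distance, while the encoder that discards it reports a distance near zero for the same pair of sets. This is the kind of task-induced blind spot described above, in miniature.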