Although generative models have made remarkable progress in recent years, their use in critical applications has been hindered by the lack of reliable ways to evaluate the quality of their generated samples. Quality covers at least two complementary concepts: fidelity and coverage. Existing quality metrics often lack reliable, interpretable values because they are uncalibrated or insufficiently robust to outliers. To address these shortcomings, we introduce two novel metrics: Clipped Density and Clipped Coverage. By clipping individual sample contributions and, for fidelity, the radii of nearest-neighbor balls, our metrics prevent out-of-distribution samples from biasing the aggregated values. Through analytical and empirical calibration, these metrics exhibit linear score degradation as the proportion of bad samples increases, so they can be interpreted directly as equivalent proportions of good samples. Extensive experiments on synthetic and real-world datasets demonstrate that Clipped Density and Clipped Coverage outperform existing methods in robustness, sensitivity, and interpretability when evaluating generative models.
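The clipping idea can be illustrated with a short sketch. The abstract does not specify the exact construction, so the code below assumes the standard k-nearest-neighbor ball formulation of density and coverage as a baseline; the quantile-based radius threshold, the choice of `k`, and the function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def knn_radii(real, k):
    """Distance from each real sample to its k-th nearest real neighbor."""
    d = np.linalg.norm(real[:, None, :] - real[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, k]  # column 0 is the zero self-distance

def clipped_density(real, fake, k=5, radius_quantile=0.9):
    # Illustrative sketch: radii are clipped at a quantile (hypothetical
    # threshold) so outlier real samples cannot inflate their balls, and each
    # generated sample's contribution is clipped at 1 so no single sample can
    # dominate the aggregated value.
    r = knn_radii(real, k)
    r = np.minimum(r, np.quantile(r, radius_quantile))      # clip ball radii
    d = np.linalg.norm(fake[:, None, :] - real[None, :, :], axis=-1)
    per_sample = (d < r[None, :]).sum(axis=1) / k           # raw contribution
    return np.clip(per_sample, 0.0, 1.0).mean()             # clip, then average

def clipped_coverage(real, fake, k=5):
    # Coverage: fraction of real k-NN balls containing at least one generated
    # sample; the per-ball indicator is already bounded by construction.
    r = knn_radii(real, k)
    d = np.linalg.norm(fake[:, None, :] - real[None, :, :], axis=-1)
    return (d < r[None, :]).any(axis=0).mean()
```

With a good generator both scores approach 1, while out-of-distribution samples lower them roughly in proportion to the bad-sample fraction, which is the calibration property the abstract claims.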