Deep generative models have achieved remarkable success in producing high-quality samples, making them a central tool across machine learning applications. Beyond sample quality, an important yet less systematically studied question is whether trained generative models faithfully capture the diversity of the underlying data distribution. In this work, we address this question by directly comparing the diversity of samples generated by state-of-the-art models with that of test samples drawn from the target data distribution, using the recently proposed reference-free, entropy-based diversity scores Vendi and RKE. Across multiple benchmark datasets, we find that test data consistently attains substantially higher Vendi and RKE scores than generated samples, suggesting a systematic downward diversity bias in modern generative models. To understand the origin of this bias, we analyze the finite-sample behavior of entropy-based diversity scores and show that their expected values increase with sample size, so diversity estimated from a finite training set systematically underestimates that of the true distribution. Consequently, training a generator to minimize its divergence from the empirical data distribution can itself induce a loss of diversity. Finally, we discuss diversity-aware regularization and guidance strategies based on Vendi and RKE as principled directions for mitigating this bias, and provide empirical evidence suggesting that they can improve the diversity of generated samples.
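To make the two scores concrete, below is a minimal sketch of how Vendi and RKE can be computed from a sample. It assumes an RBF kernel on raw feature vectors; the kernel choice, bandwidth, and feature representation are illustrative assumptions, not necessarily the exact setup used in the experiments.

```python
import numpy as np

def _rbf_kernel(X, bandwidth=1.0):
    """Pairwise RBF kernel matrix; unit diagonal by construction."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

def vendi_score(X, bandwidth=1.0):
    """Vendi score: exp of the Shannon (von Neumann) entropy of the
    eigenvalues of the normalized kernel matrix K/n."""
    n = X.shape[0]
    eigvals = np.linalg.eigvalsh(_rbf_kernel(X, bandwidth) / n)
    eigvals = eigvals[eigvals > 1e-12]  # drop numerical noise
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))

def rke_score(X, bandwidth=1.0):
    """RKE: exp of the order-2 Renyi entropy of the same spectrum.
    Since sum_i lambda_i^2 = ||K/n||_F^2, no eigendecomposition is needed."""
    n = X.shape[0]
    K = _rbf_kernel(X, bandwidth)
    return float(n ** 2 / np.sum(K ** 2))
```

Note that the order-2 entropy underlying RKE admits a closed form through the Frobenius norm of the kernel matrix, avoiding the eigendecomposition the Vendi score requires.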
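The finite-sample effect described above can also be illustrated with a toy simulation, reusing `vendi_score` from the sketch above. The mixture distribution, mode count, and sample sizes here are hypothetical choices for illustration only; the point is that the expected score grows with the number of samples drawn from a fixed distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
centers = rng.standard_normal((8, 2))  # a fixed 8-mode toy distribution

def sample_mixture(n, scale=0.1):
    """Draw n i.i.d. points from the fixed Gaussian mixture."""
    idx = rng.integers(0, len(centers), size=n)
    return centers[idx] + scale * rng.standard_normal((n, 2))

# Average Vendi score over repeated draws: larger samples from the SAME
# distribution score higher, so a finite sample underestimates the
# diversity of the underlying distribution.
for n in (50, 200, 800):
    scores = [vendi_score(sample_mixture(n)) for _ in range(20)]
    print(f"n={n:4d}  mean Vendi score = {np.mean(scores):.2f}")
```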