With the proliferation of generative AI and the increasing volume of generative data (also known as synthetic data), assessing the fidelity of generative data has become a critical concern. In this paper, we propose a discriminative approach to estimating the total variation (TV) distance between two distributions as an effective measure of generative data fidelity. Our method quantitatively characterizes the relation between the Bayes risk of classifying the two distributions and their TV distance; consequently, estimating the TV distance reduces to estimating the Bayes risk. In particular, we establish theoretical results on the convergence rate of the estimation error of the TV distance between two Gaussian distributions. We show that, with an appropriate choice of hypothesis class for the classifier, a fast convergence rate in estimating the TV distance can be achieved. Specifically, the estimation accuracy of the TV distance is shown to depend inherently on the separation of the two Gaussian distributions: the estimation error is smaller when the two distributions are farther apart. This phenomenon is also validated empirically through extensive simulations. Finally, we apply this discriminative estimation method to rank the fidelity of synthetic image data using the MNIST dataset.
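The reduction described above can be illustrated with a minimal sketch. For two distributions with equal priors, the Bayes risk $R^*$ satisfies $\mathrm{TV}(P, Q) = 1 - 2R^*$, so an empirical estimate of the classification error of the Bayes rule yields a TV estimate. The example below assumes two one-dimensional, equal-variance Gaussians, for which the Bayes classifier is simply a threshold at the midpoint of the means; the specific means, sample size, and seed are illustrative choices, not taken from the paper.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Two 1-D Gaussians N(mu0, 1) and N(mu1, 1); under equal priors the
# Bayes classifier thresholds at the midpoint (mu0 + mu1) / 2.
mu0, mu1, n = 0.0, 2.0, 200_000
x0 = rng.normal(mu0, 1.0, n)  # samples labeled class 0
x1 = rng.normal(mu1, 1.0, n)  # samples labeled class 1

threshold = (mu0 + mu1) / 2.0
# Empirical Bayes risk: average misclassification rate over both classes.
risk = 0.5 * np.mean(x0 > threshold) + 0.5 * np.mean(x1 <= threshold)

# The reduction: TV(P, Q) = 1 - 2 * (Bayes risk) under equal priors.
tv_hat = 1.0 - 2.0 * risk

# Closed form for equal-variance Gaussians: TV = 2 * Phi(|mu1 - mu0| / 2) - 1,
# where Phi is the standard normal CDF (written here via math.erf).
tv_true = (1.0 + math.erf((mu1 - mu0) / 2.0 / math.sqrt(2.0))) - 1.0

print(f"estimated TV: {tv_hat:.4f}, closed form: {tv_true:.4f}")
```

With well-separated means the empirical estimate closely tracks the closed-form value, consistent with the abstract's claim that estimation is easier when the two Gaussians are farther apart.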