Refereeing the Referees: Evaluating Two-Sample Tests for Validating Generators in Precision Sciences

We propose a robust methodology for evaluating the performance and computational efficiency of non-parametric two-sample tests, designed specifically for validating high-dimensional generative models in scientific applications such as particle physics. The study focuses on tests built from univariate integral probability measures: the sliced Wasserstein distance and the mean of the Kolmogorov-Smirnov statistics, both already discussed in the literature, and the novel sliced Kolmogorov-Smirnov statistic. These metrics can be evaluated in parallel, allowing for fast and reliable estimates of their distribution under the null hypothesis. We also compare them with the recently proposed unbiased Fréchet Gaussian Distance and the unbiased quadratic Maximum Mean Discrepancy, computed with a quartic polynomial kernel. We evaluate the tests on various distributions, focusing on their sensitivity to deformations controlled by a single parameter $\epsilon$. Our experiments include correlated Gaussians and mixtures of Gaussians in 5, 20, and 100 dimensions, as well as gluon jets from the JetNet particle physics dataset, considering both jet- and particle-level features. Our results demonstrate that tests built from one-dimensional statistics provide a level of sensitivity comparable to that of the other multivariate metrics, but at significantly lower computational cost, making them well suited to evaluating generative models in high-dimensional settings. This methodology offers an efficient, standardized tool for model comparison and can serve as a benchmark for more advanced tests, including machine-learning-based approaches.
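To make the construction concrete, the following is a minimal NumPy sketch of how a sliced statistic and its null-hypothesis threshold could be computed. It is not the authors' implementation. The function names, the number of slices, the sample sizes, the significance level, and the choice of a mean shift as the $\epsilon$-deformation are all illustrative assumptions, and the one-dimensional Wasserstein helper assumes equal sample sizes.

```python
import numpy as np


def random_directions(dim, n_slices, rng):
    """Draw unit vectors uniformly on the sphere, one per row."""
    v = rng.normal(size=(n_slices, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)


def ks_1d(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


def w1_1d(a, b):
    """1D Wasserstein-1 distance; this shortcut assumes equal sample sizes."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))


def sliced_stat(x, y, directions, stat_1d):
    """Average a 1D two-sample statistic over a set of projection directions."""
    px, py = x @ directions.T, y @ directions.T
    return float(np.mean([stat_1d(px[:, k], py[:, k])
                          for k in range(directions.shape[0])]))


def mean_ks(x, y):
    """Mean of the KS statistics along the coordinate axes."""
    return sliced_stat(x, y, np.eye(x.shape[1]), ks_1d)


def null_threshold(sampler, stat, n, n_rep, alpha, rng):
    """Estimate the (1 - alpha) quantile of the statistic under the null by
    repeatedly comparing two independent samples of the reference."""
    null = [stat(sampler(n, rng), sampler(n, rng)) for _ in range(n_rep)]
    return float(np.quantile(null, 1.0 - alpha))


# Illustrative setup: a correlated Gaussian in 5 dimensions.
rng = np.random.default_rng(0)
dim = 5
cov = 0.5 * np.ones((dim, dim)) + 0.5 * np.eye(dim)


def sampler(n, rng, eps=0.0):
    # Assumed deformation: shift every component of the mean by eps.
    return rng.multivariate_normal(np.full(dim, eps), cov, size=n)


directions = random_directions(dim, 50, rng)


def sks(x, y):
    """Sliced KS statistic on the fixed set of directions."""
    return sliced_stat(x, y, directions, ks_1d)


def sw(x, y):
    """Sliced Wasserstein distance on the fixed set of directions."""
    return sliced_stat(x, y, directions, w1_1d)


threshold = null_threshold(sampler, sks, n=500, n_rep=100, alpha=0.05, rng=rng)
x_ref = sampler(500, rng)
x_def = sampler(500, rng, eps=1.0)
print("null threshold:", threshold)
print("sliced KS on deformed sample:", sks(x_ref, x_def))
```

Each slice, and each repetition of the null estimate, is independent of the others, so the loops in `sliced_stat` and `null_threshold` could be vectorized or distributed across workers. Scanning `eps` and recording where the statistic first exceeds the threshold gives one way to compare the sensitivity of different statistics.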
