Hypothesis testing in high-dimensional data is notoriously difficult without direct access to the competing models' likelihood functions. This paper argues that statistical divergences can quantify the difference between the population distributions of the observed data and of competing models, justifying their use as the basis of a hypothesis test. We then show how modern techniques for functional optimization let us estimate many such divergences from samples of the two distributions alone, with no need for population likelihood functions. A physics-based example demonstrates how the proposed two-sample test can be implemented in practice, and we discuss the steps required to mature these ideas into an experimental framework. The code used has been made available for others to use.
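To illustrate the kind of likelihood-free two-sample test described above, the sketch below uses the (biased) kernel estimate of the squared maximum mean discrepancy (MMD) with a permutation test for calibration. This is a stand-in chosen for brevity, not the paper's own estimator; the function names, the RBF bandwidth, and the permutation count are all illustrative assumptions.

```python
import numpy as np

def rbf_kernel(x, y, bandwidth):
    # Gaussian (RBF) kernel matrix between sample sets x (n, d) and y (m, d).
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def mmd2(x, y, bandwidth=1.0):
    # Biased sample estimate of the squared maximum mean discrepancy,
    # a divergence computable from samples alone (no likelihoods needed).
    kxx = rbf_kernel(x, x, bandwidth)
    kyy = rbf_kernel(y, y, bandwidth)
    kxy = rbf_kernel(x, y, bandwidth)
    return kxx.mean() + kyy.mean() - 2 * kxy.mean()

def two_sample_test(x, y, n_perm=200, seed=0):
    # Permutation test: under the null (both samples come from the same
    # distribution), relabelling the pooled samples should not change
    # the statistic, so the observed value should not be an outlier.
    rng = np.random.default_rng(seed)
    observed = mmd2(x, y)
    pooled = np.vstack([x, y])
    n = len(x)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        count += mmd2(pooled[perm[:n]], pooled[perm[n:]]) >= observed
    p_value = (count + 1) / (n_perm + 1)  # add-one correction
    return observed, p_value

# Toy data standing in for "observed data" vs. "model samples":
# 5-dimensional Gaussians whose means differ by 0.5 per coordinate.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(100, 5))
y = rng.normal(0.5, 1.0, size=(100, 5))
stat, p = two_sample_test(x, y)
```

A small p-value here rejects the hypothesis that the model and the data share the same population distribution, which is exactly the decision the proposed test is meant to deliver in high dimensions.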