Hypothesis testing in high-dimensional data is notoriously difficult when the likelihood functions of the competing models are not directly accessible. This paper argues that statistical divergences quantify the difference between the population distribution of the observed data and that of a competing model, which justifies using them as the basis of a hypothesis test. We then point out that modern techniques for functional optimization make it possible to estimate many divergences from samples of the two distributions alone, without population likelihood functions. Using a physics-based example, we show how the proposed two-sample test can be implemented in practice, and we discuss the steps required to mature these ideas into an experimental framework.
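To make the idea of a sample-based divergence test concrete, the following is a minimal sketch and not the procedure used in the paper. It assumes the Jensen-Shannon divergence as the test statistic, estimates it through its variational lower bound with a linear logistic classifier fitted by gradient descent, and calibrates the statistic with a permutation test. The function names, the equal sample sizes, and the toy Gaussian samples standing in for observed data and model simulation are all illustrative assumptions.

```python
import numpy as np


def fit_logistic(x, y, steps=200, lr=0.5):
    """Fit a linear logistic classifier by full-batch gradient descent."""
    xb = np.hstack([x, np.ones((len(x), 1))])  # append a bias column
    w = np.zeros(xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-xb @ w))
        w -= lr * xb.T @ (p - y) / len(y)
    return w


def js_lower_bound(a, b):
    """Variational lower bound on the Jensen-Shannon divergence, in nats.

    Uses JS(P, Q) >= log 2 + 0.5 E_P[log D] + 0.5 E_Q[log(1 - D)] for any
    classifier D. Assumes len(a) == len(b). D is fitted and evaluated on the
    same samples, so the value is optimistically biased; the permutation
    test below applies the same procedure to every shuffle to compensate.
    """
    x = np.vstack([a, b])
    y = np.concatenate([np.ones(len(a)), np.zeros(len(b))])
    w = fit_logistic(x, y)
    xb = np.hstack([x, np.ones((len(x), 1))])
    d = np.clip(1.0 / (1.0 + np.exp(-xb @ w)), 1e-12, 1.0 - 1e-12)
    return (np.log(2.0)
            + 0.5 * np.mean(np.log(d[y == 1]))
            + 0.5 * np.mean(np.log(1.0 - d[y == 0])))


def permutation_test(a, b, n_perm=100, seed=0):
    """Return the observed statistic and a permutation p-value."""
    rng = np.random.default_rng(seed)
    observed = js_lower_bound(a, b)
    pooled = np.vstack([a, b])
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        pa, pb = pooled[idx[:len(a)]], pooled[idx[len(a):]]
        if js_lower_bound(pa, pb) >= observed:
            exceed += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return observed, (exceed + 1) / (n_perm + 1)


# Toy stand-ins (assumption): "data" and "model" differ by a mean shift.
rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, size=(200, 5))
model = rng.normal(0.5, 1.0, size=(200, 5))
stat, p_value = permutation_test(data, model)
```

A small `p_value` indicates that the samples are unlikely to come from the same distribution. A more flexible function class than a linear classifier would tighten the bound, at the cost of a more expensive refit for every permutation.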