We consider the problem of hypothesis testing for discrete distributions. In the standard model, where we have sample access to an underlying distribution $p$, extensive research has established optimal bounds for uniformity testing, identity testing (goodness of fit), and closeness testing (equivalence or two-sample testing). We explore these problems in a setting where a prediction of the data distribution, possibly derived from historical data or from predictive machine learning models, is available. We show that such a predictor can indeed reduce the number of samples required for all three property testing tasks. The reduction in sample complexity depends directly on the predictor's quality, measured by its total variation distance from $p$. A key advantage of our algorithms is that they adapt to the precision of the prediction: they self-adjust their sample complexity based on the accuracy of the available prediction, without any prior knowledge of that accuracy (i.e., they are consistent). Moreover, they never use more samples than the standard approaches require, even when the prediction carries no meaningful information (i.e., they are also robust). We provide lower bounds showing that the improvements in sample complexity achieved by our algorithms are information-theoretically optimal. Finally, experiments on real data show that the performance of our algorithms significantly exceeds our worst-case sample-complexity guarantees, demonstrating the practicality of our approach.
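As a point of reference for the quality measure used above, the following minimal sketch computes the total variation distance between two discrete distributions on a common finite support; the example distributions are hypothetical and stand in for a true distribution $p$ and a predictor $q$.

```python
def total_variation(p, q):
    """Total variation distance between discrete distributions p and q
    on a common support: d_TV(p, q) = (1/2) * sum_i |p(i) - q(i)|."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Hypothetical example: a predictor q for a true distribution p
# over four outcomes.
p = [0.40, 0.30, 0.20, 0.10]
q = [0.35, 0.30, 0.25, 0.10]
print(total_variation(p, q))  # ≈ 0.05
```

In the setting of the paper, the smaller this distance, the larger the savings in sample complexity the prediction-augmented testers can achieve.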