We introduce the notion of a reproducible algorithm in the context of learning. A reproducible learning algorithm is resilient to variations in its samples: with high probability, it returns the exact same output when run on two samples from the same underlying distribution. We begin by unpacking the definition, clarifying how randomness is instrumental in balancing accuracy and reproducibility. We initiate a theory of reproducible algorithms, showing how reproducibility implies desirable properties such as data reuse and efficient testability. Despite the exceedingly strong demand of reproducibility, there are efficient reproducible algorithms for several fundamental problems in statistics and learning. First, we show that any statistical query (SQ) algorithm can be made reproducible with a modest increase in sample complexity, and we use this to construct reproducible algorithms for finding approximate heavy-hitters and medians. Using these ideas, we give the first reproducible algorithm for learning halfspaces via a reproducible weak learner and a reproducible boosting algorithm. Finally, we initiate the study of lower bounds and inherent tradeoffs for reproducible algorithms, giving nearly tight sample complexity upper and lower bounds for reproducible versus nonreproducible SQ algorithms.
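To make the role of randomness concrete, here is a minimal sketch of one way a single statistical query (the mean of values in [0, 1]) might be answered reproducibly. It assumes randomized rounding to a randomly shifted grid, with the shift drawn from randomness shared across runs; the abstract does not spell out this mechanism, so treat it as an illustration only. The names `reproducible_mean`, `agreement_rate`, `shared_seed`, and `grid_width` are hypothetical.

```python
import math
import random


def reproducible_mean(sample, shared_seed, grid_width=0.1):
    """Estimate the mean of values in [0, 1], snapped to a randomly shifted grid.

    Illustrative assumption: both runs receive the same shared_seed (the
    algorithm's internal randomness) but different samples.
    """
    rng = random.Random(shared_seed)
    # The random shift is the same in every run that uses this seed.
    offset = rng.uniform(0.0, grid_width)
    empirical = sum(sample) / len(sample)
    # Find the grid cell containing the empirical mean and return its midpoint.
    cell = math.floor((empirical - offset) / grid_width)
    return offset + (cell + 0.5) * grid_width


def agreement_rate(dist_mean, n, trials, grid_width=0.1):
    """Fraction of trials in which two fresh samples yield the identical output."""
    data_rng = random.Random(0)  # fixed only so the experiment is repeatable
    agree = 0
    for seed in range(trials):
        a = [1.0 if data_rng.random() < dist_mean else 0.0 for _ in range(n)]
        b = [1.0 if data_rng.random() < dist_mean else 0.0 for _ in range(n)]
        if reproducible_mean(a, seed, grid_width) == reproducible_mean(b, seed, grid_width):
            agree += 1
    return agree / trials
```

The output is always within `grid_width / 2` of the empirical mean, so a coarser grid costs accuracy. Two runs disagree only when a grid boundary falls between their empirical means, which becomes less likely as the sample size grows or the grid coarsens. This is one simple view of the accuracy versus reproducibility tradeoff described above.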