The Limits of Assumption-free Tests for Algorithm Performance

Algorithm evaluation and comparison are fundamental questions in machine learning and statistics -- how well does an algorithm perform at a given modeling task, and which algorithm performs best? Many methods have been developed to assess algorithm performance, often based around cross-validation type strategies, retraining the algorithm of interest on different subsets of the data and assessing its performance on the held-out data points. Despite the broad use of such procedures, the theoretical properties of these methods are not yet fully understood. In this work, we explore some fundamental limits for answering these questions with limited amounts of data. In particular, we make a distinction between two questions: how good is an algorithm $A$ at the problem of learning from a training set of size $n$, versus, how good is a particular fitted model produced by running $A$ on a particular training data set of size $n$? Our main results prove that, for any test that treats the algorithm $A$ as a ``black box'' (i.e., we can only study the behavior of $A$ empirically), there is a fundamental limit on our ability to carry out inference on the performance of $A$, unless the number of available data points $N$ is many times larger than the sample size $n$ of interest. (On the other hand, evaluating the performance of a particular fitted model is easy as long as a holdout data set is available -- that is, as long as $N-n$ is not too small.) We also ask whether an assumption of algorithmic stability might be sufficient to circumvent this hardness result. Surprisingly, we find that this is not the case: the same hardness result still holds for the problem of evaluating the performance of $A$, aside from a high-stability regime where fitted models are essentially nonrandom. Finally, we also establish similar hardness results for the problem of comparing multiple algorithms.
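To make the distinction concrete, the following is a minimal sketch of the "easy" problem only: bounding the risk of one particular fitted model using the $N-n$ held-out points. It assumes a loss bounded in $[0,1]$ and i.i.d. data, and uses a standard Hoeffding bound. The names `holdout_risk_interval`, `fit_majority`, and `zero_one_loss`, and the toy data, are hypothetical illustrations, not constructions from the paper.

```python
import math
import random


def holdout_risk_interval(fit, loss, data, n, delta=0.1):
    """Hoeffding interval for the risk of ONE fitted model.

    Assumes `loss` takes values in [0, 1] and `data` is an i.i.d. sample
    of (x, y) pairs with len(data) = N > n.
    """
    train, holdout = data[:n], data[n:]
    model = fit(train)  # single black-box call to the algorithm
    losses = [loss(model, x, y) for x, y in holdout]
    mean = sum(losses) / len(losses)
    # Hoeffding: |mean - risk| <= sqrt(log(2/delta) / (2m)) w.p. >= 1 - delta,
    # where m = N - n is the holdout size.
    half_width = math.sqrt(math.log(2 / delta) / (2 * len(losses)))
    return max(0.0, mean - half_width), min(1.0, mean + half_width)


# Toy algorithm (hypothetical): always predict the training set's majority label.
def fit_majority(train):
    ones = sum(y for _, y in train)
    label = 1 if 2 * ones >= len(train) else 0
    return lambda x: label


def zero_one_loss(model, x, y):
    return float(model(x) != y)


rng = random.Random(0)
N, n = 2000, 500
data = [(rng.random(), int(rng.random() < 0.7)) for _ in range(N)]
lo, hi = holdout_risk_interval(fit_majority, zero_one_loss, data, n)
print(f"fitted-model risk lies in [{lo:.3f}, {hi:.3f}] with probability >= 0.9")
```

The interval is valid conditionally on the fitted model, and its width shrinks as $N-n$ grows. It says nothing about how the risk would vary over fresh training sets of size $n$, which is the quantity needed to assess $A$ itself and the one the hardness results concern.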
