The Limits of Assumption-free Tests for Algorithm Performance

Algorithm evaluation and comparison are fundamental questions in machine learning and statistics -- how well does an algorithm perform at a given modeling task, and which algorithm performs best? Many methods have been developed to assess algorithm performance, often based on cross-validation-type strategies: retraining the algorithm of interest on different subsets of the data and assessing its performance on the held-out data points. Despite the broad use of such procedures, the theoretical properties of these methods are not yet fully understood. In this work, we explore some fundamental limits for answering these questions with limited amounts of data. In particular, we make a distinction between two questions: how good is an algorithm $A$ at the problem of learning from a training set of size $n$, versus how good is a particular fitted model produced by running $A$ on a particular training data set of size $n$? Our main results prove that, for any test that treats the algorithm $A$ as a ``black box'' (i.e., we can only study the behavior of $A$ empirically), there is a fundamental limit on our ability to carry out inference on the performance of $A$, unless the number of available data points $N$ is many times larger than the sample size $n$ of interest. (On the other hand, evaluating the performance of a particular fitted model is easy as long as a holdout data set is available -- that is, as long as $N-n$ is not too small.) We also ask whether an assumption of algorithmic stability might be sufficient to circumvent this hardness result. Surprisingly, we find that this is not the case: the same hardness result still holds for the problem of evaluating the performance of $A$, aside from a high-stability regime where fitted models are essentially nonrandom. Finally, we establish similar hardness results for the problem of comparing multiple algorithms.
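
The distinction between the two questions can be made concrete with a small simulation. The sketch below is an illustration only and is not the test or the proof technique of the paper: the regression algorithm `fit_line`, the data generator `simulate`, the block-splitting scheme, and the per-block test size `m` are all hypothetical choices made for this example. It treats the algorithm as a black box that maps a training set to a fitted model. Evaluating one fitted model needs only a single holdout set of size $N-n$, whereas the naive estimate of the algorithm's performance at sample size $n$ relies on several disjoint training sets, so it yields only $\lfloor N/(n+m) \rfloor$ independent fitted models.

```python
import random
from statistics import mean


def simulate(N, seed=0):
    """Hypothetical data source: y = 2x + Gaussian noise, x uniform on [0, 1]."""
    rng = random.Random(seed)
    data = []
    for _ in range(N):
        x = rng.random()
        data.append((x, 2.0 * x + rng.gauss(0.0, 0.5)))
    return data


def fit_line(train):
    """Hypothetical algorithm A: least squares fit of y = a + b*x."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    # Fall back to a constant predictor if all x values coincide.
    b = sum((x - x_bar) * (y - y_bar) for x, y in train) / sxx if sxx > 0 else 0.0
    a = y_bar - b * x_bar
    return lambda x: a + b * x


def holdout_risk(model, holdout):
    """Mean squared error of one fitted model on held-out points."""
    return mean((model(x) - y) ** 2 for x, y in holdout)


def fitted_model_risk(algorithm, data, n):
    """Question 2: fit once on the first n points, test on the other N - n."""
    if n >= len(data):
        raise ValueError("need at least one holdout point (N - n > 0)")
    return holdout_risk(algorithm(data[:n]), data[n:])


def algorithm_risk(algorithm, data, n, m):
    """Question 1, naive version: average risk over disjoint blocks,
    each made of n training points followed by m test points."""
    block = n + m
    k = len(data) // block  # number of independent fitted models available
    if k < 1:
        raise ValueError("need N >= n + m for even a single block")
    risks = []
    for i in range(k):
        chunk = data[i * block:(i + 1) * block]
        risks.append(holdout_risk(algorithm(chunk[:n]), chunk[n:]))
    return mean(risks), k


if __name__ == "__main__":
    data = simulate(N=200)
    print("risk of one fitted model:", fitted_model_risk(fit_line, data, n=20))
    avg, k = algorithm_risk(fit_line, data, n=20, m=20)
    print("average risk over", k, "independent fits:", avg)
```

With $N = 200$, $n = 20$, and $m = 20$, the second estimate averages over only five independent fits, while the first uses 180 holdout points for a single fitted model. This is meant only as intuition for why the number of available data points matters so much more for the first question than for the second.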
