Machine learning methods are commonly evaluated and compared by their performance on data sets from public repositories. This allows multiple methods, often several thousand, to be evaluated under identical conditions and across time. The highest-ranked performance on a problem is referred to as state-of-the-art (SOTA) performance and is used, among other things, as a reference point for the publication of new methods. However, the highest-ranked performance is a biased estimator of SOTA and gives overly optimistic results. The mechanisms at play are those of multiplicity, a topic that is well studied in the context of multiple comparisons and multiple testing but has, as far as the authors are aware, been nearly absent from the discussion of SOTA estimates. The optimistic SOTA estimate is used as a standard for evaluating new methods, and methods with substantially inferior results are easily overlooked. In this article, we provide a probability distribution for the case of multiple classifiers, so that known analysis methods can be applied and a better SOTA estimate can be provided. We demonstrate the impact of multiplicity through a simulated example with independent classifiers. We show how classifier dependency impacts the variance, but also that the impact is limited when the accuracy is high. Finally, we discuss a real-world example: a Kaggle competition from 2020.
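The upward bias of the top-ranked score can be illustrated with a small calculation. The sketch below is not the article's own code. It assumes m independent classifiers that all share the same true accuracy p and are scored on the same number n of test items, so each observed count of correct predictions is Binomial(n, p). The function names and the values of n, p, and m are hypothetical, chosen only for illustration. Under these assumptions the maximum count has distribution function F(k)^m, where F is the binomial distribution function, and its expectation follows directly.

```python
from math import exp, lgamma, log


def binomial_cdf(n, p):
    """Return [P(X <= k) for k = 0..n] where X ~ Binomial(n, p), 0 < p < 1."""
    cdf, total = [], 0.0
    for k in range(n + 1):
        # Log-space pmf avoids overflow in the binomial coefficient.
        log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                   + k * log(p) + (n - k) * log(1.0 - p))
        total += exp(log_pmf)
        cdf.append(min(total, 1.0))
    return cdf


def expected_max_accuracy(n, p, m):
    """Expected top-ranked accuracy among m independent classifiers.

    Assumption (illustrative): every classifier has true accuracy p and is
    scored on n test items, independently of the others.
    """
    cdf = binomial_cdf(n, p)
    # For independent counts, P(max <= k) = F(k)^m, and for a non-negative
    # integer variable E[max] = sum over k of P(max > k).
    expected_correct = sum(1.0 - f ** m for f in cdf[:-1])
    return expected_correct / n


if __name__ == "__main__":
    n, p = 1000, 0.9  # hypothetical test-set size and true accuracy
    for m in (1, 10, 100, 1000):
        top = expected_max_accuracy(n, p, m)
        print(f"m={m:5d}  expected top accuracy={top:.4f}  bias={top - p:+.4f}")
```

With m = 1 the expected top accuracy equals p, and the gap grows as m increases even though no classifier is truly better than any other. Dependence between classifiers, which the article treats separately, is not modelled here.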