Machine learning methods are commonly evaluated and compared by their performance on data sets from public repositories. This allows multiple methods, often several thousand, to be evaluated under identical conditions and across time. The highest-ranked performance on a problem is referred to as state-of-the-art (SOTA) performance and is used, among other things, as a reference point for the publication of new methods. However, the highest-ranked performance is a biased estimator of SOTA, giving overly optimistic results. The mechanisms at play are those of multiplicity, a topic that is well studied in the context of multiple comparisons and multiple testing but has, as far as the authors are aware, been nearly absent from the discussion of SOTA estimates. The optimistic SOTA estimate is used as a standard for evaluating new methods, and methods with substantially inferior results are easily overlooked. In this article, we provide a probability distribution for the case of multiple classifiers so that known analysis methods can be applied and a better SOTA estimate can be provided. We demonstrate the impact of multiplicity through a simulated example with independent classifiers. We show how classifier dependency impacts the variance, but also that the impact is limited when the accuracy is high. Finally, we discuss a real-world example: a Kaggle competition from 2020.
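The multiplicity effect for independent classifiers can be illustrated with a small calculation. The sketch below is not the authors' code; it assumes m independent classifiers that all share the same true accuracy p and are scored on the same test set of size n, so that each number of correct predictions is Binomial(n, p). Under those assumptions the maximum has CDF F(k)^m, where F is the binomial CDF, and its expectation can be computed exactly. The function names and the values n = 3000, p = 0.9 are illustrative choices, not figures taken from the article.

```python
import math


def binom_pmf(k, n, p):
    """Binomial(n, p) pmf at k, computed in log space to avoid overflow (needs 0 < p < 1)."""
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1 - p)
    )
    return math.exp(log_pmf)


def expected_max_accuracy(n, p, m):
    """Expected highest observed accuracy among m independent classifiers.

    Assumption (ours, for illustration): every classifier has the same true
    accuracy p and its number of correct predictions is Binomial(n, p).
    For independent draws, P(max <= k) = F(k) ** m.
    """
    cdf = 0.0          # running binomial CDF F(k)
    prev_cdf_m = 0.0   # F(k - 1) ** m
    expected = 0.0
    for k in range(n + 1):
        cdf = min(1.0, cdf + binom_pmf(k, n, p))
        cdf_m = cdf ** m
        expected += k * (cdf_m - prev_cdf_m)  # k * P(max == k)
        prev_cdf_m = cdf_m
    return expected / n


if __name__ == "__main__":
    n, p = 3000, 0.9  # illustrative test-set size and true accuracy
    for m in (1, 10, 100, 1000):
        est = expected_max_accuracy(n, p, m)
        print(f"m={m:5d}  expected top accuracy={est:.4f}  bias={est - p:+.4f}")
```

With m = 1 the expected top score equals the true accuracy p; as m grows, the expected top score rises above p even though no classifier is actually better, which is the optimistic bias the abstract describes. Dependent classifiers are not covered by this sketch.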