Machine learning methods are commonly evaluated and compared by their performance on data sets from public repositories. This allows multiple methods, often several thousand, to be evaluated under identical conditions and across time. The highest-ranked performance on a problem is referred to as state-of-the-art (SOTA) performance and is used, among other things, as a reference point for the publication of new methods. However, the highest-ranked performance is a biased estimator of SOTA and gives overly optimistic results. The mechanisms at play are those of multiplicity, a topic that is well studied in the context of multiple comparisons and multiple testing but has, as far as the authors are aware, been nearly absent from the discussion of SOTA estimates. The optimistic SOTA estimate is used as a standard for evaluating new methods, and methods with substantially inferior results are easily overlooked. In this article, we provide a probability distribution for the case of multiple classifiers so that known analysis methods can be applied and a better SOTA estimate can be obtained. We demonstrate the impact of multiplicity through a simulated example with independent classifiers. We show how classifier dependency impacts the variance, but also that the impact is limited when the accuracy is high. Finally, we discuss a real-world example: a Kaggle competition from 2020.
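The multiplicity effect for independent classifiers can be illustrated with a short calculation. The following is a minimal sketch, not the article's own implementation. It assumes `m` independent classifiers that all share the same true accuracy `p` and are scored on the same test set of size `n`, so that each number of correct predictions is binomially distributed. The function names and the parameter values (`m = 1000`, `n = 3000`, `p = 0.9`) are illustrative assumptions. The expected value of the highest observed accuracy follows from the distribution of the maximum, P(max <= k) = F(k)^m, where F is the binomial CDF.

```python
import math


def binom_pmf(k, n, p):
    """Binomial pmf evaluated in log space to avoid overflow for large n."""
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1 - p)
    )
    return math.exp(log_pmf)


def expected_max_accuracy(m, n, p):
    """Expected highest observed accuracy among m independent classifiers.

    Assumption (illustrative): every classifier has true accuracy p, with
    0 < p < 1, and is evaluated on n test cases, independently of the others.
    Uses P(max <= k) = F(k)**m, where F is the binomial CDF.
    """
    expected_max_correct = 0.0
    cdf_prev = 0.0
    for k in range(n + 1):
        cdf_k = min(1.0, cdf_prev + binom_pmf(k, n, p))
        # P(max == k) = F(k)^m - F(k-1)^m
        expected_max_correct += k * (cdf_k ** m - cdf_prev ** m)
        cdf_prev = cdf_k
    return expected_max_correct / n


if __name__ == "__main__":
    p, n = 0.9, 3000  # hypothetical true accuracy and test set size
    for m in (1, 10, 100, 1000):
        print(m, round(expected_max_accuracy(m, n, p), 4))
```

With a single classifier the expected observed accuracy equals `p`. For `m > 1` the expected maximum lies strictly above `p`, even though no classifier is better than `p`, and this gap is the optimistic bias of the highest-ranked performance. Dependency between classifiers is not modelled in this sketch.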