Machine Learning (ML) is increasingly used across many disciplines with impressive reported results. However, recent studies suggest that the published performance of ML models is often overoptimistic. Validity concerns are underscored by findings of an inverse relationship between sample size and reported accuracy in published ML models, contrasting with the theory of learning curves, according to which accuracy should improve or remain stable as sample size increases. This paper investigates factors contributing to overoptimism in ML-driven science, focusing on overfitting and publication bias. We introduce a novel stochastic model for observed accuracy that integrates parametric learning curves with these two biases, and we construct an estimator that corrects for them in observed data. Theoretical and empirical results show that our framework can recover the underlying learning curve, providing realistic performance assessments from published results. Applying the model to meta-analyses of ML-based classification of neurological conditions, we estimate the inherent limits of ML-based prediction in each domain.