Unraveling overoptimism and publication bias in ML-driven science

Machine learning (ML) is increasingly used across many disciplines, often with impressive reported results. However, recent studies suggest that the published performance of ML models is often overoptimistic and does not reflect the accuracy these models would achieve if deployed. These validity concerns are underscored by findings of an inverse relationship between sample size and reported accuracy in published ML models across several domains. This contradicts the theory of learning curves in ML, under which accuracy is expected to improve or stay the same as sample size increases. This paper investigates the factors contributing to overoptimistic accuracy reports in ML-based science, focusing on data leakage and publication bias. We introduce a novel stochastic model for observed accuracy that integrates parametric learning curves with these two biases, and we construct an estimator based on this model that corrects for them in observed data. Theoretical and empirical results demonstrate that this framework can estimate the underlying learning curve that gives rise to the observed overoptimistic results, thereby providing more realistic assessments of ML performance from a collection of published results. We apply the model to various meta-analyses in the digital health literature, including neuroimaging-based and speech-based classifications of several neurological conditions. Our results indicate prevalent overoptimism across these fields, and we estimate the inherent limits of ML-based prediction in each domain.
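
The apparent paradox in the abstract, where reported accuracy falls as sample size grows even though the true learning curve rises, can be reproduced with a toy simulation. The sketch below is not the paper's model: it assumes a hypothetical power-law learning curve (`a_inf`, `b`, `c` are made-up values), binomial measurement noise on `n` samples, and a hard publication threshold, and it ignores data leakage entirely. Under those assumptions, small studies pass the filter only when noise inflates their accuracy, so the published mean is highest where the true accuracy is lowest.

```python
import random

def true_accuracy(n, a_inf=0.75, b=0.5, c=0.5):
    """Hypothetical power-law learning curve that rises toward a_inf as n grows."""
    return a_inf - b * n ** (-c)

def simulate_published_mean(n, threshold=0.75, n_studies=2000, rng=None):
    """Mean reported accuracy among simulated studies that pass the filter.

    Each study measures accuracy on n samples (binomial noise) and is
    'published' only if the measured accuracy reaches `threshold`.
    Both the noise model and the hard threshold are illustrative assumptions.
    """
    rng = rng or random.Random(0)
    p = true_accuracy(n)
    published = []
    for _ in range(n_studies):
        # Count correct predictions among n independent test cases.
        correct = sum(rng.random() < p for _ in range(n))
        observed = correct / n
        if observed >= threshold:
            published.append(observed)
    return sum(published) / len(published) if published else float("nan")

rng = random.Random(42)  # fixed seed so the run is reproducible
sample_sizes = [20, 50, 100, 200, 500]
results = {
    n: (true_accuracy(n), simulate_published_mean(n, rng=rng))
    for n in sample_sizes
}
for n, (truth, reported) in results.items():
    print(f"n={n:4d}  true={truth:.3f}  mean published={reported:.3f}")
```

In this toy setting the `true` column increases with `n` while the `mean published` column decreases, which is the qualitative pattern the abstract attributes in part to publication bias. Recovering the underlying curve from the filtered values is the harder estimation problem that the paper addresses.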
