Background: Many published machine learning studies are irreproducible. Methodological issues and the failure to properly account for variation introduced by the algorithms themselves or their implementations are cited as the main contributors to this irreproducibility. Problem: There exists no theoretical framework that relates experiment design choices to their potential effects on the conclusions. Without such a framework, it is much harder for practitioners and researchers to evaluate experiment results and describe the limitations of experiments. The lack of such a framework also makes it harder for independent researchers to systematically attribute the causes of failed reproducibility experiments. Objective: The objective of this paper is to develop a framework that enables applied data science practitioners and researchers to understand which experiment design choices can lead to false findings and how, and thereby to help in analyzing the conclusions of reproducibility experiments. Method: We have compiled an extensive list of factors reported in the literature that can lead to machine learning studies being irreproducible. These factors are organized and categorized in a reproducibility framework motivated by the stages of the scientific method. The factors are analyzed for how they can affect the conclusions drawn from experiments. A model comparison study is used as an example. Conclusion: We provide a framework that describes machine learning methodology from experimental design decisions to the conclusions inferred from them.