The challenge of reproducible ML: an empirical study on the impact of bugs

Reproducibility is a crucial requirement in scientific research. When the results of research studies and scientific papers prove difficult or impossible to reproduce, we face what is known as the reproducibility crisis. Although the need for reproducibility in Machine Learning (ML) is acknowledged in the literature, a major barrier is the inherent non-determinism of ML training and inference. In this paper, we establish the fundamental factors that cause non-determinism in ML systems. We then introduce a framework, ReproduceML, for the deterministic evaluation of ML experiments in a real, controlled environment. ReproduceML allows researchers to investigate the effects of software configuration on ML training and inference. Using ReproduceML, we run a case study that investigates the impact of bugs inside ML libraries on the performance of ML experiments. The study attempts to quantify the impact that bugs in a popular ML framework, PyTorch, have on the performance of trained models. To do so, we propose a comprehensive methodology for collecting buggy versions of ML libraries and running deterministic ML experiments with ReproduceML. Our initial finding is that, on our limited dataset, there is no evidence that the bugs that occurred in PyTorch affect the performance of trained models. The proposed methodology, as well as ReproduceML, can be employed for further research on non-determinism and bugs.
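To make the notion of controlling non-determinism concrete, the following is a minimal sketch of how the random sources of a Python-based experiment are commonly pinned. It is not taken from ReproduceML and does not describe its implementation; the function name `seed_everything` and the returned dictionary are hypothetical, chosen only for illustration. NumPy and PyTorch are treated as optional, and seeding alone is assumed here not to cover every source of non-determinism (for example, some GPU kernels).

```python
import os
import random


def seed_everything(seed: int) -> dict:
    """Seed common random sources; report which libraries were seeded.

    Hypothetical helper for illustration, not part of ReproduceML.
    """
    seeded = {"random": True, "numpy": False, "torch": False}

    # Affects hash randomization only in subprocesses started later;
    # for the current interpreter it must be set before launch.
    os.environ["PYTHONHASHSEED"] = str(seed)

    # Python's built-in pseudo-random number generator.
    random.seed(seed)

    try:
        import numpy as np

        # Legacy global NumPy generator (seed must fit in 32 bits).
        np.random.seed(seed)
        seeded["numpy"] = True
    except ImportError:
        pass

    try:
        import torch

        # Seeds the CPU generator and the generators of all CUDA devices.
        torch.manual_seed(seed)
        # Ask cuDNN for deterministic kernels and disable autotuning,
        # which can otherwise pick different algorithms between runs.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        seeded["torch"] = True
    except ImportError:
        pass

    return seeded


if __name__ == "__main__":
    seed_everything(42)
    first = [random.random() for _ in range(3)]
    seed_everything(42)
    second = [random.random() for _ in range(3)]
    # Same seed, same sequence from Python's generator.
    print(first == second)
```

Calling the helper once at the start of each run, before any model or data loader is constructed, is the usual pattern. Library versions, hardware, and parallel execution order can still cause runs to differ, which is the kind of residual factor a controlled environment is meant to address.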
