Interactive computational notebooks (e.g., Jupyter notebooks) are widely used in machine learning engineering (MLE) to program and share end-to-end pipelines, from data preparation to model training and evaluation. However, environment erosion, the rapid evolution of hardware and software ecosystems for machine learning, has rendered many published MLE notebooks non-reproducible in contemporary environments, hindering code reuse and scientific progress. To quantify this gap, we study 12,720 notebooks mined from 79 popular Kaggle competitions: only 35.4% remain reproducible today. Crucially, we find that environment backporting, i.e., downgrading dependencies to match the versions available at submission time, does not improve reproducibility but instead introduces additional failure modes. To address environment erosion, we design and implement MLEModernizer, an LLM-driven agentic framework that treats the contemporary environment as a fixed constraint and modernizes notebook code to restore reproducibility. MLEModernizer iteratively executes notebooks, collects execution feedback, and applies three types of targeted fixes: error-repair, runtime-reduction, and score-calibration. Evaluated on 7,402 notebooks that are non-reproducible under the baseline environment, MLEModernizer makes 5,492 (74.2%) reproducible. MLEModernizer enables practitioners to validate, reuse, and maintain MLE artifacts as the hardware and software ecosystems continue to evolve.
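The execute, collect-feedback, and fix cycle described above can be illustrated with a minimal Python sketch. This is not MLEModernizer's implementation: the `Feedback` fields, the `choose_fix_type` rules (any error first, then a time budget, then a score tolerance), and the `execute` and `propose_fix` callables (stand-ins for a notebook runner and an LLM call) are all hypothetical names and assumptions introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Feedback:
    """Outcome of one notebook execution (illustrative fields, not from the paper)."""
    error: Optional[str]    # traceback text, or None if the run finished
    runtime_s: float        # wall-clock time of the run in seconds
    score: Optional[float]  # metric produced by the run, if any


def choose_fix_type(fb: Feedback, time_budget_s: float,
                    target_score: float, tolerance: float) -> Optional[str]:
    """Map feedback to one of the three fix types, or None if no fix is needed.

    The priority order and thresholds are assumptions made for this sketch.
    """
    if fb.error is not None:
        return "error-repair"
    if fb.runtime_s > time_budget_s:
        return "runtime-reduction"
    if fb.score is None or abs(fb.score - target_score) > tolerance:
        return "score-calibration"
    return None


def modernize(code: str,
              execute: Callable[[str], Feedback],
              propose_fix: Callable[[str, str, Feedback], str],
              time_budget_s: float,
              target_score: float,
              tolerance: float,
              max_rounds: int = 5) -> Tuple[str, bool]:
    """Run, inspect feedback, and apply a fix until the run is clean or rounds run out.

    Returns the final code and whether it passed all three checks.
    """
    for _ in range(max_rounds):
        fb = execute(code)
        fix_type = choose_fix_type(fb, time_budget_s, target_score, tolerance)
        if fix_type is None:
            return code, True
        # Hypothetical LLM step: rewrite the code given the fix type and feedback.
        code = propose_fix(code, fix_type, fb)
    # Verify the result of the last fix before reporting the outcome.
    fb = execute(code)
    ok = choose_fix_type(fb, time_budget_s, target_score, tolerance) is None
    return code, ok
```

In this sketch the environment is never modified; only `code` changes between rounds, which mirrors the idea of treating the contemporary environment as a fixed constraint. A caller would supply its own `execute` and `propose_fix` functions.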