Jupyter notebooks are widely used for machine learning (ML) prototyping. Yet few debugging tools are designed for ML code in notebooks, potentially due to the lack of benchmarks. We introduce JunoBench, the first benchmark dataset of real-world crashes in Python-based ML notebooks. JunoBench contains 111 curated and reproducible crashes from public Kaggle notebooks, each paired with a verifiable fix, spanning popular ML libraries, including TensorFlow/Keras, PyTorch, Scikit-learn, Pandas, and NumPy, as well as notebook-specific out-of-order execution issues. To support reproducibility and ease of use, JunoBench offers a unified execution environment in which crashes and their fixes can be reliably reproduced. By providing realistic crashes and their resolutions, JunoBench facilitates bug detection, localization, diagnosis, and repair tailored to the interactive and iterative nature of notebook-based ML development.