Jupyter notebooks are widely used for machine learning (ML) development due to their support for interactive and iterative experimentation. However, ML notebooks are highly prone to bugs, with crashes being among the most disruptive. Despite the practical importance of such crashes, systematic methods for crash detection and diagnosis in ML notebooks remain largely unexplored. We present CRANE-LLM, a novel approach that augments large language models (LLMs) with structured runtime information extracted from the notebook kernel state to detect and diagnose crashes before executing a target cell. Given previously executed cells and a target cell, CRANE-LLM combines static code context with runtime information, including object types, tensor shapes, and data attributes, to predict whether the target cell will crash (detection) and explain the underlying cause (diagnosis). We evaluate CRANE-LLM on JunoBench, a benchmark of 222 ML notebooks comprising 111 pairs of crashing and corresponding non-crashing notebooks across multiple ML libraries and crash root causes. Across three state-of-the-art LLMs (Gemini, Qwen, and GPT-5), runtime information improves crash detection and diagnosis by 7-10 percentage points in accuracy and 8-11 percentage points in F1-score, with larger gains for diagnosis. Improvements vary across ML libraries, crash causes, and LLMs, and depend on the integration of complementary categories of runtime information.
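The idea of pairing static code context with kernel-state information can be illustrated with a minimal sketch. This is not CRANE-LLM's actual implementation: the function names `summarize_namespace` and `build_prompt`, the choice of recorded fields, and the prompt layout are all assumptions made for illustration. The sketch walks a namespace dictionary (such as the user namespace of a notebook kernel), records each variable's type, shape, and dtype where available, and places that summary next to the executed cells and the target cell.

```python
from typing import Any, Dict, List


def summarize_namespace(namespace: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Collect type, shape, and simple data attributes for each variable.

    Hypothetical helper: the fields recorded here are illustrative choices.
    """
    summary: Dict[str, Dict[str, Any]] = {}
    for name, value in namespace.items():
        if name.startswith("_"):
            continue  # skip private and kernel-internal names
        info: Dict[str, Any] = {"type": type(value).__name__}
        # Arrays, tensors, and data frames usually expose a tuple-like shape.
        shape = getattr(value, "shape", None)
        if isinstance(shape, (tuple, list)):
            info["shape"] = tuple(shape)
        dtype = getattr(value, "dtype", None)
        if dtype is not None:
            info["dtype"] = str(dtype)
        # For plain containers, record the length instead.
        if isinstance(value, (list, tuple, dict, set, str)):
            info["len"] = len(value)
        summary[name] = info
    return summary


def build_prompt(executed_cells: List[str], target_cell: str,
                 namespace: Dict[str, Any]) -> str:
    """Assemble a prompt from static code context plus the runtime summary."""
    runtime_lines = [
        f"{name}: " + ", ".join(f"{key}={val}" for key, val in info.items())
        for name, info in summarize_namespace(namespace).items()
    ]
    return "\n".join([
        "## Previously executed cells",
        *executed_cells,
        "## Runtime information",
        *runtime_lines,
        "## Target cell",
        target_cell,
        "## Task",
        "Will the target cell crash? If so, explain the root cause.",
    ])
```

Inside a running notebook, the namespace could be obtained from the kernel's user variables; the resulting prompt would then be sent to an LLM of choice. A shape mismatch between two tensors, for example, is visible in the runtime section even when the static code alone does not reveal it.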