LLMs are transforming software development, yet current code generation and code repair benchmarks mainly assess syntactic and functional correctness in simple, single-error settings. LLMs' ability to autonomously find and fix runtime logical errors in complex data science code remains largely unexplored. To address this gap, we introduce DSDBench: the Data Science Debugging Benchmark, the first benchmark for the systematic evaluation of LLMs on multi-hop error tracing and multi-bug detection in data science code debugging. DSDBench adapts datasets from existing data science task benchmarks, such as DABench and MatPlotBench, featuring realistic data science debugging tasks with automatically synthesized multi-hop, multi-bug code snippets. DSDBench comprises 1,117 annotated samples with 741 cause-effect error pairs and runtime error messages. Evaluations of state-of-the-art LLMs on DSDBench reveal significant performance gaps, highlighting the challenge of debugging logical runtime errors in data science code. DSDBench offers a crucial resource for evaluating and improving LLMs' debugging and reasoning capabilities, enabling more reliable AI-assisted data science in the future. DSDBench is publicly available at https://github.com/KevinCL16/DSDBench.