Code large language models (LLMs) have made significant progress in code debugging by directly generating correct code from a buggy code snippet. Programming benchmarks, typically consisting of buggy code snippets and their associated test cases, are used to assess the debugging capabilities of LLMs. However, many existing benchmarks focus primarily on Python and offer limited language diversity (e.g., DebugBench and DebugEval). To advance multilingual debugging with LLMs, we propose MDEVAL, the first massively multilingual debugging benchmark, which includes 3.6K test samples across 18 programming languages and covers the automated program repair (APR), code review (CR), and bug identification (BI) tasks. Furthermore, we introduce the debugging instruction corpus MDEVAL-INSTRUCT by injecting bugs into correct multilingual queries and solutions (xDebugGen). We then train a multilingual debugger, xDebugCoder, on MDEVAL-INSTRUCT as a strong baseline, designed to handle bugs across a wide range of programming languages (e.g., "Missing Mut" in Rust and "Misused Macro Definition" in C). Extensive experiments on MDEVAL reveal a notable performance gap between open-source models and closed-source LLMs (e.g., the GPT and Claude series), highlighting substantial room for improvement in multilingual code debugging.