Training large language models (LLMs) on Python execution traces grounds them in code execution and enables line-by-line execution prediction for whole Python programs, effectively turning them into neural interpreters (FAIR CodeGen Team et al., 2025). However, developers rarely execute programs step by step; instead, they use debuggers to pause execution at chosen breakpoints and step through only the relevant portions while inspecting or modifying program variables. Existing neural interpreter approaches lack such interactive control. To address this limitation, we introduce neural debuggers: language models that emulate traditional debuggers, supporting operations such as stepping into, over, or out of functions, as well as setting breakpoints at specific source lines. We show that neural debuggers, obtained either by fine-tuning large LLMs or by pre-training smaller models from scratch, can reliably model both forward execution (predicting future states and outputs) and inverse execution (inferring prior states or inputs) conditioned on debugger actions. Evaluated on CruxEval, our models achieve strong performance on both output and input prediction tasks, demonstrating robust conditional execution modeling. Our work takes first steps towards future agentic coding systems in which neural debuggers serve as a world model for simulated debugging environments, providing execution feedback or enabling agents to interact with real debugging tools. This capability lays the foundation for more powerful code generation, program understanding, and automated debugging.
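To make the debugger actions concrete, the following is a minimal sketch of how ground-truth traces for "step into" versus "step over" might be recorded with Python's standard `sys.settrace` hook. It is an illustration only and not the data pipeline used in this work; the helper `record_trace`, its `step_into` flag, and the toy functions `square` and `f` are hypothetical names introduced for this example.

```python
import sys


def record_trace(func, *args, step_into=True):
    """Run func(*args) and record one entry per executed source line.

    Each entry is (function name, line offset within that function, locals snapshot).
    step_into=True also records lines inside called functions ("step into");
    step_into=False records only lines of func itself ("step over").
    Assumption: func is not recursive (recursive frames would still be recorded).
    """
    events = []
    target_code = func.__code__

    def tracer(frame, event, arg):
        if event == "call":
            # Returning None disables line tracing for this new frame.
            if frame.f_code is target_code or step_into:
                return tracer
            return None
        if event == "line":
            offset = frame.f_lineno - frame.f_code.co_firstlineno
            # Copy the locals so later mutations do not change the snapshot.
            events.append((frame.f_code.co_name, offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the previous trace hook
    return result, events


def square(x):
    y = x * x
    return y


def f(n):
    a = square(n)
    b = a + 1
    return b


result_into, trace_into = record_trace(f, 3, step_into=True)
result_over, trace_over = record_trace(f, 3, step_into=False)

for name, offset, local_vars in trace_into:
    print("into:", name, offset, local_vars)
for name, offset, local_vars in trace_over:
    print("over:", name, offset, local_vars)
```

In this sketch, the "step into" trace interleaves the lines of `square` with those of `f`, while the "step over" trace contains only the lines of `f`, with the effect of the call visible in the next locals snapshot. A neural debugger in the sense described above would be asked to predict such states given the chosen action, rather than obtain them by running the program.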