TraceCoder: A Trace-Driven Multi-Agent Framework for Automated Debugging of LLM-Generated Code

Large Language Models (LLMs) often generate code with subtle but critical bugs, especially for complex tasks. Existing automated repair methods typically rely on superficial pass/fail signals, offering limited visibility into program behavior and hindering precise error localization. In addition, without a way to learn from prior failures, repair processes often fall into repetitive and inefficient cycles. To overcome these challenges, we present TraceCoder, a collaborative multi-agent framework that emulates the observe-analyze-repair process of human experts. The framework first instruments the code with diagnostic probes to capture fine-grained runtime traces, enabling deep insight into its internal execution. It then conducts causal analysis on these traces to accurately identify the root cause of the failure. This process is further enhanced by a novel Historical Lesson Learning Mechanism (HLLM), which distills insights from prior failed repair attempts to inform subsequent correction strategies and prevent recurrence of similar mistakes. To ensure stable convergence, a Rollback Mechanism enforces that each repair iteration constitutes a strict improvement toward the correct solution. Comprehensive experiments across multiple benchmarks show that TraceCoder achieves up to a 34.43% relative improvement in Pass@1 accuracy over existing advanced baselines. Ablation studies verify the significance of each system component, with the iterative repair process alone contributing a 65.61% relative gain in accuracy. Furthermore, TraceCoder significantly outperforms leading iterative methods in terms of both accuracy and cost-efficiency.
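As a rough illustration of the repair loop described above, the following minimal Python sketch combines a rollback rule with a running list of lessons from failed attempts. It is not the TraceCoder implementation: the names `count_passed`, `repair_loop`, and `propose_fix` are hypothetical, the LLM agents are replaced by a toy proposer, trace instrumentation is omitted, and measuring "strict improvement" by the number of passing tests is an assumption made only for this example.

```python
from typing import Callable, List, Tuple

# A test case is (positional arguments, expected result). Hypothetical format.
Test = Tuple[tuple, object]


def count_passed(func: Callable, tests: List[Test]) -> int:
    """Count tests whose output matches; an exception counts as a failure."""
    passed = 0
    for args, expected in tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass
    return passed


def repair_loop(initial: Callable, propose_fix: Callable,
                tests: List[Test], max_iters: int = 5):
    """Iteratively repair `initial`, keeping a candidate only if it strictly
    improves the pass count (assumed improvement metric for this sketch)."""
    best, best_score = initial, count_passed(initial, tests)
    lessons: List[str] = []  # notes distilled from failed attempts
    for step in range(max_iters):
        if best_score == len(tests):
            break  # all tests pass, nothing left to repair
        candidate = propose_fix(best, lessons)  # stand-in for the repair agent
        score = count_passed(candidate, tests)
        if score > best_score:
            best, best_score = candidate, score  # accept strict improvement
        else:
            # Roll back: keep `best` unchanged and record why the attempt failed.
            lessons.append(
                f"attempt {step}: {score}/{len(tests)} passed, "
                f"no improvement over {best_score}; rolled back"
            )
    return best, best_score, lessons


def buggy_abs(x):
    return x  # wrong for negative inputs


def make_toy_proposer():
    """Toy proposer: first offers a non-improving fix, then a correct one."""
    candidates = iter([lambda x: -x, lambda x: x if x >= 0 else -x])

    def propose(current, lessons):
        return next(candidates)

    return propose


TESTS = [((3,), 3), ((-2,), 2), ((0,), 0)]

if __name__ == "__main__":
    fixed, score, lessons = repair_loop(buggy_abs, make_toy_proposer(), TESTS)
    print(score, lessons)
```

In this toy run the first proposed fix does not increase the pass count, so it is discarded and logged as a lesson; the second is accepted. A real system would pass the lessons, together with runtime traces, to the model that proposes the next fix.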
