Large language models often generate buggy code. Existing methods iteratively refine the generated code using feedback signals such as test failures and self-critiques. Such signals are either too coarse-grained or too high-level to tell the model where the bug is. In this work, we present Flare, an iterative framework in which a lightweight diagnostic model predicts line-level suspiciousness signals for bug localization and code refinement. Because diagnostic predictions are inherently uncertain, Flare searches over the top-k suspicious regions and selects the best candidate according to execution outcomes. Experiments on LiveCodeBench and BigCodeBench with five base LLMs show that, even without candidate search (k=1), Flare outperforms the strongest baseline by an absolute margin ranging from 1.72% to 7.42%. Furthermore, searching over 10 candidates yields an average improvement of 8.50% over no candidate search. When evaluated in isolation, our lightweight diagnostic model outperforms recent fault localization methods, demonstrating that it can provide reliable fine-grained guidance for code refinement.
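The top-k search with execution-based selection described above can be illustrated with a minimal sketch. It is not the Flare implementation: it assumes each suspicious region is a single line, that the execution outcome is the number of passing tests, and that suspiciousness scores are already available. The names `top_k_lines`, `refine_with_search`, `propose_fix`, and `count_passed` are hypothetical, and the toy repair function stands in for the LLM call that would rewrite a flagged region.

```python
from typing import Callable, List, Sequence, Tuple


def top_k_lines(scores: Sequence[float], k: int) -> List[int]:
    """Return indices of the k highest-scoring lines, most suspicious first."""
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return ranked[:k]


def refine_with_search(
    lines: List[str],
    scores: Sequence[float],
    k: int,
    propose_fix: Callable[[List[str], int], List[str]],
    count_passed: Callable[[List[str]], int],
) -> Tuple[List[str], int]:
    """Try one repair per suspicious line; keep the candidate passing the most tests."""
    best_lines, best_passed = lines, count_passed(lines)
    for idx in top_k_lines(scores, k):
        candidate = propose_fix(lines, idx)  # hypothetical: an LLM call in practice
        passed = count_passed(candidate)
        if passed > best_passed:
            best_lines, best_passed = candidate, passed
    return best_lines, best_passed


# Toy usage: a buggy `add` with assumed (made-up) suspiciousness scores per line.
buggy = ["def add(a, b):", "    total = a - b", "    return total"]
scores = [0.05, 0.90, 0.40]


def toy_fix(lines: List[str], idx: int) -> List[str]:
    """Stand-in for the model's repair: flip '-' to '+' on the flagged line."""
    patched = list(lines)
    patched[idx] = patched[idx].replace("-", "+")
    return patched


def toy_count_passed(lines: List[str]) -> int:
    """Execute the candidate and count how many test cases it passes."""
    namespace: dict = {}
    exec("\n".join(lines), namespace)
    cases = [((1, 2), 3), ((0, 0), 0), ((5, 5), 10)]
    return sum(namespace["add"](*args) == want for args, want in cases)


fixed, passed = refine_with_search(buggy, scores, k=2,
                                   propose_fix=toy_fix,
                                   count_passed=toy_count_passed)
```

Setting `k=1` reduces the loop to a single repair attempt at the most suspicious line, which corresponds to the no-search setting in the abstract. Larger `k` trades extra executions for robustness to a wrong top-ranked prediction.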