Large Language Models (LLMs) have shown remarkable aptitude in code generation but still struggle on challenging programming tasks. Self-repair, in which the model debugs and fixes mistakes in its own code, has recently become a popular way to boost performance in these settings. However, the literature contains very few studies of how and when self-repair works effectively, and it is unclear to what extent a model can give accurate feedback on why code is wrong when the same model generated that code. In this paper, we analyze the ability of GPT-3.5 and GPT-4 to perform self-repair on APPS, a challenging dataset consisting of diverse coding challenges. To do so, we first establish a new evaluation strategy, dubbed pass@t, that measures the pass rate of the tasks against the total number of tokens sampled from the model, enabling a fair comparison to purely sampling-based approaches. With this evaluation strategy, we find that self-repair is effective only for GPT-4. We also observe that self-repair is bottlenecked by the feedback stage: using GPT-4 to give feedback on the programs generated by GPT-3.5, and using expert human programmers to give feedback on the programs generated by GPT-4, we unlock significant performance gains.
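To make the idea behind pass@t concrete, the following is a minimal sketch of measuring a pass rate against a token budget. It is a simplified illustration, not the estimator used in the paper: the function name `pass_rate_at_budget`, the `Attempt` record, and the toy numbers are all assumptions introduced here. Each task is assumed to carry an ordered list of attempts (initial samples or repairs), each with the number of tokens it consumed and whether it passed the task's tests.

```python
from typing import List, Tuple

# Hypothetical record: (tokens sampled for this attempt, did it pass all tests?)
Attempt = Tuple[int, bool]


def pass_rate_at_budget(tasks: List[List[Attempt]], budget: int) -> float:
    """Fraction of tasks solved without exceeding `budget` sampled tokens.

    Attempts are consumed in order; a task counts as solved if some attempt
    passes while the cumulative token count is still within the budget.
    """
    if not tasks:
        return 0.0
    solved = 0
    for attempts in tasks:
        spent = 0
        for tokens, passed in attempts:
            spent += tokens
            if spent > budget:
                break  # budget exhausted before this attempt completed
            if passed:
                solved += 1
                break
    return solved / len(tasks)


# Toy data (made up for illustration only).
example_tasks = [
    [(300, False), (250, True)],   # solved after 550 tokens
    [(400, True)],                 # solved after 400 tokens
    [(500, False), (500, False)],  # never solved
]

for t in (400, 600):
    print(t, pass_rate_at_budget(example_tasks, t))
```

Because the horizontal axis is tokens rather than number of samples, a repair-based strategy and a pure resampling strategy can be compared at equal sampling cost, which is the point of the comparison described above.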