Large Language Models (LLMs) have shown remarkable aptitude in code generation but still struggle on challenging programming tasks. Self-repair, in which the model debugs and fixes mistakes in its own code, has recently become a popular way to boost performance in these settings. However, the literature contains few studies of how and when self-repair works effectively, and one might wonder to what extent a model can really give accurate feedback on why code is wrong when it generated that code itself. In this paper, we analyze the ability of GPT-3.5 and GPT-4 to perform self-repair on APPS, a challenging dataset consisting of diverse coding challenges. To do so, we first establish a new evaluation strategy dubbed pass@t that measures the pass rate of the tasks against the total number of tokens sampled from the model, enabling a fair comparison to purely sampling-based approaches. With this evaluation strategy, we find that self-repair is effective only for GPT-4. We also observe that self-repair is bottlenecked by the feedback stage: using GPT-4 to give feedback on the programs generated by GPT-3.5, and using expert human programmers to give feedback on the programs generated by GPT-4, we unlock significant performance gains.
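To make the idea behind pass@t concrete, the following is a minimal sketch of how a pass rate could be paired with the number of tokens sampled for a single configuration. It is an illustration under stated assumptions, not the paper's estimator: the `Run` record, the `pass_rate_vs_tokens` function, and the toy numbers are all hypothetical, and the abstract does not specify how token counts are aggregated across tasks (the mean per task is assumed here).

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One model invocation on a task (hypothetical record, not from the paper)."""
    task_id: str
    tokens_sampled: int  # tokens drawn from the model in this invocation
    passed: bool         # True if a program produced here passed the task's tests


def pass_rate_vs_tokens(runs):
    """Return (mean tokens sampled per task, fraction of tasks passed).

    Assumption: a task counts as passed if any of its runs passed, and its
    token cost is the sum over all of its runs (initial samples, feedback,
    and repairs alike).
    """
    if not runs:
        raise ValueError("need at least one run")
    by_task = {}
    for r in runs:
        tokens, passed = by_task.get(r.task_id, (0, False))
        by_task[r.task_id] = (tokens + r.tokens_sampled, passed or r.passed)
    n = len(by_task)
    mean_tokens = sum(t for t, _ in by_task.values()) / n
    pass_rate = sum(1 for _, p in by_task.values() if p) / n
    return mean_tokens, pass_rate


# Toy, made-up numbers purely for illustration.
baseline = [Run("a", 1000, True), Run("b", 1000, False)]
repair = [Run("a", 400, True), Run("b", 400, False), Run("b", 600, True)]
print(pass_rate_vs_tokens(baseline))
print(pass_rate_vs_tokens(repair))
```

Plotting such points for several sampling budgets would place a sampling-only baseline and a self-repair configuration on the same token axis, which is the kind of comparison the abstract describes.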