Large language models have shown remarkable aptitude in code generation, but still struggle on challenging tasks. Self-repair, in which the model debugs and fixes mistakes in its own code, has recently become a popular way to boost performance in these settings. However, the literature contains only very limited studies of how and when self-repair works effectively, and one might wonder to what extent a model is really capable of repairing mistakes in code that it generated itself. In this paper, we analyze the ability of Code Llama, GPT-3.5 and GPT-4 to perform self-repair on problems taken from HumanEval or APPS. We find that when the cost of carrying out repair is taken into account, gains are often modest, vary significantly between subsets of the data, and are sometimes not present at all. We hypothesize that this is because self-repair is bottlenecked by the model's ability to provide feedback on its own code; boosting the feedback with stronger models, we observe performance gains even in settings where the model does not benefit from self-repair. Finally, we find that providing the model with feedback from human participants greatly benefits repair even for GPT-4, and we carry out a brief qualitative analysis of the differences observed.