Large language models have shown remarkable aptitude in code generation, but still struggle with complex tasks. Self-repair -- in which the model debugs and repairs its own code -- has recently become a popular way to boost performance in these settings. However, despite its increasing popularity, existing studies of self-repair have been limited in scope; in many settings, its efficacy remains poorly understood. In this paper, we analyze the ability of Code Llama, GPT-3.5, and GPT-4 to perform self-repair on problems taken from HumanEval and APPS. We find that when the cost of carrying out repair is taken into account, performance gains are often modest, vary considerably between subsets of the data, and are sometimes absent altogether. We hypothesize that this is because self-repair is bottlenecked by the model's ability to provide feedback on its own code; using a stronger model to artificially boost the quality of the feedback, we observe substantially larger performance gains. Similarly, a small-scale study in which we provide GPT-4 with feedback from human participants suggests that even for the strongest models, self-repair still lags far behind what can be achieved with human-level debugging.