From LLMs to Agents in Programming: The Impact of Providing an LLM with a Compiler

Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language generation, program generation, and software development. However, the source code they generate does not always meet quality requirements and may fail to compile. Many studies have therefore moved toward agents that can reason about a problem before generating the source code for its solution. The goal of this paper is to study the degree to which such agents benefit from access to software development tools, in our case the gcc compiler. We conduct a computational experiment on 699 programming tasks in C from the RosettaCode dataset. We evaluate how integration with a compiler shifts the role of the language model from a passive generator to an active agent capable of iteratively developing runnable programs based on compiler feedback. We evaluate 16 language models ranging in size from small (135 million parameters) through medium (3 billion) to large (70 billion). Our results show that access to a compiler improved compilation success by 5.3 to 79.4 percentage points without affecting the semantics of the generated programs. For the tasks where the agents outperformed the baselines, syntax errors dropped by 75% and errors related to undefined references dropped by 87%. We also observed that, in some cases, smaller models with a compiler outperform larger models with a compiler. We conclude that access to software engineering tools is essential for LLMs: it enhances their performance and reduces the need for large models in software engineering, which in turn can reduce the energy footprint.
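The compiler feedback loop described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `gcc_compile` and `compile_loop`, the `generate` callback standing in for the language model, and the `max_rounds` limit are all assumptions made for illustration, since the abstract does not specify the prompting scheme or the number of repair rounds.

```python
import os
import subprocess
import tempfile
from typing import Callable, Optional, Tuple

# Hypothetical interfaces, chosen for this sketch only.
# generate(task, previous_source, diagnostics) -> new C source
Generator = Callable[[str, Optional[str], Optional[str]], str]
# compile_fn(source) -> (compiled_ok, diagnostics_text)
Compiler = Callable[[str], Tuple[bool, str]]


def gcc_compile(source: str) -> Tuple[bool, str]:
    """Compile and link C source with gcc; return (success, stderr text).

    Linking is included so that undefined references are reported too.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src_path = os.path.join(tmp, "main.c")
        with open(src_path, "w") as f:
            f.write(source)
        result = subprocess.run(
            ["gcc", src_path, "-o", os.path.join(tmp, "main")],
            capture_output=True,
            text=True,
        )
    return result.returncode == 0, result.stderr


def compile_loop(
    task: str,
    generate: Generator,
    compile_fn: Compiler = gcc_compile,
    max_rounds: int = 3,  # assumed limit, not taken from the paper
) -> Tuple[str, bool, int]:
    """Generate a program, then repair it using compiler diagnostics.

    Returns (final_source, compiled_ok, number_of_compile_attempts).
    """
    source = generate(task, None, None)  # first draft, no feedback yet
    for attempt in range(1, max_rounds + 1):
        ok, diagnostics = compile_fn(source)
        if ok:
            return source, True, attempt
        if attempt < max_rounds:
            # Hand the failing source and the compiler output back to the model.
            source = generate(task, source, diagnostics)
    return source, False, max_rounds
```

Passing the compiler as a parameter keeps the loop testable without gcc installed. With `max_rounds=1` the function reduces to a single generate-and-compile pass, which roughly corresponds to a baseline without compiler feedback.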
