We propose LangProp, a framework for iteratively optimizing code generated by large language models (LLMs) in both supervised and reinforcement learning settings. While LLMs can generate sensible coding solutions zero-shot, these solutions are often sub-optimal; in code generation tasks in particular, the initial code is likely to fail on certain edge cases. LangProp automatically evaluates the code's performance on a dataset of input-output pairs, catches any exceptions, and feeds the results back to the LLM in the training loop, so that the LLM can iteratively improve the code it generates. Adopting a metric- and data-driven training paradigm for this code optimization procedure makes it easy to adapt findings from traditional machine learning techniques such as imitation learning, DAgger, and reinforcement learning. We show LangProp's applicability to general domains such as Sudoku and CartPole, and demonstrate the first proof of concept of automated code optimization for autonomous driving in CARLA. We show that LangProp can generate interpretable and transparent policies that can be verified and improved in a metric- and data-driven way. Our code is available at https://github.com/shuishida/LangProp.
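The evaluate-and-feed-back loop described in the abstract can be illustrated with a minimal Python sketch. This is not LangProp's actual API: the names `evaluate`, `optimize`, and `propose_fix` are hypothetical, the LLM is replaced by a stub that returns a hand-written fix, and the score is plain accuracy on the input-output pairs. Candidate code is run with `exec`, which in practice should only be done inside a sandbox.

```python
from typing import Any, Callable, List, Tuple

Dataset = List[Tuple[tuple, Any]]  # (positional args, expected output) pairs


def evaluate(source: str, func_name: str, dataset: Dataset) -> Tuple[float, List[str]]:
    """Run candidate code on every pair; return (accuracy, feedback messages)."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # defines the candidate function (unsafe outside a sandbox)
        func = namespace[func_name]
    except Exception as exc:  # syntax errors, missing function, etc.
        return 0.0, [f"failed to load code: {exc!r}"]
    correct, feedback = 0, []
    for args, expected in dataset:
        try:
            got = func(*args)
        except Exception as exc:  # catch runtime failures and report them
            feedback.append(f"input {args!r} raised {exc!r}")
            continue
        if got == expected:
            correct += 1
        else:
            feedback.append(f"input {args!r}: expected {expected!r}, got {got!r}")
    score = correct / len(dataset) if dataset else 0.0
    return score, feedback


def optimize(source: str, func_name: str, dataset: Dataset,
             propose_fix: Callable[[str, List[str]], str],
             max_iters: int = 5) -> Tuple[str, float]:
    """Keep the best-scoring code; ask propose_fix (the LLM stand-in) for revisions."""
    best_source = source
    best_score, feedback = evaluate(best_source, func_name, dataset)
    for _ in range(max_iters):
        if best_score == 1.0:
            break
        candidate = propose_fix(best_source, feedback)
        score, cand_feedback = evaluate(candidate, func_name, dataset)
        if score > best_score:
            best_source, best_score, feedback = candidate, score, cand_feedback
    return best_source, best_score


# Toy task (assumption, for illustration only): division that returns 0.0 when b == 0.
DATASET: Dataset = [((6, 3), 2.0), ((1, 0), 0.0)]
INITIAL = "def safe_divide(a, b):\n    return a / b\n"
FIXED = "def safe_divide(a, b):\n    return a / b if b != 0 else 0.0\n"


def stub_propose_fix(source: str, feedback: List[str]) -> str:
    """Stand-in for an LLM call; a real system would prompt with source and feedback."""
    return FIXED


if __name__ == "__main__":
    print(evaluate(INITIAL, "safe_divide", DATASET))
    print(optimize(INITIAL, "safe_divide", DATASET, stub_propose_fix))
```

In this sketch the initial code fails on the division-by-zero edge case, the exception is captured as a feedback message, and the revised candidate is kept only because it scores higher on the dataset. Swapping the stub for a real LLM call and the accuracy for a task-specific metric is left open here.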