LLMs can be used in a variety of code-related tasks, such as translating from one programming language to another, implementing natural-language requirements, and summarizing code. Artifacts generated by state-of-the-art LLM technology are expected to be useful in the sense that a user can adopt an LLM-generated artifact after a small number of easy modifications. Quantifying this vague notion is challenging, and it is therefore hard to determine the quality of code-related LLM solutions. We refer to the evaluation of LLM solutions using LLM judgment as "LLM as a Judge", or LaaJ for short. In this work we introduce a methodology to generate and evaluate LaaJ implementations, utilizing an automatically generated benchmark. The purpose of the benchmark is twofold: it is used both to develop and validate the LaaJs, and to validate and test the LLM code-related solution using the LaaJs. To that end, we developed an automated benchmark generation engine, which generates code in multiple programming languages for multiple code-related tasks and serves as the input for LaaJ evaluation. We utilize a graph representation, G, of the potential code-related generations. The graph vertices are generated artifacts, and edges represent possible generations, e.g., the generation of a Java program from its natural-language requirements. Utilizing a chain of LLM agents and G, we generate code-related artifacts. Using cycles in G, we formulate expectations on the generated artifacts. These expectations enable the development and testing of reliable LLM judgment of the usefulness of the artifacts generated by the solution. Our approach enables the creation of high-quality solutions for code-related tasks.
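The graph representation G and its cycles can be illustrated with a minimal sketch. The artifact kinds, edge set, and cycle length below are hypothetical examples, not the paper's actual benchmark configuration; each enumerated cycle (e.g., requirements to Java and back) yields a round-trip expectation that a LaaJ can be developed and tested against.

```python
# Sketch of the generation graph G: vertices are artifact kinds,
# directed edges are possible LLM generation tasks.
# (Artifact names and edges are illustrative assumptions.)
edges = {
    "requirements": ["java", "python"],  # implement requirements in a language
    "java": ["requirements", "python"],  # summarize to requirements / translate
    "python": ["requirements", "java"],
}

def find_cycles(edges, start, max_len=3):
    """Enumerate simple cycles through `start` with at most max_len edges."""
    cycles = []

    def dfs(node, path):
        for nxt in edges.get(node, []):
            if nxt == start and len(path) >= 2:
                cycles.append(path + [start])
            elif nxt not in path and len(path) < max_len:
                dfs(nxt, path + [nxt])

    dfs(start, [start])
    return cycles

# Each cycle, e.g. requirements -> java -> requirements, induces an
# expectation: the final artifact should preserve the meaning of the
# initial one, which the LaaJ under development is asked to judge.
for cycle in find_cycles(edges, "requirements"):
    print(" -> ".join(cycle))
```

In this sketch the cycle `requirements -> java -> requirements` expresses the expectation that implementing the requirements in Java and then summarizing the Java program back into requirements should yield an equivalent specification; disagreement signals a failure that a reliable LaaJ should detect.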