You Name It, I Run It: An LLM Agent to Execute Tests of Arbitrary Projects

The ability to execute the test suite of a project is essential in many scenarios, e.g., to assess code quality and code coverage, to validate code changes made by developers or automated tools, and to ensure compatibility with dependencies. Despite its importance, executing the test suite of a project can be challenging in practice because different projects use different programming languages, software ecosystems, build systems, testing frameworks, and other tools. These challenges make it difficult to create a reliable, universal test execution method that works across different projects. This paper presents ExecutionAgent, an automated technique that installs arbitrary projects, configures them to run test cases, and produces project-specific scripts to reproduce the setup. Inspired by the way a human developer would address this task, our approach is a large language model-based agent that autonomously executes commands and interacts with the host system. The agent uses meta-prompting to gather guidelines on the latest technologies related to the given project, and it iteratively refines its process based on feedback from the previous steps. Our evaluation applies ExecutionAgent to 50 open-source projects that use 14 different programming languages and many different build and testing tools. The approach successfully executes the test suites of 33/50 projects, while matching the test results of ground truth test suite executions with a deviation of only 7.5%. These results improve over the best previously available technique by 6.6x. The costs imposed by the approach are reasonable, with an execution time of 74 minutes and LLM costs of 0.16 dollars, on average per project. We envision ExecutionAgent serving as a valuable tool for developers, automated programming tools, and researchers who need to execute tests across a wide variety of projects.
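To make the execute-and-refine loop described above more concrete, the following is a minimal sketch and not the ExecutionAgent implementation. It assumes a hypothetical `choose_next` callback that stands in for the LLM: it receives the history of commands with their exit codes and outputs, and returns the next shell command or `None` to stop. The names `run_command`, `agent_loop`, `to_script`, and `scripted_policy` are illustrative only, and the sketch omits meta-prompting and any sandboxing of the host system.

```python
import subprocess
from typing import Callable, List, Optional, Tuple

# One entry per executed step: (command, exit code, combined output).
History = List[Tuple[str, int, str]]


def run_command(command: str, timeout: int = 60) -> Tuple[int, str]:
    """Run a shell command and return (exit code, stdout + stderr)."""
    try:
        proc = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout
        )
        return proc.returncode, proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        # 124 mirrors the exit code of the `timeout` utility; a convention only.
        return 124, "command timed out"


def agent_loop(
    choose_next: Callable[[History], Optional[str]], max_steps: int = 10
) -> History:
    """Repeatedly ask the policy for a command, run it, and feed back the result."""
    history: History = []
    for _ in range(max_steps):
        command = choose_next(history)  # hypothetical stand-in for an LLM call
        if command is None:
            break
        exit_code, output = run_command(command)
        history.append((command, exit_code, output))
    return history


def to_script(history: History) -> str:
    """Collect the commands that succeeded into a reproducible shell script."""
    commands = [cmd for cmd, exit_code, _ in history if exit_code == 0]
    return "#!/bin/sh\nset -e\n" + "\n".join(commands) + "\n"


def scripted_policy(history: History) -> Optional[str]:
    """Toy policy: follow a fixed plan and stop at the first failing step."""
    plan = ["echo installing", "echo running tests"]
    if history and history[-1][1] != 0:
        return None  # a real agent would inspect the output and try a fix
    return plan[len(history)] if len(history) < len(plan) else None


if __name__ == "__main__":
    print(to_script(agent_loop(scripted_policy)))
```

Keeping the policy behind a plain callback means the loop can be exercised without any model access. A real agent would replace `scripted_policy` with a call that turns the history into a prompt and parses the reply into a command.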
