PromptPex: Automatic Test Generation for Language Model Prompts

Large language models (LLMs) are used in many applications, and the prompts for these models are integrated into software as code-like artifacts. These prompts behave much like traditional software: they take inputs, generate outputs, and perform a specific function. However, prompts differ from traditional code in many ways and require new approaches to ensure that they are robust. For example, unlike traditional software, the output of a prompt depends on the AI model that interprets it. Also, while natural language prompts are easy to modify, the impact of a change is harder to predict. New approaches to testing, debugging, and modifying prompts with respect to the model running them are required. To address some of these issues, we developed PromptPex, an LLM-based tool that automatically generates and evaluates unit tests for a given prompt. PromptPex extracts input and output specifications from a prompt and uses them to generate diverse, targeted, and valid unit tests. These tests help identify regressions when a prompt is changed and also help show how different models interpret a prompt. We use PromptPex to generate tests for eight benchmark prompts and evaluate the quality of the generated tests by checking whether they cause each of four diverse models to produce invalid output. PromptPex consistently creates tests that result in more invalid model outputs than a carefully constructed baseline LLM-based test generator. Furthermore, by extracting concrete specifications from the input prompt, PromptPex allows prompt writers to clearly understand and test specific aspects of their prompts. The source code of PromptPex is available at https://github.com/microsoft/promptpex.
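
The workflow described above (extract specifications from a prompt, generate tests from them, then count invalid model outputs) can be illustrated with a minimal sketch. This is not the PromptPex implementation; the names `LLM`, `PromptTest`, `extract_rules`, `generate_tests`, and `count_invalid`, as well as the wording of the requests sent to the model, are hypothetical and chosen only for illustration. An LLM is assumed to be any function that maps a text request to a text reply.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumption: an "LLM" is any callable from request text to reply text.
LLM = Callable[[str], str]


@dataclass
class PromptTest:
    rule: str        # the output rule this test targets
    test_input: str  # the input to feed to the prompt under test


def extract_rules(llm: LLM, prompt: str) -> List[str]:
    """Ask an LLM to list the output rules of a prompt, one per line."""
    reply = llm(f"List the output rules in this prompt, one per line:\n{prompt}")
    # Keep non-empty lines only; each line is treated as one rule.
    return [line.strip() for line in reply.splitlines() if line.strip()]


def generate_tests(llm: LLM, prompt: str, rules: List[str]) -> List[PromptTest]:
    """Generate one test input per extracted rule."""
    tests = []
    for rule in rules:
        request = f"Write one input that checks the rule '{rule}' for this prompt:\n{prompt}"
        tests.append(PromptTest(rule=rule, test_input=llm(request).strip()))
    return tests


def count_invalid(model: LLM, judge: LLM, prompt: str, tests: List[PromptTest]) -> int:
    """Run each test on the model under test and count outputs the judge rejects."""
    invalid = 0
    for test in tests:
        output = model(f"{prompt}\n\nInput: {test.test_input}")
        verdict = judge(f"Rule: {test.rule}\nOutput: {output}\nAnswer OK or ERR.")
        if verdict.strip().upper().startswith("ERR"):
            invalid += 1
    return invalid
```

Passing the model under test and the judge as separate callables makes it possible to run the same generated tests against several models and compare their invalid-output counts, which is the kind of comparison the evaluation above describes.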
