Evaluating LLMs with a single prompt has proven unreliable, since small changes to the prompt can lead to significant performance differences. However, generating the prompt variations needed for a more robust multi-prompt evaluation is challenging, which limits its adoption in practice. To address this, we introduce PromptSuite, a framework that automatically generates prompt variations. PromptSuite is flexible, working out of the box on a wide range of tasks and benchmarks. It follows a modular prompt design that allows controlled perturbations to each component, and it is extensible, supporting the addition of new components and perturbation types. Through a series of case studies, we show that PromptSuite provides meaningful variations that support robust evaluation practices. All resources, including the Python API, source code, user-friendly web interface, and demonstration video, are available at: https://eliyahabba.github.io/PromptSuite/.
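To make the idea of modular prompts with controlled, per-component perturbations concrete, the following is a minimal sketch in Python. It does not use the PromptSuite API: the component names (`instruction`, `question`), the perturbation functions, and `generate_variations` are all hypothetical and chosen only for illustration. The sketch assumes a prompt is an ordered set of named text components and that each component has its own list of allowed perturbations.

```python
from itertools import product
from typing import Callable, Dict, List

Perturbation = Callable[[str], str]


def identity(text: str) -> str:
    """Leave the component unchanged (the unperturbed baseline)."""
    return text


def upper_case(text: str) -> str:
    """Hypothetical surface-level perturbation: change the casing."""
    return text.upper()


def add_polite_prefix(text: str) -> str:
    """Hypothetical wording perturbation: prepend a polite phrase."""
    return "Please " + text[0].lower() + text[1:]


def generate_variations(
    components: Dict[str, str],
    perturbations: Dict[str, List[Perturbation]],
) -> List[str]:
    """Build every prompt obtained by perturbing each component independently.

    Components without an entry in `perturbations` are kept as-is.
    """
    options: List[List[str]] = []
    for name, text in components.items():
        funcs = perturbations.get(name, [identity])
        # Apply each perturbation and drop duplicate results, keeping order.
        variants = list(dict.fromkeys(f(text) for f in funcs))
        options.append(variants)
    # The Cartesian product combines one variant per component into a prompt.
    return ["\n".join(parts) for parts in product(*options)]


# Illustrative components; only the instruction is perturbed here.
components = {
    "instruction": "Answer the question.",
    "question": "What is 2 + 2?",
}
perturbations = {
    "instruction": [identity, upper_case, add_polite_prefix],
}

prompts = generate_variations(components, perturbations)
for prompt in prompts:
    print(prompt)
    print("---")
```

Because each component carries its own perturbation list, a variation can be attributed to a specific component, and a new component or perturbation type is added by extending the two dictionaries. This mirrors the modular and extensible design described above only in spirit.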