Evaluating large language models (LLMs) is crucial for assessing their performance and mitigating potential security risks. In this paper, we introduce PromptBench, a unified library for evaluating LLMs. It consists of several key components that researchers can easily use and extend: prompt construction, prompt engineering, dataset and model loading, adversarial prompt attacks, dynamic evaluation protocols, and analysis tools. PromptBench is designed as an open, general, and flexible codebase for research purposes, facilitating original research on creating new benchmarks, deploying downstream applications, and designing new evaluation protocols. The code is available at https://github.com/microsoft/promptbench and will be continuously supported.