Evaluating large language models (LLMs) is crucial for assessing their performance and mitigating potential security risks. In this paper, we introduce PromptBench, a unified library for evaluating LLMs. It consists of several key components that researchers can easily use and extend: prompt construction, prompt engineering, dataset and model loading, adversarial prompt attacks, dynamic evaluation protocols, and analysis tools. PromptBench is designed as an open, general, and flexible codebase for research purposes, intended to facilitate original work on creating new benchmarks, deploying downstream applications, and designing new evaluation protocols. The code is available at https://github.com/microsoft/promptbench and will be continuously supported.
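To make the listed components concrete, the following is a minimal, self-contained sketch of how prompt construction, a model call, a character-level adversarial perturbation of the instruction, and accuracy analysis might fit together. It does not use PromptBench's actual API: the toy dataset, the keyword-based `toy_model`, and every function name below are assumptions introduced purely for illustration.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical toy dataset: (sentence, label) pairs for sentiment classification.
DATASET: List[Tuple[str, str]] = [
    ("a wonderful and moving film", "positive"),
    ("dull plot and terrible acting", "negative"),
    ("great performances all around", "positive"),
    ("a boring waste of time", "negative"),
]

# Hypothetical instruction; only this part is perturbed by the attack.
INSTRUCTION = "Classify the sentiment of the sentence as positive or negative."


def make_template(instruction: str) -> str:
    """Prompt construction: attach a fixed input slot to an instruction."""
    return instruction + "\nSentence: {content}\nAnswer:"


def build_prompt(template: str, content: str) -> str:
    """Fill the template with one dataset sample."""
    return template.format(content=content)


def toy_model(prompt: str) -> str:
    """Stand-in for an LLM call: a keyword rule that keeps the sketch runnable."""
    text = prompt.lower()
    negative_cues = ("dull", "terrible", "boring", "waste")
    return "negative" if any(cue in text for cue in negative_cues) else "positive"


def swap_adjacent_chars(text: str, rng: random.Random) -> str:
    """Character-level perturbation: swap one randomly chosen adjacent pair."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def accuracy(
    model: Callable[[str], str],
    template: str,
    dataset: List[Tuple[str, str]],
) -> float:
    """Analysis: fraction of samples whose prediction matches the label."""
    correct = sum(
        model(build_prompt(template, text)) == label for text, label in dataset
    )
    return correct / len(dataset)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the perturbation is reproducible
    clean_acc = accuracy(toy_model, make_template(INSTRUCTION), DATASET)
    attacked_instruction = swap_adjacent_chars(INSTRUCTION, rng)
    attacked_acc = accuracy(toy_model, make_template(attacked_instruction), DATASET)
    print(f"clean accuracy:    {clean_acc:.2f}")
    print(f"attacked accuracy: {attacked_acc:.2f}")
    print(f"performance drop:  {clean_acc - attacked_acc:.2f}")
```

Because `toy_model` is a keyword rule rather than a real LLM, it is insensitive to typos in the instruction; replacing it with an actual model call is what would make the gap between clean and attacked accuracy informative.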