Most popular benchmarks for comparing LLMs rely on a limited set of prompt templates, which may not fully capture the LLMs' abilities and can affect the reproducibility of results on leaderboards. Many recent works empirically verify prompt sensitivity and advocate for changes in LLM evaluation. In this paper, we consider the problem of estimating the performance distribution across many prompt variants instead of finding a single prompt to evaluate with. We introduce PromptEval, a method for estimating performance across a large set of prompts that borrows strength across prompts and examples to produce accurate estimates under practical evaluation budgets. The resulting distribution yields performance quantiles from which various robust performance metrics can be constructed (e.g., the top 95% quantile or the median). We prove that PromptEval consistently estimates the performance distribution and demonstrate its efficacy empirically on three prominent LLM benchmarks: MMLU, BIG-bench Hard, and LMentry; for example, PromptEval can accurately estimate performance quantiles across 100 prompt templates on MMLU with a budget equivalent to two single-prompt evaluations. Moreover, we show how PromptEval can be useful in LLM-as-a-judge and best-prompt identification applications.
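To make the "borrowing strength" idea concrete, the following is a minimal sketch, not the paper's actual estimator: it assumes a simple Rasch-style logistic model with one effect per prompt template and one per example, fit by gradient ascent on the observed (prompt, example) cells only, then used to impute the unevaluated cells. The function name `estimate_prompt_performance` and the arrays `Y` and `observed` are illustrative placeholders.

```python
import numpy as np

def estimate_prompt_performance(Y, observed, n_iters=200, lr=0.5):
    """Sketch of a strength-borrowing estimator (assumed, simplified model).

    Fits P(correct) = sigmoid(theta_j + beta_i) to a sparsely observed
    correctness matrix Y (n_prompts x n_examples), where theta_j is a
    prompt-template effect and beta_i an example-difficulty effect, then
    imputes the unobserved cells and averages per prompt.
    """
    n_prompts, n_examples = Y.shape
    theta = np.zeros(n_prompts)    # prompt-template effects
    beta = np.zeros(n_examples)    # example-difficulty effects

    for _ in range(n_iters):
        logits = theta[:, None] + beta[None, :]
        p = 1.0 / (1.0 + np.exp(-logits))
        # Bernoulli log-likelihood gradient, restricted to observed cells.
        resid = np.where(observed, Y - p, 0.0)
        theta += lr * resid.sum(axis=1) / np.maximum(observed.sum(axis=1), 1)
        beta += lr * resid.sum(axis=0) / np.maximum(observed.sum(axis=0), 1)

    # Keep observed outcomes; fill the rest with model predictions.
    p_hat = 1.0 / (1.0 + np.exp(-(theta[:, None] + beta[None, :])))
    filled = np.where(observed, Y, p_hat)
    return filled.mean(axis=1)     # estimated accuracy per prompt template

# Usage: quantiles of the estimated performance distribution.
# per_prompt = estimate_prompt_performance(Y, observed)
# q05, med, q95 = np.quantile(per_prompt, [0.05, 0.5, 0.95])
```

Because every evaluated cell informs both its prompt's and its example's effect, even a small budget (a fraction of the full prompt-by-example grid) constrains all per-prompt estimates, which is what allows quantiles across, say, 100 templates to be recovered from roughly the cost of two single-prompt evaluations.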