Recent advances in large language models (LLMs) have led to the development of various evaluation benchmarks. These benchmarks typically rely on a single instruction template to evaluate all LLMs on a given task. In this paper, we comprehensively analyze the brittleness of results obtained via single-prompt evaluations across 6.5M instances, involving 20 different LLMs and 39 tasks from 3 benchmarks. To improve the robustness of the analysis, we propose evaluating LLMs with a set of diverse prompts instead. We discuss evaluation metrics tailored to specific use cases (e.g., LLM developers vs. developers interested in a specific downstream task), which allow a more reliable and meaningful assessment of LLM capabilities. We then implement these criteria and evaluate multiple models, providing insights into the true strengths and limitations of current LLMs.
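As a rough illustration of evaluating with a set of prompts rather than a single template, the following minimal sketch aggregates one model's per-prompt accuracy on one task. The function name, the template identifiers, the accuracy values, and the choice of summary statistics (mean, max, min, spread) are all assumptions made for illustration; they are not the metrics defined in the paper.

```python
from statistics import mean, pstdev


def summarize_over_prompts(scores_by_prompt):
    """Aggregate one model's per-prompt accuracy on a single task.

    scores_by_prompt maps a prompt-template id to an accuracy in [0, 1].
    The returned statistics are illustrative choices, not the paper's metrics.
    """
    scores = list(scores_by_prompt.values())
    if not scores:
        raise ValueError("need at least one prompt score")
    return {
        "mean": mean(scores),      # average behaviour across templates
        "max": max(scores),        # best case, if the prompt can be tuned
        "min": min(scores),        # worst case for an unlucky template
        "spread": pstdev(scores),  # how brittle a single-prompt score would be
    }


# Hypothetical accuracies, invented purely for this example.
example = {"template_a": 0.50, "template_b": 0.75, "template_c": 0.25}
summary = summarize_over_prompts(example)
print(summary)
```

Under these assumptions, an average over templates would plausibly suit someone comparing general-purpose models, while the maximum would plausibly suit a developer who can pick the best prompt for one downstream task; a large spread signals that any single-prompt score is unreliable.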