Recent advances in large language models (LLMs) have led to the development of various evaluation benchmarks. These benchmarks typically rely on a single instruction template for evaluating all LLMs on a specific task. In this paper, we comprehensively analyze the brittleness of results obtained via single-prompt evaluations across 6.5M instances, involving 20 different LLMs and 39 tasks from 3 benchmarks. To improve the robustness of the analysis, we propose evaluating LLMs with a set of diverse prompts instead. We discuss evaluation metrics tailored to specific use cases (e.g., LLM developers vs. developers interested in a specific downstream task), ensuring a more reliable and meaningful assessment of LLM capabilities. We then implement these criteria and evaluate multiple models, providing insights into the true strengths and limitations of current LLMs.
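To illustrate why a single instruction template can give brittle results, the following minimal sketch compares model rankings under individual prompts with rankings aggregated over a prompt set. The accuracies are made-up toy numbers, and the names `scores`, `rank_models`, `ranking_for_prompt`, and `aggregate` are hypothetical. The two aggregations shown (mean over prompts and max over prompts) are assumptions chosen for illustration, not the metrics defined in the paper.

```python
from statistics import mean

# Toy, made-up accuracies (not results from the paper):
# scores[model][i] is the accuracy of `model` under prompt paraphrase i.
scores = {
    "model_a": [0.70, 0.50, 0.60],
    "model_b": [0.65, 0.62, 0.64],
    "model_c": [0.40, 0.66, 0.45],
}


def rank_models(values):
    """Return model names sorted from best to worst by their score."""
    return sorted(values, key=values.get, reverse=True)


def ranking_for_prompt(scores, prompt_index):
    """Rank models using only one prompt, as a single-prompt benchmark would."""
    return rank_models({m: s[prompt_index] for m, s in scores.items()})


def aggregate(scores, fn):
    """Collapse each model's per-prompt scores into one number with `fn`."""
    return {m: fn(s) for m, s in scores.items()}


num_prompts = len(next(iter(scores.values())))
single_prompt_rankings = [ranking_for_prompt(scores, i) for i in range(num_prompts)]

# Assumed aggregation 1: the mean over prompts rewards robustness to phrasing.
mean_ranking = rank_models(aggregate(scores, mean))
# Assumed aggregation 2: the max over prompts suits a setting where the
# prompt can be tuned for one specific task.
max_ranking = rank_models(aggregate(scores, max))

for i, ranking in enumerate(single_prompt_rankings):
    print(f"prompt {i}: {ranking}")
print(f"mean over prompts: {mean_ranking}")
print(f"max over prompts:  {max_ranking}")
```

In this toy data each of the three prompts yields a different model ranking, and the mean-based and max-based aggregations also disagree with each other, which is the kind of instability that motivates choosing the aggregation to match the use case.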