Recent advances in large language models (LLMs) have led to the development of various evaluation benchmarks. These benchmarks typically rely on a single instruction template for evaluating all LLMs on a specific task. In this paper, we comprehensively analyze the brittleness of results obtained via single-prompt evaluations across 6.5M instances, involving 20 different LLMs and 39 tasks from 3 benchmarks. To improve the robustness of the analysis, we propose evaluating LLMs with a set of diverse prompts instead. We discuss evaluation metrics tailored to specific use cases (e.g., LLM developers vs. developers interested in a specific downstream task), aiming for a more reliable and meaningful assessment of LLM capabilities. We then implement these criteria and evaluate multiple models, providing insights into the true strengths and limitations of current LLMs.
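The idea of scoring a model over a set of prompts rather than a single template can be illustrated with a minimal sketch. Everything named here is hypothetical and chosen for illustration: the `toy_model` function, the three templates, the two examples, and the choice of mean, maximum, and spread as aggregates. The abstract does not specify the paper's metrics, so this should not be read as the authors' implementation.

```python
from statistics import mean

# Hypothetical task data: (input, gold answer) pairs.
EXAMPLES = [("2+2", "4"), ("3+3", "6")]

# Hypothetical paraphrases of the same instruction.
TEMPLATES = [
    "Q: {input}\nA:",
    "Compute {input}.",
    "What is {input}?",
]

ANSWERS = {"2+2": "4", "3+3": "6"}


def toy_model(prompt):
    """Stand-in for an LLM that is deliberately sensitive to prompt wording."""
    if prompt.startswith("Compute"):
        return "unknown"
    if prompt.startswith("What") and "3+3" in prompt:
        return "unknown"
    for question, answer in ANSWERS.items():
        if question in prompt:
            return answer
    return "unknown"


def evaluate_multi_prompt(model, templates, examples):
    """Return exact-match accuracy for each template on the same examples."""
    scores = {}
    for template in templates:
        correct = sum(
            model(template.format(input=x)) == y for x, y in examples
        )
        scores[template] = correct / len(examples)
    return scores


def average_score(scores):
    """Mean accuracy over templates (assumed aggregate for comparing models)."""
    return mean(scores.values())


def best_score(scores):
    """Best single-template accuracy (assumed aggregate for one downstream task)."""
    return max(scores.values())


def spread(scores):
    """Gap between best and worst template, a simple indicator of brittleness."""
    return max(scores.values()) - min(scores.values())


if __name__ == "__main__":
    per_prompt = evaluate_multi_prompt(toy_model, TEMPLATES, EXAMPLES)
    for template, accuracy in per_prompt.items():
        print(repr(template), accuracy)
    print("average:", average_score(per_prompt))
    print("best:", best_score(per_prompt))
    print("spread:", spread(per_prompt))
```

In this toy setup a single-template evaluation could report anything from 0.0 to 1.0 depending on which template happened to be chosen, whereas the aggregates summarize behavior across the whole prompt set. Which aggregate is appropriate depends on the use case, which is the distinction the abstract draws between LLM developers and developers of a specific downstream task.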