Large language models (LLMs) are gaining popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level but also at the societal level, where it supports a better understanding of their potential risks. Over the past few years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. First, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, education, natural and social sciences, agent applications, and other areas. Second, we answer the 'where' and 'how' questions by examining the evaluation methods and benchmarks, which serve as crucial components in assessing the performance of LLMs. We then summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges that lie ahead in LLM evaluation. Our aim is to offer valuable insights to researchers in the realm of LLM evaluation, thereby aiding the development of more proficient LLMs. Our key point is that evaluation should be treated as an essential discipline to better assist the development of LLMs. We continuously maintain the related open-source materials at: https://github.com/MLGroupJLU/LLM-eval-survey.