Large language models (LLMs) have demonstrated remarkable capabilities across a broad spectrum of tasks. They have attracted significant attention and been deployed in numerous downstream applications. Nevertheless, like a double-edged sword, LLMs also present potential risks. They can leak private data or produce inappropriate, harmful, or misleading content. Additionally, the rapid progress of LLMs raises concerns about the potential emergence of superintelligent systems without adequate safeguards. To capitalize effectively on LLM capabilities while ensuring their safe and beneficial development, it is critical to evaluate LLMs rigorously and comprehensively. This survey endeavors to offer a panoramic perspective on the evaluation of LLMs. We categorize the evaluation of LLMs into three major groups: knowledge and capability evaluation, alignment evaluation, and safety evaluation. In addition to a comprehensive review of the evaluation methodologies and benchmarks for these three aspects, we collate a compendium of evaluations of LLMs' performance in specialized domains, and discuss the construction of comprehensive evaluation platforms that cover LLM evaluations of capabilities, alignment, safety, and applicability. We hope that this overview will stimulate further research interest in the evaluation of LLMs, with the ultimate goal of making evaluation a cornerstone in guiding the responsible development of LLMs. We envision that this will channel their evolution in a direction that maximizes societal benefit while minimizing potential risks. A curated list of related papers is publicly available at https://github.com/tjunlp-lab/Awesome-LLMs-Evaluation-Papers.