Large language models (LLMs) have demonstrated remarkable capabilities across a broad spectrum of tasks. They have attracted significant attention and been deployed in numerous downstream applications. Nevertheless, like a double-edged sword, LLMs also present potential risks: they may leak private data or yield inappropriate, harmful, or misleading content. Additionally, the rapid progress of LLMs raises concerns about the potential emergence of superintelligent systems without adequate safeguards. To capitalize on LLM capabilities while ensuring their safe and beneficial development, it is critical to evaluate LLMs rigorously and comprehensively. This survey endeavors to offer a panoramic perspective on the evaluation of LLMs. We categorize LLM evaluation into three major groups: knowledge and capability evaluation, alignment evaluation, and safety evaluation. In addition to reviewing the evaluation methodologies and benchmarks for these three aspects, we collate a compendium of evaluations of LLMs' performance in specialized domains, and discuss the construction of comprehensive evaluation platforms that cover capabilities, alignment, safety, and applicability. We hope that this overview will stimulate further research interest in the evaluation of LLMs, with the ultimate goal of making evaluation a cornerstone in guiding the responsible development of LLMs. We envision that this will channel their evolution in a direction that maximizes societal benefit while minimizing potential risks. A curated list of related papers is publicly available at https://github.com/tjunlp-lab/Awesome-LLMs-Evaluation-Papers.