From pre-trained language models (PLMs) to large language models (LLMs), the field of natural language processing (NLP) has seen steep performance gains and wide practical adoption. How a research field is evaluated guides the direction in which it improves. However, LLMs are extremely hard to evaluate thoroughly, for two reasons. First, traditional NLP tasks have become inadequate because LLMs already perform so well on them. Second, existing evaluation tasks struggle to keep up with the wide range of real-world applications. To tackle these problems, existing works have proposed various benchmarks to better evaluate LLMs. To clarify the numerous evaluation tasks in both academia and industry, we survey multiple papers on LLM evaluation. We summarize four core competencies of LLMs: reasoning, knowledge, reliability, and safety. For each competency, we introduce its definition, corresponding benchmarks, and metrics. Under this competency architecture, similar tasks are grouped to reflect the corresponding ability, and new tasks can easily be added to the system. Finally, we offer suggestions on the future direction of LLM evaluation.