Holistic Evaluation of Language Models

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, Yuta Koreeda

From arXiv. Authored by the Center for Research on Foundation Models (CRFM) at the Stanford Institute for Human-Centered Artificial Intelligence (HAI). Project page: https://crfm.stanford.edu/helm/v1.0

Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e., use cases) and metrics (i.e., desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g., question answering for neglected English dialects, metrics for trustworthiness). Second, we adopt a multi-metric approach: we measure 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios when possible (87.5% of the time). This ensures that metrics beyond accuracy don't fall by the wayside and that trade-offs are clearly exposed. We also perform 7 targeted evaluations, based on 26 targeted scenarios, to analyze specific aspects (e.g., reasoning, disinformation). Third, we conduct a large-scale evaluation of 30 prominent language models (spanning open, limited-access, and closed models) on all 42 scenarios, 21 of which were not previously used in mainstream LM evaluation. Prior to HELM, models on average were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. We improve this to 96.0%: now all 30 models have been densely benchmarked on the same core scenarios and metrics under standardized conditions. Our evaluation surfaces 25 top-level findings. For full transparency, we release all raw model prompts and completions publicly for further analysis, as well as a general modular toolkit. We intend for HELM to be a living benchmark for the community, continuously updated with new scenarios, metrics, and models.
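The counts in the abstract can be cross-checked with a little arithmetic. The sketch below is illustrative only and is not part of the HELM toolkit: the constants come from the abstract, while the function `scenario_coverage`, the model names, and the scenario names are hypothetical. It also assumes that the 17.9% and 96.0% figures are a mean, over models, of the fraction of core scenarios each model was evaluated on, which is one plausible reading of "models on average".

```python
from typing import Mapping, Set

# Counts stated in the abstract.
NUM_METRICS = 7
NUM_CORE_SCENARIOS = 16
NUM_TARGETED_SCENARIOS = 26
MEASURED_FRACTION = 0.875  # "when possible (87.5% of the time)"

# 16 core + 26 targeted scenarios gives the 42 scenarios in the abstract.
total_scenarios = NUM_CORE_SCENARIOS + NUM_TARGETED_SCENARIOS

# Every (metric, core scenario) pair that could in principle be measured.
possible_pairs = NUM_METRICS * NUM_CORE_SCENARIOS

# Number of pairs implied by the 87.5% figure.
measured_pairs = round(MEASURED_FRACTION * possible_pairs)


def scenario_coverage(evaluated: Mapping[str, Set[str]], core: Set[str]) -> float:
    """Mean fraction of core scenarios each model was evaluated on.

    Assumption: this mean-over-models definition is our reading of the
    abstract's coverage figures, not a formula taken from the paper.
    """
    if not evaluated or not core:
        return 0.0
    per_model = [len(scenarios & core) / len(core) for scenarios in evaluated.values()]
    return sum(per_model) / len(per_model)


# Toy data with hypothetical model and scenario names.
toy_core = {"s1", "s2", "s3", "s4"}
toy_evaluated = {
    "model_a": {"s1"},                 # 1 of 4 core scenarios
    "model_b": {"s1", "s2", "other"},  # 2 of 4; "other" is not a core scenario
}
toy_coverage = scenario_coverage(toy_evaluated, toy_core)

print(total_scenarios, possible_pairs, measured_pairs, toy_coverage)
```

Under these assumptions, the 87.5% figure corresponds to 98 of the 112 possible metric-scenario pairs, and the toy coverage is the average of 0.25 and 0.5.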
