OLMES: A Standard for Language Model Evaluations

Progress in AI is often demonstrated by new models claiming improved performance on tasks that measure model capabilities. Evaluating language models is particularly challenging, because small changes to how a model is evaluated on a task can lead to large changes in measured performance. There is no common standard setup, so different models are evaluated on the same tasks in different ways, and claims about which models perform best are not reproducible. We propose OLMES, a completely documented, practical, open standard for reproducible LLM evaluations. In developing this standard, we identify and review the varying factors in evaluation practices adopted by the community, such as details of prompt formatting, choice of in-context examples, probability normalizations, and task formulation. In particular, OLMES supports meaningful comparisons between smaller base models that require the unnatural "cloze" formulation of multiple-choice questions and larger models that can utilize the original formulation. OLMES includes well-considered recommendations guided by results from existing literature as well as new experiments investigating open questions.
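To make the point about probability normalization concrete, here is a minimal sketch of how a "cloze" formulation might score answer choices and how the normalization choice alone can change the predicted answer. The `Choice` class, the `score` and `predict` helpers, and the log-probability values are all hypothetical and chosen for illustration; they are not taken from OLMES or from any evaluation library.

```python
from dataclasses import dataclass


@dataclass
class Choice:
    text: str                    # answer text appended to the question
    token_logprobs: list[float]  # made-up per-token log-probs for that text


def score(choice: Choice, normalization: str) -> float:
    """Score one answer continuation under a given normalization."""
    total = sum(choice.token_logprobs)
    if normalization == "none":
        return total                               # raw sum of log-probs
    if normalization == "token":
        return total / len(choice.token_logprobs)  # average per token
    if normalization == "character":
        return total / len(choice.text)            # average per character
    raise ValueError(f"unknown normalization: {normalization}")


def predict(choices: list[Choice], normalization: str) -> int:
    """Return the index of the highest-scoring choice."""
    return max(range(len(choices)), key=lambda i: score(choices[i], normalization))


# Illustrative values only: a short answer and a longer paraphrase.
choices = [
    Choice("Paris", [-2.0]),
    Choice("the city of Paris", [-0.9, -0.8, -0.7, -0.6]),
]

for norm in ("none", "token", "character"):
    print(norm, predict(choices, norm))
```

With these made-up numbers, the unnormalized score favors the short answer, while both length-normalized scores favor the longer one. The same model outputs therefore yield different measured accuracy depending on a detail that is often left undocumented.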