Progress in AI is often demonstrated by new models claiming improved performance on tasks that measure model capabilities. Evaluating language models is particularly challenging, because small changes to how a model is evaluated on a task can lead to large changes in measured performance. There is no common standard setup, so different models are evaluated on the same tasks in different ways, and claims about which models perform best are not reproducible. We propose OLMES, a completely documented, practical, open standard for reproducible LLM evaluations. In developing this standard, we identify and review the varying factors in evaluation practices adopted by the community, such as details of prompt formatting, choice of in-context examples, probability normalizations, and task formulation. In particular, OLMES supports meaningful comparisons between smaller base models, which require the unnatural "cloze" formulation of multiple-choice questions, and larger models, which can utilize the original formulation. OLMES includes well-considered recommendations guided by results from existing literature as well as new experiments investigating open questions.
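To make the evaluation factors named in the abstract concrete, the following is a minimal sketch of the two task formulations (multiple-choice and "cloze") and of how the choice of probability normalization can change which answer is selected. It is an illustration only, not the OLMES implementation: the function names, the prompt templates, and the log-probability and token-count values are all hypothetical and chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Choice:
    text: str
    logprob: float   # summed log-probability of the answer text (made-up value)
    num_tokens: int  # token count of the answer text (made-up value)


def format_mcf(question, choice_texts):
    """Multiple-choice formulation: options are shown with labels and the
    model is expected to answer with a label. Template is hypothetical."""
    labels = "ABCDEFGH"
    lines = [f"Question: {question}"]
    lines += [f" {labels[i]}. {text}" for i, text in enumerate(choice_texts)]
    lines.append("Answer:")
    return "\n".join(lines)


def format_cloze(question):
    """Cloze formulation: options are not shown; each answer text is scored
    as a continuation of this prompt. Template is hypothetical."""
    return f"Question: {question}\nAnswer:"


def pick(choices, normalization="none"):
    """Return the index of the highest-scoring choice under a normalization."""
    def score(c):
        if normalization == "none":
            return c.logprob
        if normalization == "per_token":
            return c.logprob / c.num_tokens
        if normalization == "per_char":
            return c.logprob / len(c.text)
        raise ValueError(f"unknown normalization: {normalization}")

    return max(range(len(choices)), key=lambda i: score(choices[i]))


# Hypothetical scores: the short answer has the higher raw log-probability,
# but the longer answer wins once scores are divided by length.
choices = [
    Choice(text="Paris", logprob=-4.0, num_tokens=1),
    Choice(text="The city of Lyon", logprob=-6.0, num_tokens=4),
]

for norm in ("none", "per_token", "per_char"):
    print(norm, "->", choices[pick(choices, norm)].text)
```

With these invented numbers, the raw log-probability favours the shorter answer while both length-normalized scores favour the longer one, which is the kind of setup-dependent difference in measured performance that the abstract describes.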