Effective evaluation of language models remains an open challenge in NLP. Researchers and engineers face methodological issues such as the sensitivity of models to evaluation setup, the difficulty of making proper comparisons across methods, and the lack of reproducibility and transparency. In this paper we draw on three years of experience in evaluating large language models to provide guidance and lessons for researchers. First, we provide an overview of common challenges faced in language model evaluation. Second, we delineate best practices for addressing or lessening the impact of these challenges on research. Third, we present the Language Model Evaluation Harness (lm-eval): an open source library for independent, reproducible, and extensible evaluation of language models that seeks to address these issues. We describe the features of the library as well as case studies in which it has been used to alleviate these methodological concerns.