Modern language models (LMs) pose a new challenge for capability assessment. Static benchmarks inevitably saturate without providing confidence in the deployment tolerances of LM-based systems, yet developers nonetheless claim that their models have generalized traits such as reasoning or open-domain language understanding based on these flawed metrics. The science and practice of LMs require a new approach to benchmarking that measures specific capabilities with dynamic assessments. To be confident in our metrics, we need a new discipline of model metrology -- one that focuses on how to generate benchmarks that predict performance under deployment. Motivated by our evaluation criteria, we outline how building a community of model metrology practitioners -- one focused on building tools and studying how to measure system capabilities -- is the best way to meet these needs and add clarity to the AI discussion.