Comprehensive and accurate evaluation of general-purpose AI systems such as large language models enables effective mitigation of their risks and a deeper understanding of their capabilities. Current evaluation methodology, mostly based on benchmarks of specific tasks, falls short of adequately assessing these versatile AI systems: it lacks a scientific foundation for predicting their performance on unforeseen tasks and for explaining why their performance varies across specific task items or user inputs. Moreover, the reliability and validity of existing task-specific benchmarks are a growing concern. To tackle these challenges, we suggest transitioning from task-oriented evaluation to construct-oriented evaluation. Psychometrics, the science of psychological measurement, provides a rigorous methodology for identifying and measuring the latent constructs that underlie performance across multiple tasks. We discuss its merits, warn against potential pitfalls, and propose a framework to put it into practice. Finally, we explore future opportunities for integrating psychometrics with the evaluation of general-purpose AI systems.