Artificial intelligence (AI) has evolved from task-specific systems to general-purpose systems that trend toward human versatility. As AI systems begin to play pivotal roles in society, it is important to ensure that they are adequately evaluated. Current AI benchmarks typically assess performance on collections of specific tasks, an approach that has drawbacks when applied to general-purpose AI systems. First, it is difficult to predict whether an AI system could complete a new task that it has never seen or that did not previously exist. Second, these benchmarks often focus on overall performance metrics, potentially overlooking the finer details that are crucial for making informed decisions. Lastly, there are growing concerns about the reliability of existing benchmarks and questions about what they actually measure. To address these challenges, this paper suggests that psychometrics, the science of psychological measurement, should be placed at the core of evaluating general-purpose AI. Psychometrics provides a rigorous methodology for identifying and measuring the latent constructs that underlie performance across multiple tasks. We discuss its merits, warn against potential pitfalls, and propose a framework for putting it into practice. Finally, we explore future opportunities to integrate psychometrics with AI.