As machine learning models become more general, we need to characterise them in richer, more meaningful ways. We describe a method to infer the cognitive profile of a system from diverse experimental data. To do so, we introduce measurement layouts, which model how task-instance features interact with system capabilities to affect performance. Inferring capabilities from non-populational data requires triangulating these features in complex ways, a challenge for traditional psychometric and inferential tools. Using the Bayesian probabilistic programming library PyMC, we infer different cognitive profiles for agents in two scenarios: 68 actual contestants in the AnimalAI Olympics and 30 synthetic agents for O-PIAAGETS, an object permanence battery. We showcase the potential of capability-oriented evaluation.
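The core idea, that performance on a task instance depends on how an instance feature (a demand) compares with a latent capability, can be illustrated with a minimal sketch. This is not the measurement layouts of the paper and it does not use PyMC: it assumes a single capability, a logistic link, and a uniform prior, and it replaces PyMC's sampling with a plain grid approximation of the posterior. All function names, the slope parameter, and the simulated data are hypothetical.

```python
import math
import random


def success_probability(capability, demand, slope=1.0):
    """Assumed logistic link: success is likely when capability exceeds demand."""
    return 1.0 / (1.0 + math.exp(-slope * (capability - demand)))


def capability_posterior(demands, outcomes, grid):
    """Grid-approximate posterior over one capability under a uniform prior."""
    log_post = []
    for candidate in grid:
        log_lik = 0.0
        for demand, success in zip(demands, outcomes):
            p = success_probability(candidate, demand)
            log_lik += math.log(p if success else 1.0 - p)
        log_post.append(log_lik)
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_post)
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


def posterior_mean(grid, posterior):
    """Expected capability under the normalised grid posterior."""
    return sum(c * w for c, w in zip(grid, posterior))


# Simulated (not real) results for one agent: 200 instances of varying demand.
rng = random.Random(0)
true_capability = 2.0
demands = [rng.uniform(0.0, 4.0) for _ in range(200)]
outcomes = [rng.random() < success_probability(true_capability, d) for d in demands]

grid = [i * 0.05 for i in range(81)]  # candidate capabilities from 0.0 to 4.0
posterior = capability_posterior(demands, outcomes, grid)
estimate = posterior_mean(grid, posterior)
print(f"posterior mean capability: {estimate:.2f}")
```

Because the estimate comes from one agent's successes and failures across instances of known demand, no population of other agents is needed. A fuller model would combine several capabilities and features, which is where a probabilistic programming library becomes useful.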