AI evaluation has primarily focused on measuring capabilities, with formal approaches inspired by Item Response Theory (IRT) being increasingly applied. Yet propensities - the tendencies of models to exhibit particular behaviours - play a central role in determining both performance and safety outcomes. Traditional IRT describes a model's success on a task as a monotonic function of model capabilities and task demands, an approach unsuited to propensities, where both excess and deficiency can be problematic. Here, we introduce the first formal framework for measuring AI propensities, using a bilogistic formulation of model success that assigns high success probability only when the model's propensity falls within an "ideal band". We estimate the limits of this ideal band using LLMs equipped with newly developed task-agnostic rubrics. Applying our framework to six families of LLMs whose propensities are incited in either direction, we can measure how far each propensity is shifted and what effect this shift has on task performance. Critically, propensities estimated on one benchmark successfully predict behaviour on held-out tasks. Moreover, combining propensities and capabilities yields stronger predictive power than either alone. More broadly, our framework demonstrates how rigorous propensity measurement can be conducted and how it improves on capability evaluations alone for predicting AI behaviour.
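The abstract does not give the exact parameterization, but the bilogistic idea can be sketched as the product of two logistic curves: one rising at the lower edge of the ideal band and one falling at the upper edge, so success probability is high only inside the band. The function name, band bounds, and steepness parameter `k` below are illustrative assumptions, not the paper's notation.

```python
import math

def logistic(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def bilogistic_success(theta: float, lower: float, upper: float, k: float = 4.0) -> float:
    """Illustrative bilogistic success model (assumed form, not the paper's exact one).

    Success probability peaks when the propensity `theta` lies inside the
    ideal band [lower, upper]; both excess (theta >> upper) and deficiency
    (theta << lower) drive the probability toward zero. `k` controls how
    sharply probability drops at the band edges.
    """
    return logistic(k * (theta - lower)) * logistic(k * (upper - theta))

# A propensity inside the band scores higher than one outside on either side.
inside = bilogistic_success(0.5, 0.0, 1.0)
too_high = bilogistic_success(2.0, 0.0, 1.0)
too_low = bilogistic_success(-2.0, 0.0, 1.0)
```

Unlike a standard monotonic IRT link, this form penalizes deviation in both directions, which is the property the abstract argues propensities require.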