Evaluating the real-world capabilities of AI systems requires grounding benchmark performance in human-interpretable measures of task difficulty. Existing approaches that rely on direct human task completion time annotations are costly, noisy, and difficult to scale across benchmarks. In this work, we propose BRIDGE, a unified psychometric framework that learns the latent difficulty scale from model responses and anchors it to human task completion time. Using a two-parameter logistic Item Response Theory model, we jointly estimate latent task difficulty and model capability from model performance data across multiple benchmarks. We demonstrate that latent task difficulty varies linearly with the logarithm of human completion time, allowing human task completion time to be inferred for new benchmarks from model performance alone. Leveraging this alignment, we forecast frontier model capabilities in terms of human task length and independently reproduce METR's exponential scaling results, with the 50% solvable task horizon doubling approximately every 6 months.
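For reference, the two-parameter logistic (2PL) IRT setup and the log-time anchoring described above can be sketched as follows. The notation ($a_i$, $b_i$, $\theta_j$, $\alpha$, $\beta$) is the standard 2PL convention, not necessarily the paper's own parameterization, so this should be read as an illustration of the stated approach rather than its exact formulation:

\[
  \Pr\bigl(y_{ij} = 1 \mid \theta_j\bigr) = \sigma\bigl(a_i(\theta_j - b_i)\bigr),
  \qquad \sigma(x) = \frac{1}{1 + e^{-x}},
\]

where $y_{ij}$ indicates whether model $j$ solves task $i$, $\theta_j$ is the model's latent capability, $b_i$ the task's latent difficulty, and $a_i$ its discrimination. The anchoring step then posits a linear relation between latent difficulty and the logarithm of human completion time $t_i$,

\[
  b_i \approx \alpha + \beta \log t_i ,
\]

so that once $\alpha$ and $\beta$ are estimated on anchored benchmarks, the human-time scale of a new benchmark can be inferred from fitted difficulties alone.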