Artificial intelligence (AI) systems are deployed as collaborators in human decision-making. Yet evaluation practices focus primarily on model accuracy rather than on whether human-AI teams are prepared to collaborate safely and effectively. Empirical evidence shows that many failures arise from miscalibrated reliance, including over-reliance when the AI is wrong and under-reliance when it is helpful. This paper proposes a measurement framework for evaluating human-AI decision-making that centers on team readiness. We introduce a four-part taxonomy of evaluation metrics spanning outcomes, reliance behavior, safety signals, and learning over time, and connect these metrics to the Understand-Control-Improve (U-C-I) lifecycle of human-AI onboarding and collaboration. By operationalizing evaluation through interaction traces rather than model properties or self-reported trust, our framework enables deployment-relevant assessment of calibration, error recovery, and governance. We aim to support more comparable benchmarks and cumulative research on human-AI readiness, advancing safer and more accountable human-AI collaboration.
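As a minimal illustration of trace-based evaluation (a sketch, not the framework's implementation), the snippet below shows how one reliance-behavior metric, an appropriate-reliance rate, could be computed from logged interaction traces. The trace schema and function names are hypothetical and assumed for this example only.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical schema for one logged decision in an interaction trace.
@dataclass
class DecisionTrace:
    ai_correct: bool       # was the AI recommendation correct?
    human_followed: bool   # did the human accept the AI recommendation?

def appropriate_reliance_rate(traces: List[DecisionTrace]) -> float:
    """Fraction of decisions where reliance was calibrated:
    following the AI when it was right, overriding it when it was wrong."""
    if not traces:
        return 0.0
    calibrated = sum(1 for t in traces if t.human_followed == t.ai_correct)
    return calibrated / len(traces)

def overreliance_rate(traces: List[DecisionTrace]) -> float:
    """Fraction of wrong AI recommendations that the human still followed."""
    wrong = [t for t in traces if not t.ai_correct]
    if not wrong:
        return 0.0
    return sum(t.human_followed for t in wrong) / len(wrong)

# Example usage with a toy trace log.
if __name__ == "__main__":
    log = [
        DecisionTrace(ai_correct=True, human_followed=True),
        DecisionTrace(ai_correct=False, human_followed=True),   # over-reliance
        DecisionTrace(ai_correct=True, human_followed=False),   # under-reliance
        DecisionTrace(ai_correct=False, human_followed=False),
    ]
    print(f"appropriate reliance: {appropriate_reliance_rate(log):.2f}")
    print(f"over-reliance:        {overreliance_rate(log):.2f}")
```

The same trace log could feed the other metric families (outcomes, safety signals, learning over time) by aggregating additional fields; the point of the sketch is only that calibration is measured from observed behavior rather than from model properties or self-reported trust.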