The use of artificial intelligence (AI), or more generally data-driven algorithms, has become ubiquitous in today's society. Yet in many cases, especially when the stakes are high, humans still make the final decisions. The critical question, therefore, is whether AI helps humans make better decisions than either a human-alone or an AI-alone system. We introduce a new methodological framework to answer this question empirically with a minimal set of assumptions. We measure a decision maker's ability to make correct decisions using standard classification metrics based on the baseline potential outcome. We consider a single-blinded and unconfounded treatment assignment, in which the provision of AI-generated recommendations is randomized across cases while humans make the final decisions. Under this study design, we show how to compare the performance of three alternative decision-making systems: human-alone, human-with-AI, and AI-alone. Importantly, the AI-alone system encompasses any individualized treatment assignment rule, including rules not used in the original study. We also show when AI recommendations should be provided to a human decision maker, and when the human should follow such recommendations. We apply the proposed methodology to our own randomized controlled trial evaluating a pretrial risk assessment instrument. We find that the risk assessment recommendations do not improve the classification accuracy of a judge's decision to impose cash bail. Furthermore, we find that replacing a human judge with algorithms, in particular the risk assessment score and a large language model, leads to worse classification performance.