The use of Artificial Intelligence (AI), or more generally data-driven algorithms, has become ubiquitous in today's society. Yet, in many cases, and especially when stakes are high, humans still make the final decisions. The critical question, therefore, is whether AI helps humans make better decisions than either a human-alone or an AI-alone system. We introduce a new methodological framework that experimentally answers this question without additional assumptions. We measure a decision maker's ability to make correct decisions using standard classification metrics based on the baseline potential outcome. We consider a single-blinded experimental design, in which the provision of AI-generated recommendations is randomized across cases while humans make the final decisions. Under this experimental design, we show how to compare the performance of three alternative decision-making systems -- human-alone, human-with-AI, and AI-alone. We also show how to determine when to provide a human decision maker with AI recommendations and when the decision maker should follow them. We apply the proposed methodology to data from our own randomized controlled trial of a pretrial risk assessment instrument. We find that the risk assessment recommendations do not improve the classification accuracy of a judge's decision to impose cash bail. Our analysis also shows that risk assessment-alone decisions generally perform worse than human decisions with or without algorithmic assistance.