The use of artificial intelligence (AI), or more generally data-driven algorithms, has become ubiquitous in today's society. Yet in many cases, especially when the stakes are high, humans still make the final decisions. The critical question, therefore, is whether AI helps humans make better decisions than either a human-alone or an AI-alone system. We introduce a new methodological framework to answer this question empirically with a minimal set of assumptions. We measure a decision maker's ability to make correct decisions using standard classification metrics based on the baseline potential outcome. We consider a single-blinded and unconfounded treatment assignment, in which the provision of AI-generated recommendations is randomized across cases while humans make the final decisions. Under this study design, we show how to compare the performance of three alternative decision-making systems: human-alone, human-with-AI, and AI-alone. Importantly, the AI-alone system encompasses any individualized treatment assignment rule, including rules not used in the original study. We also show when AI recommendations should be provided to a human decision maker, and when the human should follow such recommendations. We apply the proposed methodology to our own randomized controlled trial evaluating a pretrial risk assessment instrument. We find that the risk assessment recommendations do not improve the classification accuracy of a judge's decision to impose cash bail. Furthermore, we find that replacing a human judge with algorithms, in particular the risk assessment score and a large language model, leads to worse classification performance.