Auditing for Human Expertise

High-stakes prediction tasks (e.g., patient diagnosis) are often handled by trained human experts. A common source of concern about automation in these settings is that experts may exercise intuition that is difficult to model and/or have access to information (e.g., conversations with a patient) that is simply unavailable to a would-be algorithm. This raises the natural question of whether human experts add value that could not be captured by an algorithmic predictor. We develop a statistical framework under which this question can be posed as a natural hypothesis test. As our framework highlights, detecting human expertise is more subtle than simply comparing the accuracy of expert predictions to those made by a particular learning algorithm. Instead, we propose a simple procedure which tests whether expert predictions are statistically independent of the outcomes of interest after conditioning on the available inputs ('features'). A rejection by our test thus suggests that human experts may add value to any algorithm trained on the available data, and has direct implications for whether human-AI 'complementarity' is achievable in a given prediction task. We highlight the utility of our procedure using admissions data collected from the emergency department of a large academic hospital system, where we show that physicians' admit/discharge decisions for patients with acute gastrointestinal bleeding (AGIB) appear to incorporate information not captured in a standard algorithmic screening tool. This is despite the fact that the screening tool is arguably more accurate than physicians' discretionary decisions, highlighting that, even absent normative concerns about accountability or interpretability, accuracy is insufficient to justify algorithmic automation.
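
To make the conditional-independence idea concrete, the following is a minimal sketch of one standard way to test it, and not necessarily the procedure the paper itself uses. It assumes the features are discrete and hashable, so that rows with identical features form a stratum, and it assumes 0/1 loss as the test statistic. The function name conditional_permutation_test and all of its parameters are hypothetical. Under the null hypothesis, shuffling expert predictions among cases with identical features leaves their joint distribution with the outcomes unchanged, so an observed loss far below the shuffled losses is evidence that the expert is using information beyond the features.

```python
import random
from collections import defaultdict


def conditional_permutation_test(features, expert_preds, outcomes,
                                 n_permutations=1000, seed=0):
    """Approximate p-value for H0: expert_preds is independent of outcomes
    given features.

    Assumption (illustrative only): features are discrete and hashable, so
    rows sharing the same feature value can be treated as exchangeable.
    """
    rng = random.Random(seed)

    # Group row indices into strata of identical feature values.
    strata = defaultdict(list)
    for i, x in enumerate(features):
        strata[x].append(i)

    def n_mismatches(preds):
        # 0/1 loss: number of cases where the prediction misses the outcome.
        return sum(p != y for p, y in zip(preds, outcomes))

    observed = n_mismatches(expert_preds)

    at_least_as_good = 0
    for _ in range(n_permutations):
        shuffled = list(expert_preds)
        # Shuffle expert predictions only within each stratum.
        for idx in strata.values():
            vals = [expert_preds[i] for i in idx]
            rng.shuffle(vals)
            for i, v in zip(idx, vals):
                shuffled[i] = v
        if n_mismatches(shuffled) <= observed:
            at_least_as_good += 1

    # Add-one correction keeps the p-value strictly positive.
    return (1 + at_least_as_good) / (1 + n_permutations)


if __name__ == "__main__":
    # Toy data: two feature strata; outcomes vary within each stratum.
    features = [0] * 20 + [1] * 20
    outcomes = [i % 2 for i in range(40)]

    # An expert who tracks outcomes beyond what the features explain.
    informed = list(outcomes)
    print(conditional_permutation_test(features, informed, outcomes))

    # An expert whose prediction is a function of the features alone.
    feature_only = list(features)
    print(conditional_permutation_test(features, feature_only, outcomes))
```

A small p-value for the first expert and a large one for the second illustrates why the test differs from an accuracy comparison: it asks whether predictions carry signal about the outcome within groups of cases that look identical to the algorithm. With continuous or high-dimensional features, exact-match strata are rarely available, and some form of approximate matching would be needed in their place.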
