Is explainability a false promise? This debate has emerged from insufficient evidence that explanations help people in the situations for which they are introduced. More human-centered, application-grounded evaluations of explanations are needed to settle it. Yet, with no established guidelines for such studies in NLP, researchers accustomed to standardized proxy evaluations must discover appropriate measurements, tasks, datasets, and sensible models for human-AI teams in their studies. To aid with this, we first review existing metrics suitable for application-grounded evaluation. We then establish criteria for selecting appropriate datasets and, applying them, find that only 4 out of over 50 datasets available for explainability research in NLP meet these criteria. We then demonstrate the importance of reassessing the state of the art when forming and studying human-AI teams: teaming people with models for certain tasks might only now start to make sense, while for others it remains unsound. Finally, we present exemplar studies of human-AI decision-making for one of the identified tasks -- verifying the correctness of a legal claim given a contract. Our results show that providing AI predictions, with or without explanations, does not help decision makers speed up their work without compromising performance. We argue for revisiting the setup of human-AI teams and improving the automatic deferral of instances to AI, where explanations could play a useful role.