Is explainability a false promise? This debate has emerged from the insufficient evidence that explanations help people in the situations they are introduced for. More human-centered, application-grounded evaluations of explanations are needed to settle it. Yet, with no established guidelines for such studies in NLP, researchers accustomed to standardized proxy evaluations must discover appropriate measurements, tasks, datasets, and sensible models for human-AI teams in their studies. To aid with this, we first review existing metrics suitable for application-grounded evaluation. We then establish requirements that datasets must fulfill to enable such evaluations. Among over 50 datasets available for explainability research in NLP, we find that 4 meet our criteria. By finetuning Flan-T5-3B, we demonstrate the importance of reassessing the state of the art to form and study human-AI teams. Finally, we present exemplar studies of human-AI decision-making for one of the identified suitable tasks -- verifying the correctness of a legal claim given a contract.