Most existing evaluations of explainable machine learning (ML) methods rely on simplifying assumptions or proxies that do not reflect real-world use cases; the handful of more robust evaluations in real-world settings have shortcomings in their design, which limit the conclusions that can be drawn about the methods' real-world utility. In this work, we seek to bridge this gap by conducting a study that evaluates three popular explainable ML methods in a setting consistent with the intended deployment context. We build on a previous study on e-commerce fraud detection and make crucial modifications to its setup, relaxing the simplifying assumptions in the original work that departed from the deployment context. In doing so, we draw drastically different conclusions from the earlier work and find no evidence for the incremental utility of the tested methods in the task. Our results highlight how seemingly trivial experimental design choices can yield misleading conclusions, and they offer lessons about the necessity of not only evaluating explainable ML methods using tasks, data, users, and metrics grounded in the intended deployment contexts but also developing methods tailored to specific applications. In addition, we believe the design of this experiment can serve as a template for future studies evaluating explainable ML methods in other real-world contexts.