Machine learning from explanations (MLX) is an approach to learning that uses human-provided explanations of relevant or irrelevant features for each input to ensure that model predictions are right for the right reasons. Existing MLX approaches rely on local model interpretation methods and require strong model smoothing to align model and human explanations, which leads to sub-optimal performance. We recast MLX as a robustness problem, in which human explanations specify a lower-dimensional manifold from which perturbations can be drawn, and show both theoretically and empirically how this approach alleviates the need for strong model smoothing. We consider various approaches to achieving robustness, each of which improves performance over prior MLX methods. Finally, we show how to combine robustness with an earlier MLX method, yielding state-of-the-art results on both synthetic and real-world benchmarks.
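To make the robustness view concrete, the following is a minimal sketch, not the method of the paper. It assumes the human explanation is given as a binary mask over features marked irrelevant, draws random perturbations restricted to those features, and measures how much a model's prediction changes. The function names, the uniform sampling scheme, the squared-change penalty, and the toy linear models are all illustrative assumptions.

```python
import numpy as np


def sample_masked_perturbations(x, irrelevant_mask, eps, n_samples, rng):
    """Draw perturbed copies of x that differ only on features marked irrelevant.

    Assumption: irrelevant_mask is 1.0 where the human explanation marks a
    feature as irrelevant and 0.0 elsewhere.
    """
    noise = rng.uniform(-eps, eps, size=(n_samples, x.shape[0]))
    # Features with mask == 0 are left untouched.
    return x + noise * irrelevant_mask


def robustness_penalty(predict, x, irrelevant_mask, eps=1.0, n_samples=64, seed=0):
    """Mean squared change in the prediction under masked perturbations.

    This sampling-based penalty is a hypothetical stand-in for a robustness
    objective; it is not taken from the source.
    """
    rng = np.random.default_rng(seed)
    x_perturbed = sample_masked_perturbations(x, irrelevant_mask, eps, n_samples, rng)
    baseline = predict(x[None, :])
    return float(np.mean((predict(x_perturbed) - baseline) ** 2))


# Toy linear models: the first ignores the irrelevant features, the second relies on them.
w_right_reasons = np.array([2.0, -1.0, 0.0, 0.0])
w_wrong_reasons = np.array([2.0, -1.0, 3.0, 0.5])

x = np.array([0.5, 1.0, -0.3, 0.8])
mask = np.array([0.0, 0.0, 1.0, 1.0])  # last two features marked irrelevant

penalty_right = robustness_penalty(lambda X: X @ w_right_reasons, x, mask)
penalty_wrong = robustness_penalty(lambda X: X @ w_wrong_reasons, x, mask)
```

In this toy setting the penalty vanishes for the model that ignores the masked features and is strictly positive for the model that depends on them. Adding such a term to a training loss would be one possible way to discourage reliance on features the explanation marks as irrelevant.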