DARE: Towards Robust Text Explanations in Biomedical and Healthcare Applications

As deep neural networks are deployed in more application domains, the need to unravel their black-box nature has grown significantly. Several methods have been introduced to provide insight into the inference process of deep neural networks. However, most of these explainability methods have been shown to be brittle under adversarial perturbations of their inputs in the image and generic text domains. In this work, we show that this phenomenon extends to specific, important, high-stakes domains such as biomedical datasets. In particular, we observe that the robustness of explanations should be characterized in terms of two properties: faithfulness, the accuracy of the explanation in linking a model's inputs to its decisions, and plausibility, its relevance from the perspective of domain experts. This is crucial to prevent explanations that are inaccurate but still look convincing in the context of the domain at hand. To this end, we show how to adapt current attribution robustness estimation methods to a given domain so as to take domain-specific plausibility into account. The result is our DomainAdaptiveAREstimator (DARE), an attribution robustness estimator that allows us to properly characterize the domain-specific robustness of faithful explanations. Next, we provide two methods, adversarial training and FAR training, to mitigate the brittleness characterized by DARE, allowing us to train networks that display robust attributions. Finally, we empirically validate our methods with extensive experiments on three established biomedical benchmarks.
