While recent work has considerably improved the quality of the natural language explanations (NLEs) that models generate to justify their predictions, there is very little research on detecting and alleviating inconsistencies among generated NLEs. In this work, we leverage external knowledge bases to significantly improve on an existing adversarial attack for detecting inconsistent NLEs. We apply our attack to high-performing NLE models and show that models with higher NLE quality do not necessarily generate fewer inconsistencies. Moreover, we propose an off-the-shelf mitigation method that alleviates inconsistencies by grounding the model in external background knowledge. Our method decreases the number of inconsistencies, as detected by our attack, in previous high-performing NLE models.