Faithfulness is arguably the most critical metric for assessing the reliability of explainable AI. In NLP, current methods for faithfulness evaluation are fraught with discrepancies and biases, often failing to capture a model's true reasoning. We introduce Adversarial Sensitivity, a novel approach to faithfulness evaluation that focuses on the explainer's response when the model is under adversarial attack. Our method assesses the faithfulness of explainers by capturing their sensitivity to adversarial input changes. This work addresses significant limitations in existing evaluation techniques and quantifies faithfulness from a crucial yet underexplored paradigm.