In many applications, interpretability of machine learning models is highly desirable but difficult to achieve. Local model-agnostic approaches have been proposed to explain the individual predictions of such models. However, the process that generates an explanation can be as mysterious to the user as the prediction it is meant to explain. Furthermore, interpretability methods often lack theoretical guarantees, and their behavior on simple models is frequently unknown. While it is difficult, if not impossible, to ensure that an explainer behaves as expected on a cutting-edge model, we can at least verify that it behaves as expected on simple, already interpretable models. In this paper, we present a theoretical analysis of Anchors (Ribeiro et al., 2018), a popular rule-based interpretability method that highlights a small set of words to explain a text classifier's decision. After formalizing its algorithm and providing useful insights, we demonstrate mathematically that Anchors produces meaningful results when used with linear text classifiers on top of a TF-IDF vectorization. We believe that our analysis framework can aid in the development of new explainability methods based on solid theoretical foundations.
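To make the setting concrete, the following minimal sketch illustrates what it means for a small set of words to "anchor" the decision of a linear classifier on TF-IDF features. It is not the Anchors algorithm itself: the vocabulary, the `IDF`, `WEIGHTS`, and `BIAS` values, and the helper names `predict` and `precision` are hypothetical, TF-IDF is left unnormalized, and perturbations are enumerated exhaustively (each non-anchor word is either kept or dropped) rather than sampled as an actual implementation would do.

```python
from itertools import product

# Hypothetical toy model: all numbers below are made up for illustration.
IDF = {"great": 1.5, "movie": 1.0, "boring": 2.0, "plot": 1.2}
WEIGHTS = {"great": 2.0, "movie": 0.1, "boring": -0.5, "plot": -0.1}
BIAS = -1.0


def predict(words):
    """Linear classifier on TF-IDF features: 1 if w . tfidf(words) + b > 0, else 0."""
    score = BIAS
    for word in set(words):
        if word in IDF:
            tf = words.count(word)  # raw term frequency (no normalization)
            score += WEIGHTS[word] * tf * IDF[word]
    return 1 if score > 0 else 0


def precision(document, anchor):
    """Fraction of perturbed documents that keep the original prediction.

    Words at the positions listed in `anchor` are always kept; every other
    word is either kept or dropped, and all combinations are enumerated.
    """
    target = predict(document)
    free = [i for i in range(len(document)) if i not in anchor]
    agree = total = 0
    for keep in product([True, False], repeat=len(free)):
        kept = set(anchor) | {i for i, k in zip(free, keep) if k}
        sample = [w for i, w in enumerate(document) if i in kept]
        agree += predict(sample) == target
        total += 1
    return agree / total


document = ["great", "movie", "boring", "plot"]
for position, word in enumerate(document):
    print(word, precision(document, {position}))
```

In this toy example, the only single word whose presence preserves the prediction under every perturbation is the one with the largest positive product of coefficient and inverse document frequency, which is the kind of behavior one would expect from a meaningful explanation of a linear model.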