Anchors (Ribeiro et al., 2018) is a post-hoc, rule-based interpretability method. For text data, it explains a decision by highlighting a small set of words (an anchor) such that the model being explained produces similar outputs whenever these words are present in a document. In this paper, we present the first theoretical analysis of Anchors, under the assumption that the search for the best anchor is exhaustive. After formalizing the algorithm for text classification, we present explicit results for several classes of models when the vectorization step is TF-IDF and removed words are replaced by a fixed out-of-dictionary token. Our inquiry covers models such as elementary if-then rules and linear classifiers. We then leverage this analysis to gain insight into the behavior of Anchors for arbitrary differentiable classifiers. For neural networks, we show empirically that Anchors selects the words corresponding to the highest partial derivatives of the model with respect to the input, reweighted by the inverse document frequencies.
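To make the setting concrete, the following is a minimal sketch of an exhaustive anchor search on a toy linear classifier. It is not the reference implementation of Anchors: the vocabulary, the IDF values, the coefficients, the `UNK` token name, and all function names are hypothetical, the TF-IDF is left unnormalized for simplicity, and the selection rule is simplified to "shortest anchor whose precision reaches 1 - eps, ties broken by precision". Precision is computed exactly by enumerating every way of replacing the non-anchor words, each word being replaced independently with probability 1/2 (an assumption of this sketch).

```python
from itertools import combinations, product

UNK = "UNK"  # fixed out-of-dictionary token (hypothetical name)

# Hypothetical toy vocabulary: inverse document frequencies and linear coefficients.
IDF = {"great": 2.0, "movie": 1.0, "boring": 2.5, "the": 0.2, "plot": 1.2}
COEF = {"great": 1.5, "movie": 0.1, "boring": -2.0, "the": 0.0, "plot": 0.3}
INTERCEPT = -1.0


def predict(words):
    """Linear classifier on unnormalized TF-IDF; UNK has no feature and contributes 0."""
    score = INTERCEPT + sum(COEF[w] * IDF[w] for w in words if w in IDF)
    return int(score > 0)


def precision(doc, anchor):
    """Exact probability that the prediction is unchanged when the anchor
    positions are kept and every other word is replaced by UNK with prob. 1/2."""
    free = [i for i in range(len(doc)) if i not in anchor]
    target = predict(doc)
    hits = 0
    for mask in product([False, True], repeat=len(free)):
        perturbed = list(doc)
        for i, drop in zip(free, mask):
            if drop:
                perturbed[i] = UNK
        hits += predict(perturbed) == target
    return hits / 2 ** len(free)


def exhaustive_anchor(doc, eps=0.05):
    """Simplified rule: return the shortest anchor with precision >= 1 - eps."""
    for size in range(1, len(doc) + 1):
        scored = [(precision(doc, set(a)), a)
                  for a in combinations(range(len(doc)), size)]
        valid = [(p, a) for p, a in scored if p >= 1 - eps]
        if valid:
            best_p, best = max(valid, key=lambda t: t[0])
            return [doc[i] for i in best], best_p
    return list(doc), 1.0


def rank_by_gradient_times_idf(doc):
    """For a linear model the partial derivative w.r.t. a TF-IDF feature is its
    coefficient, so the score is coefficient times IDF (document predicted positive)."""
    return sorted(set(doc), key=lambda w: COEF.get(w, 0.0) * IDF.get(w, 0.0),
                  reverse=True)


doc = ["the", "movie", "great", "plot"]
anchor, prec = exhaustive_anchor(doc)
ranking = rank_by_gradient_times_idf(doc)
print(anchor, prec, ranking)
```

On this toy document, which the classifier labels positive, the search returns the single word `great` with precision 1.0, and `great` is also the top word under the coefficient-times-IDF ranking. This only illustrates the kind of agreement described above on one hand-built example; it is not evidence for the general claim.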