Feature attributions attempt to highlight which inputs drive predictive power. Good attributions or explanations are thus those whose selected inputs retain this predictive power; accordingly, evaluations of explanations score how well the explanation predicts. However, for a class of explanations called encoding explanations, evaluations produce scores better than what appears possible from the values in the explanation alone. Probing for encoding remains a challenge because there is no general characterization of where the extra predictive power comes from. We develop a definition of encoding that identifies this extra predictive power via conditional dependence and show that the definition fits existing examples of encoding. In contrast to encoding explanations, the definition implies that non-encoding explanations contain all the informative inputs used to produce the explanation, giving them a "what you see is what you get" property that makes them transparent and simple to use. Next, we prove that existing scores (ROAR, FRESH, EVAL-X) do not rank non-encoding explanations above encoding ones, and we develop STRIPE-X, which ranks them correctly. After empirically demonstrating the theoretical insights, we use STRIPE-X to show that, despite being prompted to produce non-encoding explanations for a sentiment analysis task, an LLM generates explanations that encode.