Support-Contra Asymmetry in LLM Explanations

Large language models (LLMs) increasingly produce natural language explanations alongside their predictions, yet it remains unclear whether these explanations reference predictive cues present in the input text. In this work, we present an empirical study of how LLM-generated explanations align with predictive lexical evidence from an external model in text classification tasks. To analyze this relationship, we compare explanation content against interpretable feature importance signals extracted from transparent linear classifiers. These reference models allow us to partition predictive lexical cues into supporting and contradicting evidence relative to the predicted label. Across three benchmark datasets (WIKIONTOLOGY, AG NEWS, and IMDB), we observe a consistent empirical pattern that we term support-contra asymmetry. Explanations accompanying correct predictions tend to reference more supporting lexical cues and fewer contradicting cues, whereas explanations accompanying incorrect predictions reference substantially more contradicting evidence. This pattern holds across datasets, across reference model families (logistic regression and linear SVM), and across multiple feature retrieval depths. These results suggest that, when predictions are correct, LLM explanations often reflect lexical signals that are predictive for the task, while incorrect predictions are more frequently accompanied by explanations that reference misleading cues present in the input. Our findings provide a simple empirical perspective on explanation-evidence alignment and illustrate how external sources of predictive evidence can be used to analyze the behavior of LLM-generated explanations.
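
The partition of lexical cues into supporting and contradicting evidence can be illustrated with a minimal sketch. It is not the procedure used in the study: the abstract does not specify tokenization, cue retrieval, or matching. The sketch assumes a dictionary of linear-classifier weights expressed relative to the predicted label (positive values push toward it, negative values push away), a retrieval depth k, and exact token overlap between cues and explanation. All function names, weights, and example texts are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase and split into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def top_k_cues(weights, tokens, k):
    """Return the k input tokens with the largest absolute weight."""
    present = {t for t in tokens if t in weights}
    # Sort by descending |weight|; break ties alphabetically for determinism.
    ranked = sorted(present, key=lambda t: (-abs(weights[t]), t))
    return ranked[:k]


def partition_cues(weights, tokens, k):
    """Split the top-k cues into supporting and contradicting sets.

    `weights` is assumed to be relative to the predicted label:
    positive supports the prediction, negative contradicts it.
    """
    cues = top_k_cues(weights, tokens, k)
    support = {t for t in cues if weights[t] > 0}
    contra = {t for t in cues if weights[t] < 0}
    return support, contra


def cue_counts(input_text, explanation, weights, k=5):
    """Count supporting and contradicting cues mentioned in the explanation."""
    support, contra = partition_cues(weights, tokenize(input_text), k)
    mentioned = set(tokenize(explanation))
    return len(support & mentioned), len(contra & mentioned)


# Hypothetical weights for a predicted "positive" sentiment label.
weights_for_positive = {
    "great": 1.2, "moving": 0.8, "boring": -1.5, "slow": -0.6, "plot": 0.1,
}
review = "A great and moving film, although the plot is slow."
explanation = "The review calls the film great and moving, despite a slow plot."

n_support, n_contra = cue_counts(review, explanation, weights_for_positive, k=3)
print(n_support, n_contra)
```

Under these assumptions, comparing the two counts between correctly and incorrectly predicted examples, and repeating the comparison for several values of k, is one way to look for the asymmetry described in the abstract.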
