This paper investigates the reliability of explanations generated by large language models (LLMs) when prompted to explain their previous output. We evaluate two kinds of such self-explanations - extractive and counterfactual - using three state-of-the-art LLMs (2B to 8B parameters) on two different classification tasks (objective and subjective). Our findings reveal that, while these self-explanations can correlate with human judgement, they do not fully and accurately follow the model's decision process, indicating a gap between perceived and actual model reasoning. We show that this gap can be bridged: prompting LLMs for counterfactual explanations can produce faithful, informative, and easy-to-verify results. These counterfactuals offer a promising alternative to traditional explainability methods (e.g., SHAP, LIME), provided that prompts are tailored to specific tasks and checked for validity.