Instruction-tuned LLMs can provide users with \textit{an} explanation of their output by generating self-explanations, without requiring the application of complex interpretability techniques. In this paper, we analyse whether this ability results in a \textit{good} explanation. We evaluate self-explanations in the form of input rationales with respect to their plausibility to humans. We study three text classification tasks: sentiment classification, forced labour detection and claim verification. We include Danish and Italian translations of the sentiment classification task and compare self-explanations to human rationale annotations; for the claim verification task, we collect such annotations ourselves for Climate-Fever. We further evaluate the faithfulness of human and self-explanation rationales with respect to correct model predictions, and extend the study by incorporating post-hoc attribution-based explanations. We analyse four open-weight LLMs and find that alignment between self-explanations and human rationales depends strongly on text length and task complexity. Nevertheless, self-explanations yield faithful subsets of token-level rationales, whereas post-hoc attribution methods tend to emphasize structural and formatting tokens, reflecting fundamentally different explanation strategies.
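To make the plausibility evaluation mentioned above concrete, the sketch below scores agreement between a model's self-explanation rationale and a human rationale annotation as token-level precision, recall, and F1 over the sets of highlighted token indices. The metric choice, the function name \texttt{rationale\_overlap}, and the example inputs are illustrative assumptions, not the paper's exact protocol.

\begin{verbatim}
# Minimal sketch (assumption): plausibility as token-level overlap between
# a self-explanation rationale and a human rationale annotation.
# Metric and names are illustrative; the paper's protocol may differ.

def rationale_overlap(pred_tokens: set, gold_tokens: set) -> dict:
    """Precision, recall and F1 over sets of token indices marked as rationale."""
    if not pred_tokens and not gold_tokens:
        return {"precision": 1.0, "recall": 1.0, "f1": 1.0}
    tp = len(pred_tokens & gold_tokens)
    precision = tp / len(pred_tokens) if pred_tokens else 0.0
    recall = tp / len(gold_tokens) if gold_tokens else 0.0
    denom = precision + recall
    f1 = (2 * precision * recall / denom) if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical example: token indices marked as rationale by the model
    # vs. by a human annotator.
    self_explanation = {2, 3, 7, 8}
    human_annotation = {3, 7, 8, 9}
    print(rationale_overlap(self_explanation, human_annotation))
\end{verbatim}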