Instruction-tuned LLMs can explain their outputs to users by generating self-explanations, which require neither gradient computations nor the application of potentially complex XAI methods. In this paper, we analyse whether this ability yields good explanations by evaluating self-explanations in the form of input rationales with respect to their plausibility to humans as well as their faithfulness to the model. For this, we use two text classification tasks: sentiment classification and forced labour detection. In addition to English, we include Danish and Italian translations of the sentiment classification task and compare self-explanations to human annotations for all samples. To allow a direct comparison, we also compute post-hoc feature attributions with layer-wise relevance propagation (LRP) and apply this pipeline to four LLMs (Llama2, Llama3, Mistral and Mixtral). Our results show that self-explanations align more closely with human annotations than LRP does, while maintaining a comparable level of faithfulness.
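As a minimal sketch of how plausibility against human annotations could be scored, the snippet below computes token-level overlap (IoU and F1) between a model-generated rationale and a human-annotated rationale. The metric choice, function name, and example tokens are illustrative assumptions, not the paper's exact evaluation protocol.

```python
# Hypothetical sketch: plausibility as token-level overlap between a
# self-explanation rationale and a human-annotated rationale.

def rationale_overlap(model_rationale: set[str], human_rationale: set[str]) -> dict[str, float]:
    """Compute IoU and F1 between model-selected and human-annotated rationale tokens."""
    if not model_rationale and not human_rationale:
        return {"iou": 1.0, "f1": 1.0}
    intersection = model_rationale & human_rationale
    union = model_rationale | human_rationale
    iou = len(intersection) / len(union) if union else 0.0
    precision = len(intersection) / len(model_rationale) if model_rationale else 0.0
    recall = len(intersection) / len(human_rationale) if human_rationale else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"iou": iou, "f1": f1}


# Example: tokens an LLM marked as relevant for a positive sentiment prediction
# vs. tokens highlighted by a human annotator (made-up data for illustration).
model_tokens = {"great", "acting", "wonderful"}
human_tokens = {"great", "wonderful", "story"}
print(rationale_overlap(model_tokens, human_tokens))
```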