Instruction-tuned Large Language Models (LLMs) excel at many tasks and can even explain their reasoning in so-called self-explanations. However, convincing but wrong self-explanations can lead to unsupported confidence in LLMs, thus increasing risk. Therefore, it is important to measure whether self-explanations truly reflect the model's behavior. Such a measure is called interpretability-faithfulness and is challenging to compute, since the ground-truth explanation is inaccessible and many LLMs are only available through an inference API. To address this, we propose employing self-consistency checks to measure faithfulness. For example, if an LLM says a set of words is important for making a prediction, then it should not be able to make the same prediction without these words. While self-consistency checks are a common approach to faithfulness, they have not previously been successfully applied to LLM self-explanations for counterfactual, importance measure, and redaction explanations. Our results demonstrate that faithfulness is explanation-, model-, and task-dependent, showing that self-explanations should not be trusted in general. For example, with sentiment classification, counterfactuals are more faithful for Llama2, importance measures for Mistral, and redaction for Falcon 40B.
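To make the redaction-style self-consistency check concrete, here is a minimal sketch under assumed names: `query_llm` is a hypothetical wrapper around an inference API, and the prompts are illustrative rather than the exact prompts used in the paper.

```python
def query_llm(prompt: str) -> str:
    """Hypothetical wrapper around an LLM inference API; plug in your own client here."""
    raise NotImplementedError


def redaction_consistency_check(text: str) -> bool:
    """Return True if the self-explanation passes the redaction check.

    1. Ask the model to classify the text.
    2. Ask the model which words were important for that prediction.
    3. Redact those words and classify again.
    If the claimed-important words are truly important, the prediction
    should change once they are removed.
    """
    prediction = query_llm(
        f"Classify the sentiment of this review as positive or negative:\n{text}"
    ).strip().lower()

    important = query_llm(
        f"Which words were most important for your sentiment prediction of:\n{text}\n"
        "Answer with a comma-separated list of words only."
    )
    important_words = {w.strip().lower() for w in important.split(",") if w.strip()}

    redacted = " ".join(
        "[REDACTED]" if w.lower().strip(".,!?") in important_words else w
        for w in text.split()
    )
    redacted_prediction = query_llm(
        f"Classify the sentiment of this review as positive or negative:\n{redacted}"
    ).strip().lower()

    # A faithful importance explanation implies the redacted input no longer
    # supports the original prediction.
    return redacted_prediction != prediction
```

Analogous checks can be written for counterfactual explanations (the edited input should flip the prediction) and for redaction explanations (the model's own redaction should remove the information needed for its prediction).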