Instruction-tuned Large Language Models (LLMs) excel at many tasks and can even explain their reasoning in so-called self-explanations. However, convincing but wrong self-explanations can lead to unsupported confidence in LLMs, thus increasing risk. Therefore, it is important to measure whether self-explanations truly reflect the model's behavior. Such a measure is called interpretability-faithfulness and is challenging to compute, since the ground-truth explanation is inaccessible and many LLMs are only available through an inference API. To address this, we propose employing self-consistency checks to measure faithfulness. For example, if an LLM says a set of words is important for making a prediction, then it should not be able to make the same prediction without these words. While self-consistency checks are a common approach to faithfulness, they have not previously been successfully applied to LLM self-explanations for counterfactual, importance measure, and redaction explanations. Our results demonstrate that faithfulness is explanation-, model-, and task-dependent, showing that self-explanations should not be trusted in general. For example, with sentiment classification, counterfactuals are more faithful for Llama2, importance measures for Mistral, and redaction for Falcon 40B.
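To make the redaction-style self-consistency check concrete, here is a minimal sketch under assumed names: `query_llm` is a hypothetical wrapper around an inference API, and the prompts are illustrative rather than the exact prompts used in the paper.

```python
def query_llm(prompt: str) -> str:
    """Hypothetical wrapper around an LLM inference API; plug in your own client here."""
    raise NotImplementedError


def redaction_consistency_check(text: str) -> bool:
    """Return True if the self-explanation passes the redaction check.

    1. Ask the model to classify the text.
    2. Ask the model which words were important for that prediction.
    3. Redact those words and classify again.
    If the claimed-important words are truly important, the prediction
    should change once they are removed.
    """
    prediction = query_llm(
        f"Classify the sentiment of this review as positive or negative:\n{text}"
    ).strip().lower()

    important = query_llm(
        f"Which words were most important for your sentiment prediction of:\n{text}\n"
        "Answer with a comma-separated list of words only."
    )
    important_words = {w.strip().lower() for w in important.split(",") if w.strip()}

    redacted = " ".join(
        "[REDACTED]" if w.lower().strip(".,!?") in important_words else w
        for w in text.split()
    )
    redacted_prediction = query_llm(
        f"Classify the sentiment of this review as positive or negative:\n{redacted}"
    ).strip().lower()

    # A faithful importance explanation implies the redacted input no longer
    # supports the original prediction.
    return redacted_prediction != prediction
```

Analogous checks can be written for counterfactual explanations (the edited input should flip the prediction) and for redaction explanations (the model's own redaction should remove the information needed for its prediction).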