Evaluating the False Trust engendered by LLM Explanations

Large Language Models (LLMs) and Large Reasoning Models (LRMs) are increasingly used for critical tasks, yet they provide no guarantees about the correctness of their solutions. Users must decide whether to trust a model's answer, aided by reasoning traces, summaries of those traces, or explanations generated post hoc. Reasoning traces are often interpreted as provenance explanations, despite evidence that they are neither faithful representations of the model's computations nor necessarily semantically meaningful. It is unclear whether explanations or reasoning traces help users identify when the AI is incorrect, or whether they simply persuade users to trust the AI regardless. In this paper, we take a user-centered approach and develop an evaluation protocol to study how different explanation types affect users' ability to judge the correctness of AI-generated answers and how much false trust they engender. We conduct a between-subjects user study that simulates a setting in which users have no means of verifying the solution, and we analyze the false trust engendered by commonly used LLM explanations: reasoning traces, their summaries, and post-hoc explanations. We also test a contrastive dual-explanation setting in which we present arguments both for and against the AI's answer. We find that reasoning traces and post-hoc explanations are persuasive but not informative: they increase user acceptance of LLM predictions regardless of their correctness. In contrast, the dual explanation is the only condition that genuinely improves users' ability to distinguish correct from incorrect AI outputs.
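The distinction between "persuasive" and "informative" can be made concrete with a small sketch. The abstract does not state which metrics the study uses, so the function below is only one plausible way to operationalize the idea: for each explanation condition, compare how often users accept correct AI answers with how often they accept incorrect ones. The names `trust_metrics`, `appropriate_trust`, `false_trust`, and `discrimination`, and the toy records, are assumptions made for illustration and are not data or results from the study.

```python
from collections import defaultdict


def trust_metrics(responses):
    """Summarize user acceptance per explanation condition.

    `responses` is an iterable of (condition, ai_correct, user_accepted)
    tuples. This record layout is a hypothetical one chosen for the sketch.
    """
    counts = defaultdict(
        lambda: {"acc_correct": 0, "n_correct": 0, "acc_wrong": 0, "n_wrong": 0}
    )
    for condition, ai_correct, user_accepted in responses:
        c = counts[condition]
        if ai_correct:
            c["n_correct"] += 1
            c["acc_correct"] += int(user_accepted)
        else:
            c["n_wrong"] += 1
            c["acc_wrong"] += int(user_accepted)

    metrics = {}
    for condition, c in counts.items():
        # Share of correct AI answers that users accepted.
        appropriate = (
            c["acc_correct"] / c["n_correct"] if c["n_correct"] else float("nan")
        )
        # Share of incorrect AI answers that users accepted anyway.
        false_trust = c["acc_wrong"] / c["n_wrong"] if c["n_wrong"] else float("nan")
        metrics[condition] = {
            "appropriate_trust": appropriate,
            "false_trust": false_trust,
            # A gap near zero means acceptance does not track correctness.
            "discrimination": appropriate - false_trust,
        }
    return metrics


# Toy records, invented purely to exercise the function.
toy_responses = [
    ("persuasive_toy", True, True),
    ("persuasive_toy", True, True),
    ("persuasive_toy", False, True),
    ("persuasive_toy", False, True),
    ("informative_toy", True, True),
    ("informative_toy", True, True),
    ("informative_toy", False, False),
    ("informative_toy", False, True),
]

for name, m in trust_metrics(toy_responses).items():
    print(name, m)
```

Under this framing, an explanation type that raises acceptance of correct and incorrect answers alike leaves the discrimination gap unchanged, whereas an informative one widens it by lowering acceptance of incorrect answers.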

