Large Language Models (LLMs) can generate plausible free-text self-explanations to justify their answers. However, these natural language explanations may not accurately reflect the model's actual reasoning process, indicating a lack of faithfulness. Existing faithfulness evaluation methods rely primarily on behavioral tests or computational block analysis without examining the semantic content of internal neural representations. This paper proposes NeuroFaith, a flexible framework that measures the faithfulness of LLM free-text self-explanations by identifying key concepts within explanations and mechanistically testing whether these concepts actually influence the model's predictions. We show the versatility of NeuroFaith across 2-hop reasoning and classification tasks. Additionally, we develop a linear faithfulness probe based on NeuroFaith that detects unfaithful self-explanations from the representation space and improves faithfulness through steering. NeuroFaith provides a principled approach to evaluating and enhancing the faithfulness of LLM free-text self-explanations, addressing a critical need for trustworthy AI systems.
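To make the probe-and-steer idea concrete, the following is a minimal sketch, not the paper's implementation: a logistic-regression probe trained over synthetic stand-ins for hidden representations, followed by steering along the probe's weight direction. The dimensions, labels, and the `alpha` steering strength are all hypothetical; in NeuroFaith the features would be an LLM's internal representations paired with faithful/unfaithful labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical setup: 1,000 explanation representations of dimension 512,
# labeled 1 if the self-explanation was judged unfaithful, else 0.
# A latent direction correlates with unfaithfulness in this synthetic data.
d_model, n = 512, 1000
latent_dir = rng.normal(size=d_model)
latent_dir /= np.linalg.norm(latent_dir)

reps = rng.normal(size=(n, d_model))
labels = (reps @ latent_dir + 0.5 * rng.normal(size=n) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(reps, labels, random_state=0)

# Linear faithfulness probe: logistic regression over the representation space.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.3f}")

# Steering sketch: shift each representation along the probe's (normalized)
# weight vector, away from the "unfaithful" side of the decision boundary.
w = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
alpha = 2.0  # steering strength (hypothetical hyperparameter)
steered = X_test - alpha * w
print(f"unfaithful rate before: {probe.predict(X_test).mean():.3f}, "
      f"after steering: {probe.predict(steered).mean():.3f}")
```

In an actual model, the same shift would be applied to hidden activations during generation rather than to a held-out feature matrix; the probe's weight vector serves as the steering direction in both cases.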