Large language models (LLMs) can explain their own predictions through post-hoc or Chain-of-Thought (CoT) explanations. However, an LLM could make up reasonable-sounding explanations that are unfaithful to its underlying reasoning. Recent work has designed tests that aim to judge the faithfulness of post-hoc or CoT explanations. In this paper, we argue that existing faithfulness tests do not measure faithfulness in terms of the models' inner workings, but only evaluate their self-consistency at the output level. The aims of our work are twofold. i) We aim to clarify the status of existing faithfulness tests in terms of model explainability, characterising them as self-consistency tests instead. We underline this assessment by constructing a Comparative Consistency Bank for self-consistency tests that, for the first time, compares existing tests on a common suite of 11 open-source LLMs and 5 datasets, including ii) our own proposed self-consistency measure, CC-SHAP. CC-SHAP is a new fine-grained measure (not a test) of LLM self-consistency that compares how a model's input contributes to the answer prediction and to the generated explanation. With CC-SHAP, we aim to take a step further towards measuring faithfulness with a more interpretable and fine-grained method. Code is available at \url{https://github.com/Heidelberg-NLP/CC-SHAP}.
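As a rough illustration of the idea behind CC-SHAP described above, the sketch below compares two per-token contribution vectors: one for the answer prediction and one for the generated explanation. It assumes the contributions have already been computed (for example, as SHAP values). The function names, the normalisation step, and the use of cosine similarity are assumptions made for illustration and are not taken from the abstract; the linked repository contains the actual implementation.

```python
import math


def normalize(contributions):
    """Scale contributions so their absolute values sum to 1, keeping signs."""
    total = sum(abs(c) for c in contributions)
    if total == 0:
        raise ValueError("all contributions are zero")
    return [c / total for c in contributions]


def contribution_agreement(pred_contribs, expl_contribs):
    """Cosine similarity between two per-token contribution vectors.

    pred_contribs: hypothetical contributions of each input token to the answer.
    expl_contribs: hypothetical contributions of each input token to the explanation.
    Values near 1 mean both outputs rely on the input in a similar way;
    values near -1 mean they rely on it in opposite ways.
    """
    if len(pred_contribs) != len(expl_contribs):
        raise ValueError("both vectors must cover the same input tokens")
    p = normalize(pred_contribs)
    e = normalize(expl_contribs)
    dot = sum(a * b for a, b in zip(p, e))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in e))
    return dot / norm


if __name__ == "__main__":
    # Made-up contribution values for a four-token input (illustrative only).
    answer_contribs = [0.40, -0.10, 0.30, 0.05]
    explanation_contribs = [0.35, -0.05, 0.25, 0.10]
    print(contribution_agreement(answer_contribs, explanation_contribs))
```

A continuous score of this kind reflects the abstract's distinction between a measure and a test: it grades how closely the two contribution patterns agree instead of returning a pass/fail verdict.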