Large language models have exhibited impressive performance across a broad range of downstream tasks in natural language processing. However, how a language model predicts the next token and generates content is generally opaque to humans. Furthermore, these models often make errors in prediction and reasoning, known as hallucinations. These errors underscore the urgent need to better understand and interpret the intricate inner workings of language models and how they produce their predictive outputs. Motivated by this gap, this paper investigates local explainability and mechanistic interpretability within Transformer-based large language models to foster trust in such models. Our paper aims to make three key contributions. First, we review local explainability and mechanistic interpretability approaches and distill insights from relevant studies in the literature. Second, we describe experimental studies on explainability and reasoning with large language models in two critical domains, healthcare and autonomous driving, and analyze the trust implications of such explanations for explanation receivers. Finally, we summarize the currently unaddressed issues in the evolving landscape of LLM explainability and outline the opportunities, critical challenges, and future directions toward generating human-aligned, trustworthy LLM explanations.