Large language models have exhibited impressive performance across a broad range of downstream tasks in natural language processing. However, how a language model predicts the next token and generates content is generally opaque to humans. Furthermore, these models often make errors in prediction and reasoning, known as hallucinations. These errors underscore the urgent need to better understand and interpret the intricate inner workings of language models and how they produce their predictive outputs. Motivated by this gap, this paper investigates local explainability and mechanistic interpretability within Transformer-based large language models to foster trust in such models. Our paper aims to make three key contributions. First, we review local explainability and mechanistic interpretability approaches and distill insights from relevant studies in the literature. Second, we describe experimental studies on explainability and reasoning with large language models in two critical domains, healthcare and autonomous driving, and analyze the trust implications of such explanations for explanation receivers. Finally, we summarize the currently unaddressed issues in the evolving landscape of LLM explainability and outline the opportunities, critical challenges, and future directions toward generating human-aligned, trustworthy LLM explanations.