Large language models (LLMs) have demonstrated impressive capabilities in natural language processing. However, their internal mechanisms remain unclear, and this lack of transparency poses risks for downstream applications. Understanding and explaining these models is therefore crucial for elucidating their behaviors, limitations, and social impacts. In this paper, we introduce a taxonomy of explainability techniques and provide a structured overview of methods for explaining Transformer-based language models. We categorize techniques by the training paradigm of LLMs: the traditional fine-tuning-based paradigm and the prompting-based paradigm. For each paradigm, we summarize the goals and dominant approaches for generating local explanations of individual predictions and global explanations of overall model knowledge. We also discuss metrics for evaluating generated explanations and how explanations can be leveraged to debug models and improve performance. Lastly, we examine key challenges and emerging opportunities for explanation techniques in the era of LLMs compared with conventional machine learning models.