This survey examines explainability for Large Language Models (LLMs), a critical yet challenging area of natural language processing. LLMs now play a pivotal role in many applications, but their "black-box" nature raises concerns about transparency and ethical use. We argue that LLMs need better explainability, both to earn the general public's trust and to give the technical community a deeper understanding of these models. We concentrate on pre-trained Transformer-based LLMs, such as LLaMA, whose scale and complexity pose unique interpretability challenges. We categorize existing explainability methods and discuss how they are applied to improve model transparency and reliability. We also discuss representative evaluation methods, highlighting their strengths and limitations. The goal of this survey is to bridge the gap between theoretical understanding and practical application, offering insights for future research and development in LLM explainability.