Large Language Models are prone to biased predictions and hallucinations, which underlines the importance of understanding their model-internal reasoning process. However, achieving faithful attributions for the entirety of a black-box transformer model while maintaining computational efficiency remains an unsolved challenge. We address this challenge by extending the Layer-wise Relevance Propagation (LRP) attribution method to handle attention layers. While partial solutions exist, our method is the first to faithfully and holistically attribute not only the input but also the latent representations of transformer models, at a computational cost similar to that of a single backward pass. Through extensive evaluations against existing methods on LLaMa 2, Mixtral 8x7b, Flan-T5 and vision transformer architectures, we demonstrate that our approach surpasses alternative methods in terms of faithfulness and enables the understanding of latent representations, opening the door to concept-based explanations. We provide an LRP library at https://github.com/rachtibat/LRP-eXplains-Transformers.
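For readers unfamiliar with Layer-wise Relevance Propagation, the following is a minimal sketch of the widely used epsilon rule applied to a single linear layer. It is an illustration only: it does not reproduce the attention-layer rules the abstract refers to, and the function name `lrp_epsilon_linear` and the toy weights are hypothetical and not taken from the LRP library linked above.

```python
def lrp_epsilon_linear(x, W, b, relevance_out, eps=1e-6):
    """Redistribute output relevance of z = W x + b onto the inputs (epsilon rule).

    Hypothetical helper for illustration; not the API of any LRP library.
    """
    n_out, n_in = len(W), len(x)
    # Forward pass: pre-activations z_j = sum_i W[j][i] * x[i] + b[j]
    z = [sum(W[j][i] * x[i] for i in range(n_in)) + b[j] for j in range(n_out)]
    relevance_in = [0.0] * n_in
    for j in range(n_out):
        # Stabiliser: push the denominator away from zero, keeping its sign
        denom = z[j] + (eps if z[j] >= 0 else -eps)
        for i in range(n_in):
            # Input i receives relevance in proportion to its contribution to z_j
            relevance_in[i] += W[j][i] * x[i] / denom * relevance_out[j]
    return relevance_in


# Toy layer with zero bias (assumed values)
x = [1.0, 2.0, -1.0]
W = [[0.5, -1.0, 2.0],
     [1.5, 0.25, 0.0]]
b = [0.0, 0.0]

# A common choice is to start propagation from the layer outputs themselves
R_out = [sum(W[j][i] * x[i] for i in range(len(x))) + b[j] for j in range(len(W))]
R_in = lrp_epsilon_linear(x, W, b, R_out)

print("input relevance:", R_in)
print("sum in:", sum(R_in), "sum out:", sum(R_out))
```

With zero bias and a small `eps`, the summed input relevance approximately equals the summed output relevance. This conservation property is what lets relevance be propagated layer by layer in one backward sweep.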