Large Language Models are prone to biased predictions and hallucinations, which underlines the importance of understanding their internal reasoning process. However, achieving faithful attributions for the entirety of a black-box transformer model while maintaining computational efficiency remains an unsolved challenge. We address both challenges by extending the Layer-wise Relevance Propagation (LRP) attribution method to handle attention layers. While partial solutions exist, our method is the first to faithfully and holistically attribute not only the input but also the latent representations of transformer models, with computational efficiency similar to that of a single backward pass. Through extensive evaluations against existing methods on Llama 2, Flan-T5 and the Vision Transformer architecture, we demonstrate that our approach surpasses alternative methods in terms of faithfulness and enables the understanding of latent representations, opening the door to concept-based explanations. We provide an open-source implementation on GitHub: https://github.com/rachtibat/LRP-for-Transformers.