The influence of contextual input on the behavior of large language models (LLMs) has prompted the development of context attribution methods that aim to quantify each context span's effect on an LLM's generations. The leave-one-out (LOO) error, which measures the change in the likelihood of the LLM's response when a given span of the context is removed, provides a principled way to perform context attribution, but can be prohibitively expensive to compute for large models. In this work, we introduce AttriBoT, a series of novel techniques for efficiently computing an approximation of the LOO error for context attribution. Specifically, AttriBoT uses cached activations to avoid redundant operations, performs hierarchical attribution to reduce computation, and emulates the behavior of large target models with smaller proxy models. Taken together, AttriBoT can provide a >300x speedup while remaining more faithful to a target model's LOO error than prior context attribution methods. This stark increase in performance makes computing context attributions for a given response 30x faster than generating the response itself, empowering real-world applications that require computing attributions at scale. We release a user-friendly and efficient implementation of AttriBoT to enable efficient LLM interpretability as well as encourage future development of efficient context attribution methods.
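The leave-one-out scheme described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `log_likelihood` is a hypothetical stand-in for the model's response log-likelihood given a context, which in practice would require a forward pass of the LLM per ablation.

```python
def loo_attribution(spans, log_likelihood):
    """Score each context span by the drop in response log-likelihood
    when that span is removed from the context (leave-one-out error)."""
    full = log_likelihood(spans)  # likelihood with the complete context
    scores = []
    for i in range(len(spans)):
        ablated = spans[:i] + spans[i + 1:]  # context with span i removed
        scores.append(full - log_likelihood(ablated))
    return scores

# Toy likelihood for illustration only: the response is likely
# if and only if the span "evidence" is present in the context.
def toy_ll(spans):
    return -1.0 if "evidence" in spans else -5.0

scores = loo_attribution(["filler", "evidence", "noise"], toy_ll)
# "evidence" receives a large positive score; irrelevant spans score 0
```

Note that the naive loop costs one model evaluation per span, which is exactly the expense that AttriBoT's caching, hierarchical attribution, and proxy models are designed to reduce.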