Large language models (LLMs) generate outputs conditioned on extensive context, which often contains redundant information from prompts, retrieved passages, and interaction history. In critical applications, it is vital to identify which context elements actually influence the output, yet standard explanation methods struggle with redundant and overlapping context: minor changes in input can lead to unpredictable shifts in attribution scores, undermining interpretability and heightening concerns about risks such as prompt injection. This work addresses the challenge of distinguishing essential context elements from merely correlated ones. We introduce RISE (Redundancy-Insensitive Scoring of Explanation), a method that quantifies the unique influence of each input element relative to the others, minimizing the impact of redundancy and yielding clearer, more stable attributions. Experiments demonstrate that RISE provides more robust explanations than conventional methods, highlighting the importance of conditional information for trustworthy LLM explanation and monitoring.
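The abstract does not spell out RISE's estimator; as a rough, hypothetical sketch of the conditional "unique influence" idea it describes, the Python snippet below contrasts a marginal (element-in-isolation) attribution with a leave-one-out conditional attribution. All function names (`marginal_scores`, `conditional_scores`, `score_fn`) and the toy scorer are invented for illustration and are not from the paper.

```python
# Hypothetical sketch (not the paper's algorithm): contrasting marginal
# attribution with leave-one-out conditional attribution, the kind of
# "unique influence relative to others" the abstract describes.
# `score_fn` stands in for any model-based scorer, e.g. the log-probability
# an LLM assigns to a fixed output given a subset of context elements.
from typing import Callable, Sequence


def marginal_scores(elements: Sequence[str],
                    score_fn: Callable[[Sequence[str]], float]) -> list[float]:
    """Influence of each element in isolation: score({i}) - score({})."""
    base = score_fn([])
    return [score_fn([e]) - base for e in elements]


def conditional_scores(elements: Sequence[str],
                       score_fn: Callable[[Sequence[str]], float]) -> list[float]:
    """Unique influence of each element given all the others:
    score(full context) - score(full context minus element i).
    Elements whose information is duplicated elsewhere score near zero."""
    full = score_fn(list(elements))
    return [full - score_fn([e for j, e in enumerate(elements) if j != i])
            for i in range(len(elements))]


# Toy context: two mutually redundant passages and one uniquely informative one.
ctx = [
    "The Eiffel Tower stands in Paris.",   # redundant with the next passage
    "Paris is home to the Eiffel Tower.",  # redundant with the previous passage
    "Construction finished in 1889.",      # the only source of the date
]


def toy_score(subset: Sequence[str]) -> float:
    """Award one point per fact of a hypothetical output the subset supports."""
    s = 0.0
    if any("Eiffel" in p for p in subset):
        s += 1.0
    if any("1889" in p for p in subset):
        s += 1.0
    return s


print(marginal_scores(ctx, toy_score))     # [1.0, 1.0, 1.0]: redundancy makes all three look essential
print(conditional_scores(ctx, toy_score))  # [0.0, 0.0, 1.0]: only the date passage is uniquely necessary
```

On this toy example, marginal scores cannot separate the two mutually redundant passages from the uniquely informative one, while the conditional score isolates the element whose removal actually changes the output's support, the kind of redundancy-insensitive behavior the abstract attributes to RISE.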