Attn-GS: Attention-Guided Context Compression for Efficient Personalized LLMs

Shenglai Zeng, Tianqi Zheng, Chuan Tian, Dante Everaert, Yau-Shian Wang, Yupin Huang, Michael J. Morais, Rohit Patki, Jinjin Tian, Xinnan Dai, Kai Guo, Monica Xiao Cheng, Hui Liu

Personalizing large language models (LLMs) to individual users requires incorporating extensive interaction histories and profiles, but input token limits, together with the high inference latency and API costs of long inputs, make this impractical. Existing approaches rely on heuristics such as selecting recent interactions or prompting summarization models to compress user profiles. However, these methods treat the context as a monolithic whole and fail to consider how LLMs internally process and prioritize different profile components. We investigate whether LLMs' attention patterns can identify important personalization signals and thereby guide context compression. Through preliminary studies on representative personalization tasks, we find that (a) LLMs' attention patterns naturally reveal important signals, and (b) fine-tuning enhances LLMs' ability to distinguish relevant from irrelevant information. Based on these insights, we propose Attn-GS, an attention-guided context compression framework that uses attention feedback from a marking model to mark important personalization sentences, then guides a compression model to generate task-relevant, high-quality compressed user contexts. Extensive experiments demonstrate that Attn-GS significantly outperforms various baselines across different tasks, token limits, and settings, achieving performance close to that of the full context while reducing token usage by a factor of 50.
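
The marking step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how attention is aggregated or how sentences are selected, so everything here is an assumption made for illustration: the attention weights are synthetic rather than taken from a real marking model, token-level attention is mean-pooled per sentence, the top-k sentences are marked, and the names `sentence_scores`, `mark_top_k`, and the `[IMPORTANT]` tag are hypothetical.

```python
from typing import List, Sequence, Tuple


def sentence_scores(
    token_attention: Sequence[float],
    sentence_spans: Sequence[Tuple[int, int]],
) -> List[float]:
    """Mean attention per sentence; each span is a [start, end) token range.

    Mean-pooling is an assumption, not the aggregation used in the paper.
    """
    scores = []
    for start, end in sentence_spans:
        if end <= start:
            raise ValueError(f"empty or inverted span: ({start}, {end})")
        scores.append(sum(token_attention[start:end]) / (end - start))
    return scores


def mark_top_k(
    sentences: Sequence[str],
    scores: Sequence[float],
    k: int,
    marker: str = "[IMPORTANT]",  # hypothetical tag, chosen for illustration
) -> List[str]:
    """Prefix the k highest-scoring sentences with a marker, keeping order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:k])
    return [f"{marker} {s}" if i in top else s for i, s in enumerate(sentences)]


# Synthetic user profile: three sentences of three tokens each.
sentences = [
    "Bought trail running shoes.",
    "Viewed a phone charger.",
    "Reviewed a hiking backpack.",
]
spans = [(0, 3), (3, 6), (6, 9)]
# Made-up attention weights standing in for a marking model's output.
attention = [0.20, 0.15, 0.10, 0.02, 0.03, 0.01, 0.25, 0.14, 0.10]

scores = sentence_scores(attention, spans)
marked = mark_top_k(sentences, scores, k=2)
print("\n".join(marked))
```

In the framework as described, the marked profile would then be passed to a separate compression model; that step is not shown here.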
