Large language models (LLMs) often rely on user-specific memories distilled from past interactions to enable personalized generation. A common practice is to concatenate these memories with the input prompt, but this approach quickly exhausts the limited context available in on-device LLMs. Compressing memories by averaging can mitigate context growth, yet it frequently harms performance due to semantic conflicts across heterogeneous memories. In this work, we introduce a clustering-based memory compression strategy that balances context efficiency and personalization quality. Our method groups memories by similarity and merges them within clusters prior to concatenation, thereby preserving coherence while reducing redundancy. Experiments demonstrate that our approach substantially lowers the number of memory tokens while outperforming baseline strategies such as naive averaging or direct concatenation. Furthermore, for a fixed context budget, clustering-driven merging yields more compact memory representations and consistently enhances generation quality.
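The pipeline described above (group memories by similarity, merge within each cluster, then concatenate) can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: it assumes precomputed memory embeddings, uses a greedy cosine-similarity threshold in place of a specific clustering method, and merges cluster members by simple text concatenation as a stand-in for a learned or averaged merge.

```python
import numpy as np

def cluster_memories(embeddings, threshold=0.8):
    """Greedy similarity clustering: assign each memory to the first
    cluster whose centroid passes the cosine-similarity threshold.
    (Illustrative stand-in for the clustering step described in the text.)"""
    clusters = []   # list of lists of memory indices
    centroids = []  # running sum of member embeddings per cluster
    for i, e in enumerate(embeddings):
        e = np.asarray(e, dtype=float)
        e = e / np.linalg.norm(e)
        best, best_sim = None, threshold
        for c, cent in enumerate(centroids):
            sim = float(e @ (cent / np.linalg.norm(cent)))
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([i])
            centroids.append(e.copy())
        else:
            clusters[best].append(i)
            centroids[best] += e  # update running sum; normalized on use
    return clusters

def compress_memories(memories, embeddings, threshold=0.8):
    """Merge memories within each cluster (here: plain concatenation of
    member texts) so fewer, coherent memory strings enter the prompt."""
    clusters = cluster_memories(embeddings, threshold)
    return ["; ".join(memories[i] for i in group) for group in clusters]

# Toy example: two semantically close memories collapse into one entry,
# while the unrelated memory stays separate.
memories = ["likes tea", "prefers tea", "owns a dog"]
embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]  # hypothetical embeddings
merged = compress_memories(memories, embeddings)
```

In this sketch, only semantically compatible memories are averaged/merged together, which is the mechanism the abstract credits for avoiding the semantic conflicts that hurt naive global averaging.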