Structured Distillation for Personalized Agent Memory: 11x Token Reduction with Retrieval Preservation

Long conversations with an AI agent pose a simple problem for the user: the history is useful, but carrying it verbatim is expensive. We study personalized agent memory: one user's conversation history with an agent, distilled into a compact retrieval layer for later search. Each exchange is compressed into a compound object with four fields (exchange_core, specific_context, thematic room_assignments, and regex-extracted files_touched). The searchable distilled text averages 38 tokens per exchange. Applied to 4,182 conversations (14,340 exchanges) from 6 software engineering projects, the method reduces average exchange length from 371 to 38 tokens, yielding 11x compression. We evaluate whether personalized recall survives that compression using 201 recall-oriented queries, 107 configurations spanning 5 pure and 5 cross-layer search modes, and 5 LLM graders (214,519 consensus-graded query-result pairs). The best pure distilled configuration reaches 96% of the best verbatim MRR (0.717 vs 0.745). Results are mechanism-dependent: none of the 20 vector search configurations shows a statistically significant difference after Bonferroni correction, while all 20 BM25 configurations degrade significantly (effect sizes |d|=0.031-0.756). The best cross-layer setup slightly exceeds the best pure verbatim baseline (MRR 0.759 vs 0.745). Structured distillation thus compresses single-user agent memory without uniformly sacrificing retrieval quality. At 1/11 the context cost, thousands of exchanges fit within a single prompt while the verbatim source remains available for drill-down. We release the implementation and analysis pipeline as open-source software.
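
To make the four-field object and the MRR comparison concrete, the following is a minimal sketch and not the released implementation. The field names come from the abstract. Everything else is an assumption made for illustration: the file-path regex, the choice of which fields are joined into the searchable text, the helper names extract_files_touched and mean_reciprocal_rank, and the sample exchange.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed pattern for path-like tokens ending in a short extension.
# The abstract only says files_touched is regex-extracted; the actual
# pattern used by the authors is not given.
FILE_PATTERN = re.compile(r"[\w./-]+\.[A-Za-z]{1,5}\b")


def extract_files_touched(verbatim_text: str) -> List[str]:
    """Return the sorted, de-duplicated path-like tokens found in the text."""
    return sorted(set(FILE_PATTERN.findall(verbatim_text)))


@dataclass
class DistilledExchange:
    """One exchange compressed into the four fields named in the abstract."""
    exchange_core: str            # what the exchange was about
    specific_context: str         # details that distinguish it from similar ones
    room_assignments: List[str]   # thematic groupings
    files_touched: List[str] = field(default_factory=list)

    def searchable_text(self) -> str:
        # Assumption: the searchable text joins the prose fields and rooms.
        parts = [self.exchange_core, self.specific_context, *self.room_assignments]
        return " ".join(p for p in parts if p)


def mean_reciprocal_rank(first_relevant_ranks: List[Optional[int]]) -> float:
    """MRR over queries; each entry is the 1-based rank of the first
    relevant result, or None when no relevant result was returned."""
    if not first_relevant_ranks:
        raise ValueError("need at least one query")
    total = sum(1.0 / rank for rank in first_relevant_ranks if rank)
    return total / len(first_relevant_ranks)


# Hypothetical exchange, invented for this sketch.
verbatim = "Fixed the retry loop in src/net/client.py and updated tests/test_client.py."
example = DistilledExchange(
    exchange_core="Fixed retry loop in HTTP client",
    specific_context="Backoff was not reset after a successful request",
    room_assignments=["networking"],
    files_touched=extract_files_touched(verbatim),
)

# Ratio of the reported MRR values: best pure distilled vs best verbatim.
mrr_retention = 0.717 / 0.745
```

Under these assumptions, mrr_retention rounds to 0.96, which corresponds to the 96% figure reported above. Keeping files_touched as a separate regex-derived field, rather than folding it into the prose fields, lets exact path matches be handled independently of the free-text search.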
