Large language models (LLMs) excel at many NLP tasks but struggle to sustain long-term interactions due to limited attention over extended dialogue histories. Retrieval-augmented generation (RAG) mitigates this issue but lacks reliable mechanisms for updating or refining stored memories, leading to schema-driven hallucinations, inefficient write operations, and minimal support for multimodal reasoning. To address these challenges, we propose TeleMem, a unified long-term and multimodal memory system that maintains coherent user profiles through narrative dynamic extraction, ensuring that only dialogue-grounded information is preserved. TeleMem further introduces a structured writing pipeline that batches, retrieves, clusters, and consolidates memory entries, substantially improving storage efficiency, reducing token usage, and accelerating memory operations. Additionally, a multimodal memory module combined with ReAct-style reasoning equips the system with a closed-loop observe-think-act process, enabling accurate understanding of complex video content in long-term contexts. Experimental results show that TeleMem surpasses the state-of-the-art Mem0 baseline on the ZH-4O long-term role-play gaming benchmark, achieving 19% higher accuracy, 43% fewer tokens, and a 2.1x speedup.
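The batch → retrieve → cluster → consolidate write pipeline described above can be sketched as follows. This is a minimal illustrative sketch, not TeleMem's actual implementation: the function names, the Jaccard token-overlap similarity (standing in for embedding-based retrieval), and the string-concatenation merge (standing in for LLM-based consolidation) are all assumptions introduced for clarity.

```python
# Hypothetical sketch of a batched memory-write pipeline in the spirit of
# TeleMem's batch/retrieve/cluster/consolidate stages. All names and the
# similarity measure are illustrative assumptions, not TeleMem's API.

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity, a stand-in for embedding retrieval."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def write_batch(store: list[str], new_entries: list[str],
                threshold: float = 0.5) -> list[str]:
    """Process a batch of new entries: retrieve the most similar stored
    memory for each, and either consolidate (if similar enough) or append."""
    for entry in new_entries:
        # Retrieve: find the closest existing memory.
        best_i, best_sim = -1, 0.0
        for i, mem in enumerate(store):
            sim = jaccard(entry, mem)
            if sim > best_sim:
                best_i, best_sim = i, sim
        if best_sim >= threshold:
            # Consolidate: merge into the matched memory. A real system
            # would rewrite the cluster with an LLM; here we concatenate.
            store[best_i] = store[best_i] + " | " + entry
        else:
            # No sufficiently similar memory: write a fresh entry.
            store.append(entry)
    return store

memories = write_batch([], ["user likes sci-fi movies",
                            "user likes sci-fi novels too"])
print(len(memories))  # the two overlapping entries consolidate into one
```

Batching the writes this way lets one retrieval pass amortize over many entries, which is the source of the storage-efficiency and token-reduction gains the abstract reports.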