Large language model (LLM) agents struggle with long-horizon tasks because they are inherently stateless: all task-relevant information must be encoded in an ever-growing input context. The resulting degradation in reasoning quality, together with higher inference cost and latency, calls for efficient working memory mechanisms. However, existing approaches rely on either lossy compression or similarity-based retrieval, both of which often fail to capture the temporal structure and causal dependencies required for multi-step agentic tasks. In this work, we present HORMA, a Hierarchical Organize-and-Retrieve Memory Agent that organizes experience into a file-system-like hierarchical structure in which summarized entities are linked to the corresponding raw trajectories, enabling efficient access without losing detailed information. HORMA decomposes working memory into two stages: structured memory construction and navigation-based retrieval. The construction module iteratively refines how experiences are structured by distinguishing failures caused by missing information from those caused by misleading or overloaded context. The navigation module retrieves task-relevant context by traversing the hierarchy with a lightweight agent trained with reinforcement learning to select minimal yet sufficient context, thereby reducing latency along the critical execution path. Across ALFWorld, LoCoMo, and LongMemEval, HORMA improves task performance under constrained context budgets while requiring at most 22.17% of the baseline token usage in long-conversation tasks. Compared to existing methods, it consistently achieves better efficiency-performance trade-offs and generalizes effectively to unseen tasks.
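
To make the idea of a file-system-like memory concrete, the following is a minimal Python sketch of a hierarchy whose nodes hold a summary linked to a raw trajectory, together with a budget-aware traversal. It is an illustration under stated assumptions, not HORMA's implementation: `MemoryNode`, `overlap`, and `navigate` are hypothetical names, the greedy word-overlap heuristic is only a stand-in for the reinforcement-learning-trained navigation agent, and whitespace-separated words are used as a crude proxy for tokens.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MemoryNode:
    """One entry in the hierarchy: a short summary plus an optional link to raw data."""
    name: str
    summary: str
    raw_trajectory: str | None = None  # kept verbatim so detail is not lost
    children: list[MemoryNode] = field(default_factory=list)


def overlap(query: str, text: str) -> int:
    """Toy relevance score: number of shared lowercase words (assumption, not a learned policy)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def navigate(root: MemoryNode, query: str, token_budget: int) -> str:
    """Greedily descend from the root, following the best-scoring child at each level.

    Returns the raw trajectory of the node where descent stops if it fits the
    budget; otherwise falls back to that node's shorter summary.
    """
    node = root
    while node.children:
        best = max(node.children, key=lambda child: overlap(query, child.summary))
        if overlap(query, best.summary) == 0:
            break  # nothing below looks relevant, so stop at the current node
        node = best
    context = node.raw_trajectory or node.summary
    if len(context.split()) > token_budget:  # whitespace words as a crude token count
        context = node.summary
    return context


# Hypothetical two-level memory built by hand for illustration.
root = MemoryNode("root", "all episodes", children=[
    MemoryNode("kitchen", "tasks in the kitchen heat mug microwave", children=[
        MemoryNode(
            "ep1",
            "heat mug in microwave",
            raw_trajectory="go to microwave; open microwave; put mug; heat mug",
        ),
    ]),
    MemoryNode(
        "bedroom",
        "tasks in the bedroom find book on desk",
        raw_trajectory="go to desk; take book",
    ),
])

print(navigate(root, "how to heat a mug", token_budget=50))
```

The design point the sketch tries to capture is that summaries guide navigation while raw trajectories remain reachable, so a tight budget degrades the returned context to a summary instead of discarding the episode entirely.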