ELMUR: External Layer Memory with Update/Rewrite for Long-Horizon RL Problems

Real-world robotic agents must act under partial observability and long horizons, where key cues may appear long before they affect decision making. However, most modern approaches rely solely on instantaneous information, without incorporating insights from the past. Standard recurrent or transformer models struggle to retain and leverage long-term dependencies: context windows truncate history, while naive memory extensions fail under scale and sparsity. We propose ELMUR (External Layer Memory with Update/Rewrite), a transformer architecture with structured external memory. Each layer maintains memory embeddings, interacts with them via bidirectional cross-attention, and updates them through a Least Recently Used (LRU) memory module using replacement or convex blending. ELMUR extends effective horizons up to 100,000 times beyond the attention window and achieves a 100% success rate on a synthetic T-Maze task with corridors up to one million steps. In POPGym, it outperforms baselines on more than half of the tasks. On MIKASA-Robo sparse-reward manipulation tasks with visual observations, it nearly doubles the performance of strong baselines, achieving the best success rate on 21 out of 23 tasks and improving the aggregate success rate across all tasks by about 70% over the previous best baseline. These results demonstrate that structured, layer-local external memory offers a simple and scalable approach to decision making under partial observability. Code and project page: https://elmur-paper.github.io/.
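To make the LRU update concrete, the following is a minimal sketch of a slot memory that is updated by replacement or convex blending. It is not taken from the ELMUR code. The class name `LRUMemory`, the `blend` coefficient, and the rule for choosing between the two modes (fill empty slots by replacement first, then blend into the least recently used slot) are assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


class LRUMemory:
    """Toy LRU slot memory (hypothetical; not the authors' implementation)."""

    def __init__(self, num_slots, dim, blend=0.5):
        self.slots = np.zeros((num_slots, dim))
        # Step at which each slot was last written; -1 marks an empty slot.
        self.last_used = np.full(num_slots, -1)
        self.blend = blend  # assumed convex-blending weight in [0, 1]
        self.clock = 0

    def write(self, candidate):
        """Write one candidate vector and return the index of the slot used."""
        candidate = np.asarray(candidate, dtype=float)
        empty = np.flatnonzero(self.last_used < 0)
        if empty.size > 0:
            # Assumed rule: an empty slot is filled by plain replacement.
            idx = int(empty[0])
            self.slots[idx] = candidate
        else:
            # Otherwise blend into the least recently used slot:
            # new = blend * candidate + (1 - blend) * old
            idx = int(np.argmin(self.last_used))
            self.slots[idx] = (
                self.blend * candidate + (1.0 - self.blend) * self.slots[idx]
            )
        self.last_used[idx] = self.clock
        self.clock += 1
        return idx


mem = LRUMemory(num_slots=2, dim=2, blend=0.25)
mem.write([1.0, 0.0])  # fills slot 0 by replacement
mem.write([0.0, 1.0])  # fills slot 1 by replacement
mem.write([4.0, 4.0])  # memory is full: blends into slot 0, the oldest
```

In this sketch the number of slots stays fixed no matter how many writes occur, which is one way a bounded memory could carry information across far more steps than the attention window covers.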
