Scaling Short-Term Memory of Visuomotor Policies for Long-Horizon Tasks

Many robotic tasks require short-term memory, whether retrieving an object that is no longer visible or turning off an appliance after a set period. Yet most visuomotor policies trained via imitation learning rely only on immediate sensory input and do not use past experience to guide decisions. We present PRISM, a transformer-based architecture that lets visuomotor policies use short-term memory effectively through two key components: (i) gated attention, which filters retrieved information to suppress irrelevant details and improves performance by reducing spurious correlations between the history and the current action prediction, and (ii) a hierarchical architecture that first compresses local information into compact tokens and then integrates them to capture temporally extended dependencies, which reduces compute and memory footprint. Together, these mechanisms enable us to scale short-term memory in visuomotor policies to up to two minutes. To systematically evaluate memory in visuomotor control, we introduce ReMemBench -- a benchmark of eight diverse household manipulation tasks spanning four categories of short-term memory -- designed to foster general memory mechanisms rather than siloed, task-specific solutions. PRISM consistently outperforms prior work, including recurrent architectures, transformers, and their variants, achieving an absolute improvement of 5%--12% over the strongest baseline. On the RoboCasa and LIBERO benchmarks, it achieves absolute improvements of 11%--15% over its no-memory variant and over fine-tuned Vision-Language-Action baselines such as GR00T-N1-3B and OpenVLA, despite not leveraging any large-scale pretraining. Together, PRISM and ReMemBench establish a foundation for developing and evaluating short-term memory-augmented visuomotor policies that scale to long-horizon tasks. Additional materials are available at https://shahrutav.github.io/short-term-memory
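The two components named above can be illustrated with a toy NumPy sketch. It is not the PRISM implementation: the mean-pooling compressor, the query-conditioned sigmoid gate, and every name and size in it (`compress_history`, `gated_attention`, `w_gate`, `chunk_size`, 120 history steps, 16-dimensional tokens) are assumptions chosen for illustration. It shows only the general idea of compressing local chunks of history into compact tokens, attending over those tokens, and gating what is retrieved.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def compress_history(history, chunk_size):
    """Mean-pool consecutive chunks of per-step tokens into one token each.

    history: (T, d) array; T must be divisible by chunk_size (assumption).
    Mean pooling stands in for a learned compressor (assumption).
    """
    T, d = history.shape
    assert T % chunk_size == 0, "T must be divisible by chunk_size"
    return history.reshape(T // chunk_size, chunk_size, d).mean(axis=1)

def gated_attention(query, memory, w_gate):
    """Single-head attention whose output is scaled by a sigmoid gate.

    query: (d,), memory: (M, d), w_gate: (d, d) hypothetical gate weights.
    """
    d = query.shape[0]
    scores = memory @ query / np.sqrt(d)   # (M,) similarity to each memory token
    weights = softmax(scores)              # (M,) attention distribution
    retrieved = weights @ memory           # (d,) weighted sum of memory tokens
    gate = sigmoid(w_gate @ query)         # (d,) per-feature gate in (0, 1)
    return gate * retrieved, gate          # gate near 0 suppresses a feature

rng = np.random.default_rng(0)
T, d, chunk_size = 120, 16, 10             # toy sizes (assumptions)
history = rng.normal(size=(T, d))          # stand-in for past observation tokens
query = rng.normal(size=d)                 # stand-in for the current-step token
w_gate = rng.normal(size=(d, d)) * 0.1

memory = compress_history(history, chunk_size)   # 120 tokens -> 12 tokens
output, gate = gated_attention(query, memory, w_gate)
print(memory.shape, output.shape)          # (12, 16) (16,)
```

In this sketch the attention step runs over 12 compressed tokens instead of 120 raw ones, which is the source of the compute and memory saving, and the gate can drive individual retrieved features toward zero when they are not useful for the current prediction.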
