Vision-language-action (VLA) models for closed-loop robot control are typically cast under the Markov assumption, making them prone to errors on tasks requiring historical context. To incorporate memory, existing VLAs either retrieve from a memory bank, which can be misled by distractors, or extend the frame window, whose fixed horizon still limits long-term retention. In this paper, we introduce ReMem-VLA, a Recurrent Memory VLA model equipped with two sets of learnable queries: frame-level recurrent memory queries for propagating information across consecutive frames to support short-term memory, and chunk-level recurrent memory queries for carrying context across temporal chunks for long-term memory. These queries are trained end-to-end to aggregate and maintain relevant context over time, implicitly guiding the model's decisions without additional training or inference cost. Furthermore, to enhance visual memory, we introduce Past Observation Prediction as an auxiliary training objective. Through extensive memory-centric simulation and real-world robot experiments, we demonstrate that ReMem-VLA exhibits strong memory capabilities across multiple dimensions, including spatial, sequential, episodic, temporal, and visual memory. ReMem-VLA significantly outperforms memory-free VLA baselines $\pi_{0.5}$ and OpenVLA-OFT and surpasses MemoryVLA on memory-dependent tasks by a large margin.
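To make the recurrent-query mechanism concrete, below is a minimal, hypothetical PyTorch sketch of how frame-level recurrent memory queries might propagate context across consecutive frames. All names, dimensions, and the residual cross-attention update are illustrative assumptions, not the authors' implementation; the same pattern would apply at a coarser granularity for the chunk-level queries.

```python
import torch
import torch.nn as nn
from typing import Optional


class RecurrentMemoryQueries(nn.Module):
    """Hypothetical sketch: learnable memory queries cross-attend to the
    current frame's visual tokens and are carried forward to the next frame
    (frame-level, short-term memory)."""

    def __init__(self, num_queries: int = 16, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Learnable initial queries, used at the first frame of an episode.
        self.init_queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_tokens: torch.Tensor,
                prev_memory: Optional[torch.Tensor] = None) -> torch.Tensor:
        # frame_tokens: (B, N, D) visual tokens of the current observation.
        B = frame_tokens.size(0)
        q = prev_memory if prev_memory is not None else self.init_queries.expand(B, -1, -1)
        # Memory queries read from the current frame, then update residually.
        attended, _ = self.cross_attn(query=q, key=frame_tokens, value=frame_tokens)
        return self.norm(q + attended)  # fed back in as prev_memory at the next frame


# Toy recurrence over a short rollout (batch 2, 32 tokens per frame, dim 512).
mem_module = RecurrentMemoryQueries()
memory = None
for _ in range(5):  # five consecutive frames
    frame = torch.randn(2, 32, 512)
    memory = mem_module(frame, memory)
print(memory.shape)  # torch.Size([2, 16, 512])
```

A chunk-level analogue would run the same residual update over chunk summaries rather than per-frame tokens, carrying context across temporal chunks as described in the abstract.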