MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

Temporal context is essential for robotic manipulation because such tasks are inherently non-Markovian, yet mainstream vision-language-action (VLA) models typically overlook it and struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived representations for immediate control, while the hippocampal system preserves verbatim episodic details and the semantic gist of past experience for long-term memory. Inspired by these mechanisms, we propose MemoryVLA, a Cognition-Memory-Action framework for long-horizon robotic manipulation. A pretrained vision-language model (VLM) encodes the observation into perceptual and cognitive tokens that form working memory, while a Perceptual-Cognitive Memory Bank stores low-level details and high-level semantics consolidated from it. Working memory retrieves decision-relevant entries from the bank, adaptively fuses them with the current tokens, and updates the bank by merging redundant entries. Using these tokens, a memory-conditioned diffusion action expert yields temporally aware action sequences. We evaluate MemoryVLA on 150+ simulation and real-world tasks across three robots. On SimplerEnv-Bridge, Fractal, the LIBERO-5 suites, and Mikasa-Robo, it achieves success rates of 71.9%, 72.7%, 96.5%, and 41.2%, respectively, outperforming the state-of-the-art baselines CogACT and pi-0 on all four, with notable gains of +14.6 points on Bridge and +11.8 points on Mikasa-Robo. On 12 real-world tasks spanning general skills and long-horizon temporal dependencies, MemoryVLA achieves an 84.0% success rate, with long-horizon tasks showing a +26 point improvement over the state-of-the-art baseline. Project Page: https://shihao1895.github.io/MemoryVLA
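The retrieve, fuse, and update cycle described above can be illustrated with a toy sketch. It is not the MemoryVLA implementation: the class `MemoryBank`, the function `gated_fuse`, the use of scaled dot-product attention for retrieval, the fixed similarity-based scalar gate (standing in for a learned one), and the rule of averaging the most similar adjacent pair when the bank is over capacity are all assumptions chosen for illustration. Tokens are plain lists of floats.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0


class MemoryBank:
    """Hypothetical fixed-capacity bank of token vectors, oldest first."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.entries = []

    def retrieve(self, query):
        """Softmax-weighted average of stored entries (assumed retrieval rule)."""
        if not self.entries:
            return list(query)  # empty bank: fall back to the query itself
        scale = math.sqrt(len(query))
        scores = [dot(query, e) / scale for e in self.entries]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        return [
            sum(w * e[i] for w, e in zip(weights, self.entries)) / total
            for i in range(len(query))
        ]

    def update(self, token):
        """Append a token; if over capacity, merge the most similar adjacent pair."""
        self.entries.append(list(token))
        if len(self.entries) > self.capacity:
            i = max(
                range(len(self.entries) - 1),
                key=lambda k: cosine(self.entries[k], self.entries[k + 1]),
            )
            merged = [(x + y) / 2 for x, y in zip(self.entries[i], self.entries[i + 1])]
            self.entries[i:i + 2] = [merged]


def gated_fuse(current, retrieved):
    """Blend current and retrieved tokens with a scalar sigmoid gate (assumed form)."""
    gate = 1.0 / (1.0 + math.exp(-dot(current, retrieved) / math.sqrt(len(current))))
    return [gate * r + (1.0 - gate) * c for c, r in zip(current, retrieved)]


# One pass over a short stream of made-up 2-D "tokens".
bank = MemoryBank(capacity=2)
for obs in [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]:
    fused = gated_fuse(obs, bank.retrieve(obs))
    bank.update(fused)
print(len(bank.entries), "entries kept after consolidation")
```

Merging adjacent entries rather than evicting the oldest keeps a compressed trace of the whole episode within a fixed budget, which is one plausible reading of "merging redundancies"; a trained model would replace the fixed gate and raw dot products with learned projections.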
