Notes-to-Self: Scratchpad Augmented VLAs for Memory Dependent Manipulation Tasks

Many dexterous manipulation tasks are non-Markovian in nature, yet little attention has been paid to this fact in the recent upsurge of the vision-language-action (VLA) paradigm. Although existing VLAs succeed in bringing internet-scale semantic understanding to robotics, they are primarily "stateless" and struggle with memory-dependent, long-horizon tasks. In this work, we explore a way to impart both spatial and temporal memory to a VLA by incorporating a language scratchpad. The scratchpad lets the model memorize task-specific information, such as object positions, and keep track of a plan and its progress towards the subgoals within that plan. We evaluate this approach on a split of memory-dependent tasks from the ClevrSkills environment, on MemoryBench, and on a challenging real-world pick-and-place task. We show that incorporating a language scratchpad significantly improves generalization on these tasks for both non-recurrent and recurrent models.
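As a rough illustration of the idea, the sketch below models a language scratchpad as a small text memory that stores object positions and a plan with subgoal progress, and is rendered into the prompt at every step. It is not the paper's implementation: the `Scratchpad` class, its fields, the text format, and `build_prompt` are all hypothetical names chosen for this example, and how the actual model reads and writes its scratchpad is not specified in the abstract.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Scratchpad:
    """Hypothetical text memory carried across policy steps (illustrative only)."""
    object_positions: dict[str, tuple[float, float]] = field(default_factory=dict)
    plan: list[str] = field(default_factory=list)
    completed: int = 0  # number of subgoals finished so far

    def remember(self, name: str, xy: tuple[float, float]) -> None:
        # Spatial memory: store where an object was last seen.
        self.object_positions[name] = xy

    def current_subgoal(self) -> Optional[str]:
        # Temporal memory: the first subgoal that is not yet done.
        if self.completed < len(self.plan):
            return self.plan[self.completed]
        return None

    def mark_done(self) -> None:
        # Advance progress by one subgoal, never past the end of the plan.
        if self.completed < len(self.plan):
            self.completed += 1

    def render(self) -> str:
        # Serialize the memory as plain text so a language model can read it.
        lines = [
            f"{name} at ({x:.2f}, {y:.2f})"
            for name, (x, y) in sorted(self.object_positions.items())
        ]
        for i, step in enumerate(self.plan):
            mark = "x" if i < self.completed else " "
            lines.append(f"[{mark}] {step}")
        return "\n".join(lines)


def build_prompt(instruction: str, pad: Scratchpad) -> str:
    # Assumed prompt layout: the instruction followed by the scratchpad text.
    return f"Instruction: {instruction}\nScratchpad:\n{pad.render()}"
```

In this sketch an otherwise stateless policy would receive `build_prompt(...)` alongside the current image, so information such as an object position observed earlier remains available even after the object leaves the camera view.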

