Many dexterous manipulation tasks are non-Markovian in nature, yet this fact has received little attention in the recent upsurge of the vision-language-action (VLA) paradigm. Although existing VLAs succeed in bringing internet-scale semantic understanding to robotics, they are primarily "stateless" and struggle with memory-dependent, long-horizon tasks. In this work, we explore a way to impart both spatial and temporal memory to a VLA by incorporating a language scratchpad. The scratchpad lets the model memorize task-specific information, such as object positions, and keep track of a plan and its progress towards subgoals within that plan. We evaluate this approach on a split of memory-dependent tasks from the ClevrSkills environment, on MemoryBench, and on a challenging real-world pick-and-place task. We show that incorporating a language scratchpad significantly improves generalization on these tasks for both non-recurrent and recurrent models.