Recursive Belief Vision Language Action Models

Vision-language-action (VLA) models must enable agents to execute long-horizon tasks under partial observability. However, most existing approaches remain observation-driven, relying on short context windows or repeated queries to vision-language models (VLMs). This leads to loss of task progress, repeated actions under perceptual aliasing, and high inference latency. While semantic grounding is important, long-horizon manipulation fundamentally requires persistent, action-conditioned state representations. Current VLAs lack such representations and exhibit limited temporal and physical reasoning, making them ill-suited for multi-stage control. This paper introduces RB-VLA, a belief-centric architecture trained with self-supervised world-model objectives that maintains a compact latent state encoding task-relevant history, dynamics, and object interactions. The VLM is queried once per task to provide high-level intent, while the belief tracks task progress and enables phase-aware, causally grounded control under partial observability, without storing raw observations or scaling memory with time. The belief and intent jointly condition a diffusion policy for robust closed-loop execution. RB-VLA outperforms prior VLAs on long-horizon benchmarks, achieving 52.5 percent and 37.5 percent higher success rates than pi_0 on multi-stage pick-and-place and stacking tasks, respectively. It also reduces inference latency by up to a factor of five relative to baselines and eliminates the memory growth across timesteps observed in existing VLAs. Ablations show that the belief module is the primary driver of performance, increasing the success rate from 32.5 percent without belief to 77.5 percent with belief.
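
The control loop described in the abstract can be illustrated with a minimal sketch. Nothing below is RB-VLA's actual implementation: the class `ToyBeliefAgent`, the dimensions, the random weights, and the character-sum "intent embedding" are all hypothetical stand-ins. A simple tanh recurrence replaces the learned world-model belief update, and a linear map replaces the diffusion policy. The sketch is meant to show only the structure: the intent is computed once per task, the belief is a fixed-size vector updated from the previous belief, the current observation, and the previous action, and the action depends on belief and intent rather than on stored raw observations.

```python
import math
import random

# All sizes are illustrative assumptions, not values from the paper.
BELIEF_DIM = 8   # fixed size, so memory does not grow with the number of timesteps
OBS_DIM = 4
ACT_DIM = 2
INTENT_DIM = 3


def make_matrix(rows, cols, rng):
    """Random weights standing in for learned parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product on Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToyBeliefAgent:
    """Hypothetical agent: recurrent belief plus a one-time intent query."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.w_belief = make_matrix(BELIEF_DIM, BELIEF_DIM + OBS_DIM + ACT_DIM, rng)
        self.w_policy = make_matrix(ACT_DIM, BELIEF_DIM + INTENT_DIM, rng)
        self.belief = [0.0] * BELIEF_DIM
        self.prev_action = [0.0] * ACT_DIM
        self.intent = None
        self.vlm_calls = 0

    def query_vlm(self, instruction):
        """Stand-in for the single per-task VLM query (no real VLM is called)."""
        self.vlm_calls += 1
        sums = [0.0] * INTENT_DIM
        for i, ch in enumerate(instruction):
            sums[i % INTENT_DIM] += ord(ch) / 1000.0
        self.intent = [math.tanh(s) for s in sums]

    def step(self, observation):
        """One closed-loop step: update the belief, then choose an action."""
        # Recursive, action-conditioned update: b_t = tanh(W [b_{t-1}; o_t; a_{t-1}])
        belief_input = self.belief + list(observation) + self.prev_action
        self.belief = [math.tanh(v) for v in matvec(self.w_belief, belief_input)]
        # Policy conditioned on belief and intent (linear stand-in for a diffusion policy)
        policy_input = self.belief + self.intent
        self.prev_action = [math.tanh(v) for v in matvec(self.w_policy, policy_input)]
        return self.prev_action


def run_episode(num_steps=50, seed=0):
    """Run one toy task with random observations; returns the agent and last action."""
    rng = random.Random(seed + 1)
    agent = ToyBeliefAgent(seed)
    agent.query_vlm("stack the red block on the blue block")  # queried once per task
    action = agent.prev_action
    for _ in range(num_steps):
        observation = [rng.uniform(-1.0, 1.0) for _ in range(OBS_DIM)]
        action = agent.step(observation)
    return agent, action
```

Calling `run_episode(500)` instead of `run_episode(50)` leaves the agent's state the same size, since only the belief vector, the previous action, and the intent are retained between steps. Because the belief carries history, two identical observations reached through different histories can yield different actions, which is the property the abstract relies on to handle perceptual aliasing.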
