Current Vision-Language-Action (VLA) models face a trade-off between efficient action generation and explicit deliberation. Decoding actions directly from vision-language backbone representations enables low-latency control, whereas explicit reasoning through textual chains, pixel-level subgoals, or action search can improve planning but incurs substantial latency and computational cost. We propose PearlVLA, a VLA framework that moves deliberation into the latent space of a vision-language model (VLM). PearlVLA separates the VLM's meta-query representations into a fixed visual grounding branch and an iteratively refined latent plan branch. In each refinement round, a plan-conditioned world query probes a lightweight, frozen latent world model for an action-free future observation latent, which is fed back to guide plan refinement. A future-guided RefineNet applies scheduled residual updates that progressively refine a coarse semantic draft into a fine-grained latent action plan. After K rounds, the refined plan is decoded in parallel into an action chunk for low-latency execution. We further introduce Causal Refinement-Grouped Process-Reward RL, which optimizes the latent refinement process using rewards from longer-horizon imagined futures induced by latent plan edits. In evaluations on the LIBERO benchmark, PearlVLA achieves state-of-the-art performance among existing methods.
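The refinement loop described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the function names (`world_model`, `refine_net`, `schedule`, `refine_plan`, `decode_actions`), the plain-list latent vectors, the `tanh` world model, and the linearly decaying step size are all hypothetical stand-ins. The sketch only shows the control flow: a fixed grounding latent, a plan latent that is queried against a frozen world model and updated by scheduled residuals for K rounds, and a single parallel decode into an action chunk.

```python
import math
from typing import List

Vector = List[float]


def world_model(grounding: Vector, plan: Vector) -> Vector:
    """Hypothetical frozen latent world model.

    Takes a plan-conditioned query (here simply grounding + plan) and
    returns a future observation latent. No actions are passed in,
    mirroring the "action-free" future latent of the abstract.
    """
    return [math.tanh(g + p) for g, p in zip(grounding, plan)]


def refine_net(plan: Vector, future: Vector) -> Vector:
    """Hypothetical RefineNet: proposes a residual edit to the plan,
    guided by the imagined future latent."""
    return [f - p for f, p in zip(future, plan)]


def schedule(k: int, num_rounds: int) -> float:
    """Assumed step-size schedule: larger edits early, smaller edits late."""
    return 0.5 * (1.0 - k / num_rounds)


def refine_plan(grounding: Vector, draft: Vector, num_rounds: int) -> Vector:
    """Refine a coarse draft plan for num_rounds (K) rounds.

    The grounding latent stays fixed; only the plan latent is updated.
    """
    plan = list(draft)  # copy so the caller's draft is not mutated
    for k in range(num_rounds):
        future = world_model(grounding, plan)   # probe the frozen world model
        delta = refine_net(plan, future)        # future-guided residual
        step = schedule(k, num_rounds)          # scheduled update size
        plan = [p + step * d for p, d in zip(plan, delta)]
    return plan


def decode_actions(plan: Vector, grounding: Vector, chunk_len: int) -> List[float]:
    """Toy parallel decoder: every action in the chunk depends only on the
    refined plan and the grounding, not on previously decoded actions."""
    base = sum(p * g for p, g in zip(plan, grounding))
    return [base * (t + 1) / chunk_len for t in range(chunk_len)]


if __name__ == "__main__":
    grounding = [0.2, -0.1, 0.4]   # fixed visual grounding latent (toy values)
    draft = [0.0, 0.0, 0.0]        # coarse semantic draft (toy values)
    plan = refine_plan(grounding, draft, num_rounds=4)
    print(decode_actions(plan, grounding, chunk_len=8))
```

In this sketch the iterative cost is confined to the K latent refinement rounds, while action decoding is a single pass over the chunk, which is the division of labor the abstract describes.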