Most existing vision-language-action (VLA) models for robotic manipulation lack progress awareness and typically rely on hand-crafted heuristics to decide when a task has terminated. This limitation is particularly severe in long-horizon tasks involving cascaded sub-goals. In this work, we investigate how to estimate task progress and integrate it into a VLA, and we propose a novel model named \textbf{ProgressVLA}. Our technical contributions are twofold. (1) \emph{Robust progress estimation}: we pre-train a progress estimator on large-scale, unsupervised video-text robotic datasets. This estimator achieves a low prediction residual (0.07 on a scale of $[0, 1]$) in simulation and demonstrates zero-shot generalization to unseen real-world samples. (2) \emph{Differentiable progress guidance}: we introduce an inverse dynamics world model that maps predicted action tokens into future latent visual states. These latents are then processed by the progress estimator; by applying a maximal progress regularization, we establish a differentiable pipeline that provides progress-piloted guidance to refine action tokens. Extensive experiments on the CALVIN and LIBERO benchmarks, alongside real-world robot deployment, consistently demonstrate substantial improvements in success rate and generalization over strong baselines.
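
The guidance step in (2) can be sketched as follows. This is an illustrative formulation under assumed notation, not necessarily the exact objective used in the paper: $a$ denotes the predicted action tokens, $z_t$ the current latent visual state, $\ell$ the language instruction, $W$ the world model, $P$ the progress estimator with output in $[0, 1]$, and $\eta > 0$ a hypothetical step size. All of these symbols are assumptions introduced here for illustration.
\[
\hat{z}_{t+1} = W(z_t, a), \qquad
\mathcal{L}_{\mathrm{prog}}(a) = 1 - P(\hat{z}_{t+1}, \ell), \qquad
a' = a - \eta \, \nabla_{a} \mathcal{L}_{\mathrm{prog}}(a).
\]
Because $W$ and $P$ are both differentiable, the gradient of the predicted progress flows back through the predicted latent $\hat{z}_{t+1}$ to the action tokens, so minimizing $\mathcal{L}_{\mathrm{prog}}$ moves $a$ toward actions with higher predicted progress. Whether such a term is applied as a training regularizer, as an inference-time refinement, or both is not specified in the text above.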