Although large vision-language-action (VLA) models and generative world models (WMs) have advanced long-horizon embodied intelligence, their practical deployment is still limited by uncertainty in learning-based action generation. Low-quality actions can cause physical failures during execution, or produce misleading world-model rollouts that incur redundant rendering costs. To address this issue, we propose Pre-VLA, a unified runtime verification architecture that assesses action validity preemptively, before physical execution or world-model imagination. Pre-VLA uses an efficient multimodal backbone with modality-aware pooling and a lightweight dual-branch head to predict both a safety confidence and a critic-derived advantage score for each candidate action chunk. To handle severe class imbalance and unstable boundary decisions, we train Pre-VLA with a multi-task objective that combines focal-loss classification, advantage regression, and soft-threshold calibration. During deployment, a dual-mode preemptive resampling scheduler filters out low-quality actions and triggers adaptive resampling under a limited computation budget. Experiments on the LIBERO benchmark show that Pre-VLA raises the average closed-loop success rate of RynnVLA-002 across four suites from 30.79% to 37.62%, reduces the number of task execution steps, requires 183.9 ms of forward verification time per action chunk on average, and mitigates error accumulation in world-model rollouts.
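To make the deployment-time filtering concrete, the following is a minimal sketch of a budgeted verify-then-resample loop, not the authors' implementation. The names `propose`, `verify`, and `preemptive_resample`, the two thresholds, the default budget, and the fallback rule are all hypothetical; the abstract does not specify them, and the sketch does not model the scheduler's two modes.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

Action = Sequence[float]     # one flattened action chunk (hypothetical representation)
Score = Tuple[float, float]  # (safety confidence in [0, 1], advantage estimate)


@dataclass
class ResampleResult:
    chunk: Action
    safety: float
    advantage: float
    attempts: int   # number of verifier calls spent
    accepted: bool  # False: budget exhausted, best-scoring candidate returned


def preemptive_resample(
    propose: Callable[[], Action],     # stand-in for sampling a chunk from the policy
    verify: Callable[[Action], Score], # stand-in for the verifier's dual-branch head
    safety_threshold: float = 0.5,     # assumed value, not from the paper
    advantage_threshold: float = 0.0,  # assumed value, not from the paper
    budget: int = 4,                   # assumed cap on verifier calls per step
) -> ResampleResult:
    """Check candidates before execution; resample until one passes or the budget ends."""
    if budget < 1:
        raise ValueError("budget must be at least 1")

    best: Optional[ResampleResult] = None
    for attempt in range(1, budget + 1):
        chunk = propose()
        safety, advantage = verify(chunk)

        # Accept the first candidate that clears both thresholds.
        if safety >= safety_threshold and advantage >= advantage_threshold:
            return ResampleResult(chunk, safety, advantage, attempt, True)

        # Otherwise remember the best candidate so far, ranked by safety and
        # then by advantage (an assumed tie-breaking rule).
        if best is None or (safety, advantage) > (best.safety, best.advantage):
            best = ResampleResult(chunk, safety, advantage, attempt, False)

    assert best is not None
    best.attempts = budget  # the whole budget was consumed
    return best
```

In this sketch, a caller would execute `result.chunk` (or pass it to the world model) when `result.accepted` is true, and could otherwise choose between executing the fallback chunk and halting. That choice is a design decision the abstract leaves open.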