Vision-Language-Action (VLA) models enable instruction-following manipulation, yet dual-arm deployment remains unsafe due to under-modeled self-collisions between arms and grasped objects. We introduce CoFreeVLA, which augments an end-to-end VLA with a short-horizon self-collision risk estimator that predicts collision likelihood from proprioception, visual embeddings, and planned actions. The estimator gates risky commands, recovers to safe states via risk-guided adjustments, and shapes policy refinement toward safer rollouts. It is pre-trained with model-based collision labels and post-trained on real-robot rollouts for calibration. On five bimanual tasks with the PiPER robot arm, CoFreeVLA reduces self-collisions and improves success rates versus RDT and APEX.
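The risk-gating step described above — predicting a collision likelihood for a planned action and intervening when it is too high — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the estimator interface, threshold value, and damping heuristic are all hypothetical assumptions.

```python
import numpy as np

def risk_gate(planned_action, risk_estimator, state, threshold=0.5):
    """Gate a planned action when predicted self-collision risk is high.

    `risk_estimator` is a hypothetical callable mapping (state, action)
    to a collision probability in [0, 1]; names are illustrative only.
    """
    risk = risk_estimator(state, planned_action)
    if risk < threshold:
        # Low predicted risk: execute the policy's action unchanged.
        return np.asarray(planned_action)
    # High predicted risk: damp the command toward a hold action,
    # a stand-in for the paper's risk-guided recovery adjustment.
    scale = max(0.0, 1.0 - risk)
    return scale * np.asarray(planned_action)
```

In practice the estimator would score a short horizon of future actions rather than a single step, and the intervention would steer toward a verified safe state instead of simply scaling the command.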