Vision-Language-Action (VLA) models enable instruction-following manipulation, yet dual-arm deployment remains unsafe due to under-modeled self-collisions between the arms and grasped objects. We introduce CoFreeVLA, which augments an end-to-end VLA with a short-horizon self-collision risk estimator that predicts collision likelihood from proprioception, visual embeddings, and planned actions. The estimator gates risky commands, recovers to safe states via risk-guided adjustments, and shapes policy refinement toward safer rollouts. It is pre-trained with model-based collision labels and post-trained on real-robot rollouts for calibration. On five bimanual tasks with PiPER robot arms, CoFreeVLA reduces self-collisions and improves success rates over RDT and APEX.
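The gate-and-recover behavior described above can be sketched as follows. This is a minimal illustration only: the estimator interface, the fixed threshold, and the finite-difference recovery search are all assumptions for exposition, not the authors' actual implementation.

```python
import numpy as np

RISK_THRESHOLD = 0.5  # assumed calibrated risk threshold (illustrative)

def gate_action(risk_estimator, proprio, visual_emb, planned_action):
    """Return the planned action if predicted self-collision risk is low;
    otherwise search nearby actions for a risk-guided recovery adjustment.

    `risk_estimator(proprio, visual_emb, action) -> float in [0, 1]` is a
    hypothetical interface standing in for the learned risk model."""
    risk = risk_estimator(proprio, visual_emb, planned_action)
    if risk < RISK_THRESHOLD:
        return planned_action, risk  # command passes the gate unchanged
    # Recovery: probe small per-dimension perturbations and keep the
    # candidate with the lowest predicted risk (illustrative heuristic).
    best, best_risk = planned_action, risk
    for i in range(len(planned_action)):
        for delta in (-0.05, 0.05):
            candidate = planned_action.copy()
            candidate[i] += delta
            r = risk_estimator(proprio, visual_emb, candidate)
            if r < best_risk:
                best, best_risk = candidate, r
    return best, best_risk
```

In practice the recovery step would be driven by the estimator's gradients or a learned recovery policy rather than finite-difference probing; the sketch only shows how gating and risk-guided adjustment compose around a single action.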