To advance autonomous dexterous manipulation, we propose a hybrid control method that combines the relative advantages of a fine-tuned Vision-Language-Action (VLA) model and a diffusion model. The VLA model provides language-commanded high-level planning, which is highly generalizable, while the diffusion model handles low-level interactions, offering the precision and robustness required for specific objects and environments. By incorporating a switching signal into the training data, we enable event-based transitions between the two models for a pick-and-place task in which the target object and placement location are commanded through language. This approach is deployed on our anthropomorphic ADAPT Hand 2, a 13-DoF robotic hand that incorporates compliance through series elastic actuation, providing resilience to arbitrary interactions; this is the first use of a multi-fingered hand controlled with a VLA model. We demonstrate that this model-switching approach achieves an over 80\% success rate, compared to under 40\% when using only a VLA model, enabled by the VLA model's accurate near-object arm motion and the diffusion model's multi-modal grasping motion with error-recovery abilities.
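The event-based hand-off described above can be sketched as a small control loop. This is a minimal illustration only, assuming hypothetical policy interfaces and a learned scalar switching signal in the VLA output; the actual models, observation contents, and threshold are not specified in the abstract.

```python
# Minimal sketch of event-based switching between a high-level VLA policy
# and a low-level diffusion policy. All class names, the "switch" output,
# and the distance-based trigger are illustrative assumptions, not the
# paper's actual implementation.

class VLAPolicy:
    """Stand-in for a fine-tuned VLA model: returns an arm action plus a
    switching signal in [0, 1] (1 = hand control to the diffusion model)."""
    def act(self, observation, instruction):
        near_object = observation.get("dist_to_object", 1.0) < 0.05
        return {"action": "approach", "switch": 1.0 if near_object else 0.0}

class DiffusionPolicy:
    """Stand-in for a diffusion model handling low-level, contact-rich
    grasping with multi-modal behavior and error recovery."""
    def act(self, observation):
        return {"action": "grasp"}

def hybrid_step(state, vla, diff, observation, instruction, threshold=0.5):
    """One control step: the VLA policy runs until the switching signal
    (trained into the data) crosses the threshold, after which the
    diffusion policy takes over for the grasping phase."""
    if state["phase"] == "vla":
        out = vla.act(observation, instruction)
        if out["switch"] > threshold:
            state["phase"] = "diffusion"  # event-based transition
        return out["action"]
    return diff.act(observation)["action"]
```

A symmetric signal could trigger the hand-back to the VLA model once the grasp succeeds, yielding the full pick-and-place sequence.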