Humans have a remarkable ability to switch flexibly among their senses when interacting with the environment. Picture a chef who times the addition of ingredients and adjusts the heat according to colors, sounds, and aromas, moving seamlessly through every stage of a complex cooking process. This ability rests on a thorough understanding of task stages, because achieving the sub-goal of each stage may require different senses. To endow robots with a similar ability, we incorporate task stages, divided by sub-goals, into the imitation learning process to guide dynamic multi-sensory fusion. We propose MS-Bot, a stage-guided dynamic multi-sensory fusion method with coarse-to-fine stage understanding, which dynamically adjusts the priority of each modality based on the fine-grained state within the predicted current stage. We train a robot system equipped with visual, auditory, and tactile sensors to accomplish two challenging manipulation tasks: pouring and peg insertion with keyway. Experimental results indicate that our approach enables more effective and explainable dynamic fusion than existing methods and aligns more closely with the human fusion process.
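To make the idea of stage-guided fusion concrete, the following is a minimal sketch and not the MS-Bot implementation. It assumes that a stage predictor outputs one logit per stage and that each stage carries a score for every modality; the expected score under the predicted stage distribution is turned into fusion weights with a softmax. The function names, stage labels, and numbers are all hypothetical, and the sketch covers only coarse stage guidance, omitting the fine-grained state within a stage.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def stage_guided_fusion(features, stage_logits, stage_priors):
    """Fuse per-modality features using weights guided by the predicted stage.

    features:     dict mapping modality name -> feature vector (equal lengths)
    stage_logits: one score per task stage (assumed stage-predictor output)
    stage_priors: one dict per stage mapping modality name -> priority score
                  (hypothetical; in a real system these would be learned)
    """
    modalities = list(features)
    stage_probs = softmax(stage_logits)
    # Expected priority score of each modality under the stage distribution.
    scores = [
        sum(p * prior[m] for p, prior in zip(stage_probs, stage_priors))
        for m in modalities
    ]
    weights = dict(zip(modalities, softmax(scores)))
    dim = len(features[modalities[0]])
    # Weighted sum of the modality features, dimension by dimension.
    fused = [
        sum(weights[m] * features[m][i] for m in modalities)
        for i in range(dim)
    ]
    return fused, weights


# Illustrative values only: two hypothetical stages of a pouring task.
features = {"vision": [1.0, 0.0], "audio": [0.0, 1.0], "touch": [0.5, 0.5]}
stage_priors = [
    {"vision": 2.0, "audio": 0.0, "touch": 0.0},  # hypothetical "align" stage
    {"vision": 0.0, "audio": 2.0, "touch": 0.0},  # hypothetical "pour" stage
]
fused, weights = stage_guided_fusion(features, [5.0, 0.0], stage_priors)
```

When the stage logits favor the first stage, vision receives the largest weight; when they favor the second, audio does. This mirrors, in a simplified way, the abstract's claim that modality priority should follow the current task stage.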