Open-vocabulary mobile manipulation (OVMM) requires robots to follow language instructions, navigate, and manipulate objects while keeping their world representation up to date as the environment changes. However, most prior approaches update the world representation only at discrete points, such as navigation targets, waypoints, or the end of an action step. This leaves robots blind between updates and causes cascading failures: overlooked objects, late error detection, and delayed replanning. To address this limitation, we propose BINDER (Bridging INstant and DEliberative Reasoning), a dual-process framework that decouples strategic planning from continuous environment monitoring. Specifically, BINDER integrates a Deliberative Response Module (DRM, a multimodal LLM for task planning) with an Instant Response Module (IRM, a VideoLLM for continuous monitoring). The two modules play complementary roles: the DRM performs strategic planning with structured 3D scene updates and guides what the IRM attends to, while the IRM analyzes video streams to update memory, correct ongoing actions, and trigger replanning when necessary. Through this bidirectional coordination, the modules address the trade-off between maintaining awareness and avoiding costly updates, enabling robust adaptation under dynamic conditions. Evaluated in three real-world environments with dynamic object placement, BINDER achieves substantially higher success rates and efficiency than state-of-the-art baselines, demonstrating its effectiveness for real-world deployment.
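The bidirectional coordination described above can be illustrated with a minimal sketch. This is not BINDER's implementation: every name here (`Plan`, `Event`, `Memory`, `run_episode`, the event labels, and the `plan_fn`/`monitor_fn` callables standing in for the DRM and IRM) is a hypothetical placeholder, and the one-step-per-frame loop is a simplifying assumption made only to show the control flow.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Plan:
    steps: List[str]       # ordered actions, e.g. "navigate:counter"
    watch_for: List[str]   # cues the planner asks the monitor to attend to


@dataclass
class Event:
    kind: str              # hypothetical labels: "memory", "correct", "replan"
    detail: str


@dataclass
class Memory:
    facts: List[str] = field(default_factory=list)


# Stand-ins for the two modules (assumed interfaces, not the paper's API):
# the planner maps (instruction, memory) to a Plan; the monitor maps
# (frame, watch_for) to an optional Event.
PlanFn = Callable[[str, Memory], Plan]
MonitorFn = Callable[[str, List[str]], Optional[Event]]


def run_episode(
    plan_fn: PlanFn,
    monitor_fn: MonitorFn,
    frames: Iterable[str],
    instruction: str,
    max_replans: int = 3,
) -> Tuple[List[str], Memory]:
    """Run one plan step per frame while the monitor inspects every frame."""
    memory = Memory()
    plan = plan_fn(instruction, memory)      # deliberative: initial plan
    log: List[str] = []
    replans = 0
    step_idx = 0

    for frame in frames:
        if step_idx >= len(plan.steps):
            break                            # plan finished
        # Instant: the monitor sees every frame, guided by the planner's cues.
        event = monitor_fn(frame, plan.watch_for)
        if event is not None:
            if event.kind == "memory":
                memory.facts.append(event.detail)
            elif event.kind == "correct":
                log.append("correct:" + event.detail)
            elif event.kind == "replan" and replans < max_replans:
                # Record the observation, then hand control back to the planner.
                memory.facts.append(event.detail)
                plan = plan_fn(instruction, memory)
                replans += 1
                step_idx = 0
                log.append("replan")
                continue
        log.append(plan.steps[step_idx])     # execute the current step
        step_idx += 1

    return log, memory
```

In this sketch the planner is called only at the start and when the monitor raises a `"replan"` event, while the monitor runs on every frame. That mirrors the trade-off the abstract names: continuous awareness without a costly full update at each step.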