Mobile manipulation is a fundamental capability for general-purpose robotic agents. It requires both coordinated control of the mobile base and the manipulator and robust perception under dynamically changing viewpoints. Existing approaches face two key challenges: strong coupling between base and arm actions complicates control optimization, and perceptual attention is often poorly allocated as viewpoints shift during a task. We propose InCoM, an intent-driven perception and structured coordination framework for mobile manipulation. InCoM infers latent motion intent and uses it to dynamically reweight multi-scale perceptual features, enabling stage-adaptive allocation of perceptual attention. To support robust cross-modal perception, InCoM further incorporates a geometric-semantic structured alignment mechanism that enhances multimodal correspondence. On the control side, we design a decoupled coordinated flow matching action decoder that explicitly models coordinated base-arm action generation, alleviating the optimization difficulties caused by control coupling. Experimental results demonstrate that InCoM significantly outperforms state-of-the-art methods, achieving success rate gains of 28.2%, 26.1%, and 23.6% across three ManiSkill-HAB scenarios without privileged information. Its effectiveness is further validated in real-world mobile manipulation tasks, where InCoM consistently achieves a higher success rate than existing baselines.