Whole-body mobile manipulation is a fundamental capability for general-purpose robotic agents, requiring coordinated control of the mobile base and manipulator as well as robust perception under dynamically changing viewpoints. However, existing approaches face two key challenges: the strong coupling between base and arm actions complicates whole-body control optimization, and perceptual attention is often misallocated as viewpoints shift during mobile manipulation. We propose InCoM, an intent-driven perception and structured coordination framework for whole-body mobile manipulation. InCoM infers latent motion intent to dynamically reweight multi-scale perceptual features, enabling stage-adaptive allocation of perceptual attention. To support robust cross-modal perception, InCoM further incorporates a geometric-semantic structured alignment mechanism that strengthens multimodal correspondence. On the control side, we design a decoupled coordinated flow matching action decoder that explicitly models coordinated base-arm action generation, alleviating the optimization difficulties caused by control coupling. Without access to privileged perceptual information, InCoM outperforms state-of-the-art methods on three ManiSkill-HAB scenarios by 28.2%, 26.1%, and 23.6% in success rate, demonstrating its effectiveness for whole-body mobile manipulation.
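The abstract names a decoupled coordinated flow matching action decoder but gives no implementation detail. The sketch below illustrates the general pattern such a decoder could follow: a shared trunk conditioned on perception features, with separate velocity heads for base and arm actions, trained with a linear-interpolation flow matching objective and sampled by Euler integration. This is a minimal illustrative sketch, not InCoM's actual design; the module layout, dimensions (obs_dim, base_dim, arm_dim), and conditioning scheme are all assumptions.

```python
# Minimal sketch of a decoupled flow-matching action decoder (PyTorch).
# NOT InCoM's implementation: all names, sizes, and the shared-trunk /
# two-head layout are illustrative assumptions.
import torch
import torch.nn as nn

class DecoupledFlowDecoder(nn.Module):
    def __init__(self, obs_dim=512, base_dim=3, arm_dim=7, hidden=256):
        super().__init__()
        act_dim = base_dim + arm_dim
        # Shared trunk fuses the noisy action, the flow time t, and
        # perception features, so the two heads stay coordinated.
        self.trunk = nn.Sequential(
            nn.Linear(act_dim + 1 + obs_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
        )
        # Decoupled heads predict per-component velocities.
        self.base_head = nn.Linear(hidden, base_dim)
        self.arm_head = nn.Linear(hidden, arm_dim)

    def forward(self, x_t, t, obs):
        h = self.trunk(torch.cat([x_t, t, obs], dim=-1))
        return torch.cat([self.base_head(h), self.arm_head(h)], dim=-1)

def flow_matching_loss(model, actions, obs):
    # Linear-interpolation (rectified-flow style) flow matching:
    # x_t = (1 - t) * noise + t * action, target velocity = action - noise.
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1)
    x_t = (1 - t) * noise + t * actions
    v_target = actions - noise
    v_pred = model(x_t, t, obs)
    return ((v_pred - v_target) ** 2).mean()

@torch.no_grad()
def sample(model, obs, steps=10):
    # Euler integration of the learned velocity field from noise to action.
    x = torch.randn(obs.shape[0], 10)  # base_dim + arm_dim = 10 here
    for i in range(steps):
        t = torch.full((obs.shape[0], 1), i / steps)
        x = x + model(x, t, obs) / steps
    return x
```

One plausible reading of "decoupled coordinated" is exactly this split: each head is supervised only on its own sub-vector of the velocity target, which eases the optimization difficulty attributed to base-arm coupling, while the shared trunk still propagates coordination information between the two action streams.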