Learning Category-level Last-meter Navigation from RGB Demonstrations of a Single-instance

Precise positioning of a mobile manipulator's base is essential for the manipulation actions that follow. Most RGB-based navigation systems guarantee only coarse, meter-level accuracy, which makes them poorly suited to the precise positioning phase of mobile manipulation. This gap prevents manipulation policies from operating within the distribution of their training demonstrations, resulting in frequent execution failures. We address this gap by introducing an object-centric imitation learning framework for last-meter navigation, enabling a quadruped mobile manipulator to achieve manipulation-ready positioning using only RGB observations from its onboard cameras. Our method conditions the navigation policy on three inputs: goal images, multi-view RGB observations from the onboard cameras, and a text prompt specifying the target object. A language-driven segmentation module and a spatial score-matrix decoder then supply explicit object grounding and relative pose reasoning. Trained on real-world data from a single object instance within a category, the system generalizes to unseen object instances across diverse environments with challenging lighting and background conditions. To evaluate this generalization, we introduce two metrics: an edge-alignment metric, which uses ground-truth orientation, and an object-alignment metric, which evaluates how well the robot visually faces the target. Under these metrics, our policy achieves 73.47% success in edge alignment and 96.94% success in object alignment when positioning relative to unseen target objects. These results show that precise last-meter navigation can be achieved at the category level without depth, LiDAR, or map priors, enabling a scalable pathway toward unified mobile manipulation. Project page: https://rpm-lab-umn.github.io/category-level-last-meter-nav/
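The abstract does not give formulas for the two metrics, so the following is only a minimal sketch of how orientation-based success rates of this kind could be computed. It assumes planar poses, treats edge alignment as the yaw error against a ground-truth orientation, and treats object alignment as the angle between the robot's heading and the bearing to the object. All function names and the threshold are hypothetical, not taken from the paper.

```python
import math


def wrap_angle(a):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (a + math.pi) % (2 * math.pi) - math.pi


def edge_alignment_error(robot_yaw, gt_yaw):
    """Absolute yaw error against a ground-truth orientation (assumed definition)."""
    return abs(wrap_angle(robot_yaw - gt_yaw))


def object_alignment_error(robot_xy, robot_yaw, object_xy):
    """Angle between the robot heading and the bearing to the object (assumed definition)."""
    bearing = math.atan2(object_xy[1] - robot_xy[1], object_xy[0] - robot_xy[0])
    return abs(wrap_angle(bearing - robot_yaw))


def success_rate(errors, threshold):
    """Percentage of trials whose error is within a hypothetical threshold."""
    return 100.0 * sum(e <= threshold for e in errors) / len(errors)
```

For instance, `success_rate(errors, math.radians(10))` would report the percentage of trials within a 10 degree tolerance; that tolerance is illustrative, not a value reported by the authors.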
