Many everyday mobile manipulation tasks require precise interaction with small objects, such as grasping a knob to open a cabinet or pressing a light switch. In this paper, we develop Servoing with Vision Models (SVM), a closed-loop framework that enables a mobile manipulator to tackle such precise tasks involving the manipulation of small objects. SVM uses state-of-the-art vision foundation models to generate 3D targets for visual servoing, enabling diverse tasks in novel environments. Naively doing so fails because the end-effector occludes the target. SVM mitigates this using vision models that out-paint the end-effector, significantly improving target localization. We demonstrate that, aided by out-painting, open-vocabulary object detectors can serve as a drop-in module for SVM to seek semantic targets (e.g. knobs), and point tracking methods can help SVM reliably pursue interaction sites indicated by user clicks. We conduct a large-scale evaluation spanning 10 novel environments across 6 buildings and 72 different object instances. SVM obtains a 71% zero-shot success rate on manipulating unseen objects in novel environments in the real world, outperforming an open-loop control method by an absolute 42% and an imitation learning baseline trained on 1000+ demonstrations by an absolute 50%.