We present ForceSight, a system for text-guided mobile manipulation that predicts visual-force goals using a deep neural network. Given a single RGBD image combined with a text prompt, ForceSight determines a target end-effector pose in the camera frame (kinematic goal) and the associated forces (force goal). Together, these two components form a visual-force goal. Prior work has demonstrated that deep models outputting human-interpretable kinematic goals can enable dexterous manipulation by real robots. Forces are critical to manipulation, yet have typically been relegated to lower-level execution in these systems. When deployed on a mobile manipulator equipped with an eye-in-hand RGBD camera, ForceSight performed tasks such as precision grasps, drawer opening, and object handovers with an 81% success rate in unseen environments with object instances that differed significantly from the training data. In a separate experiment, relying exclusively on visual servoing and ignoring force goals dropped the success rate from 90% to 45%, demonstrating that force goals can significantly enhance performance. The appendix, videos, code, and trained models are available at https://force-sight.github.io/.
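As a rough illustration of what a visual-force goal bundles together, the sketch below pairs a kinematic target with a force target and checks whether both are satisfied. It is not taken from the ForceSight code: the names `VisualForceGoal` and `goal_reached`, the reduction of the pose to a 3D position, the single force vector, and the tolerance values are all assumptions made for the example.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class VisualForceGoal:
    """Hypothetical container for one predicted goal (not the authors' type)."""
    position_m: Vec3  # kinematic goal: target end-effector position, camera frame
    force_n: Vec3     # force goal: target force associated with that pose


def goal_reached(
    goal: VisualForceGoal,
    position_m: Vec3,
    force_n: Vec3,
    pos_tol_m: float = 0.01,   # assumed tolerance, not from the paper
    force_tol_n: float = 1.0,  # assumed tolerance, not from the paper
    use_force: bool = True,
) -> bool:
    """Return True when the measured state matches the goal.

    With use_force=False only the kinematic part is checked, which mimics
    a controller that relies on visual servoing alone.
    """
    if math.dist(goal.position_m, position_m) > pos_tol_m:
        return False
    if not use_force:
        return True
    return math.dist(goal.force_n, force_n) <= force_tol_n


# Example: the gripper is at the target position but is not yet pressing.
goal = VisualForceGoal(position_m=(0.0, 0.0, 0.30), force_n=(0.0, 0.0, 5.0))
vision_only = goal_reached(goal, (0.0, 0.0, 0.305), (0.0, 0.0, 0.0), use_force=False)
with_force = goal_reached(goal, (0.0, 0.0, 0.305), (0.0, 0.0, 0.0))
```

In this toy setting, the vision-only check declares success as soon as the position matches, while the combined check keeps the goal open until the measured force is also close to the target. That is one plausible reading of why dropping force goals could hurt task success, offered here only as intuition.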