World Models for General Surgical Grasping

Intelligent vision control systems for surgical robots should adapt to unknown and diverse objects while remaining robust to system disturbances. Previous methods fall short of these requirements because they rely mainly on pose estimation and feature tracking. We propose "Grasp Anything for Surgery" (GAS), a world-model-based deep reinforcement learning framework that learns a pixel-level visuomotor policy for surgical grasping, improving both generality and robustness. In particular, we propose a novel method that estimates the values and uncertainties of depth pixels in the inaccurate region of a rigid-link object, based on an empirical prior of the object's size. Both the depth and mask images of task objects are encoded into a single compact 3-channel image (size: 64x64x3) by dynamically zooming in on the mask regions, which minimizes information loss. We extensively evaluate the learned controller in simulation and on a real robot. The learned visuomotor policy handles: i) unseen objects, including 5 types of target grasping objects and a robot gripper, in unstructured real-world surgery environments, and ii) disturbances in perception and control. This is the first work to achieve a unified surgical control system that grasps diverse surgical objects using different robot grippers on real robots in complex surgery scenes (average success rate: 69%). Our system also demonstrates significant robustness across 6 conditions: background variation, target disturbance, camera pose variation, kinematic control error, image noise, and re-grasping after the gripped target object drops from the gripper. Videos and code can be found on our project page: https://linhongbin.github.io/gas/.
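
To make the dynamic-zoom encoding more concrete, the following is a minimal NumPy sketch, not the GAS implementation. It assumes one depth image and two binary masks (target object and gripper), crops a square window around the union of the masks, and resamples it to 64x64x3. The function name `encode_zoomed`, the `margin` parameter, the nearest-neighbour resampling, and the channel layout (normalized depth, target mask, gripper mask) are all illustrative assumptions; the abstract only states that depth and mask images are packed into one 64x64x3 image.

```python
import numpy as np

def encode_zoomed(depth, target_mask, gripper_mask, out_size=64, margin=0.1):
    """Hypothetical encoder: zoom on the masks, pack into out_size x out_size x 3."""
    depth = np.asarray(depth, dtype=np.float64)
    target_mask = np.asarray(target_mask, dtype=bool)
    gripper_mask = np.asarray(gripper_mask, dtype=bool)
    h, w = depth.shape
    union = target_mask | gripper_mask
    if union.any():
        rows = np.flatnonzero(union.any(axis=1))
        cols = np.flatnonzero(union.any(axis=0))
        r0, r1, c0, c1 = rows[0], rows[-1], cols[0], cols[-1]
    else:
        # Nothing segmented: fall back to the full frame.
        r0, r1, c0, c1 = 0, h - 1, 0, w - 1
    # Square window centred on the bounding box, enlarged by a margin
    # (the margin is an assumption, not taken from the paper).
    side = max(r1 - r0 + 1, c1 - c0 + 1) * (1.0 + 2.0 * margin)
    cy, cx = (r0 + r1) / 2.0, (c0 + c1) / 2.0
    # Nearest-neighbour sampling grid, clipped at the image border.
    offsets = (np.arange(out_size) + 0.5) / out_size - 0.5
    ys = np.clip(np.round(cy + offsets * side).astype(int), 0, h - 1)
    xs = np.clip(np.round(cx + offsets * side).astype(int), 0, w - 1)
    grid = np.ix_(ys, xs)
    d, t, g = depth[grid], target_mask[grid], gripper_mask[grid]
    out = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    m = t | g
    if m.any():
        # Normalise depth to [0, 1] over the masked pixels only.
        lo, hi = d[m].min(), d[m].max()
        scale = (d - lo) / (hi - lo) if hi > lo else np.zeros_like(d)
        out[..., 0] = np.where(m, np.round(255 * np.clip(scale, 0, 1)), 0).astype(np.uint8)
    out[..., 1] = t * 255  # assumed layout: channel 1 = target mask
    out[..., 2] = g * 255  # assumed layout: channel 2 = gripper mask
    return out
```

Because the crop window follows the masks, a small object that covers a fraction of a percent of the camera frame fills a much larger share of the 64x64 output, which is the intuition behind zooming to limit information loss at low resolution.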
