Visual servoing techniques guide robotic motion using visual information to accomplish manipulation tasks, which demand high precision and robustness against noise. Traditional methods often require prior knowledge and are susceptible to external disturbances. Learning-driven alternatives, while promising, frequently struggle with the scarcity of training data and fall short in generalization. To address these challenges, we propose a novel visual servoing framework, Depth-PC, which leverages simulation training and exploits the semantic and geometric information of image keypoints, enabling zero-shot transfer to real-world servo tasks. Our framework centers on a servo controller that intertwines keypoint feature queries with relative depth information. The fused features from these two modalities are then processed by a Graph Neural Network to establish geometric and semantic correspondence between keypoints and to update the robot state. Through simulation and real-world experiments, our approach demonstrates a larger convergence basin and higher accuracy than state-of-the-art methods, fulfilling the requirements of robotic servo tasks while enabling zero-shot application to real-world scenarios. Beyond these improvements, our results also substantiate the efficacy of cross-modality feature fusion for servo tasks.
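As a toy illustration of the cross-modality fusion described above, the sketch below concatenates semantic keypoint features with relative depth and runs one round of graph message passing over the keypoints. All names, weight matrices, and dimensions here are hypothetical; the paper's actual controller is a learned network, and this is only a minimal structural analogy.

```python
import numpy as np

def fuse_and_propagate(keypoint_feats, rel_depth, adjacency, W_fuse, W_msg):
    """One hypothetical fusion + message-passing step over N keypoints.

    keypoint_feats: (N, D) semantic features per keypoint
    rel_depth:      (N,)   relative depth per keypoint
    adjacency:      (N, N) keypoint graph (1 = edge, 0 = no edge)
    W_fuse:         (D+1, H) fusion projection
    W_msg:          (H, H)   message transform
    """
    # Cross-modality fusion: append depth as an extra channel, then project.
    fused = np.concatenate([keypoint_feats, rel_depth[:, None]], axis=1) @ W_fuse
    # Graph message passing: mean-aggregate neighbor features.
    deg = adjacency.sum(axis=1, keepdims=True).clip(min=1.0)
    messages = (adjacency @ fused) / deg
    # Residual update with a ReLU nonlinearity yields the new node states.
    return np.maximum(fused + messages @ W_msg, 0.0)
```

In the actual framework these operations would be trained end-to-end in simulation; the fixed random weights here only demonstrate how the two modalities enter a single node-update rule.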