Visual actionable affordance has emerged as a transformative approach in robotics, focusing on perceiving interaction regions prior to manipulation. Traditional methods rely on pixel sampling to identify successful interaction samples or on point-cloud processing for affordance mapping. However, these approaches are computationally intensive and struggle to adapt to diverse, dynamic environments. This paper introduces ManipGPT, a framework that predicts optimal interaction areas for articulated objects using a large pre-trained vision transformer (ViT). We created a dataset of 9.9k simulated and real images to bridge the sim-to-real gap and enhance real-world applicability. By fine-tuning the vision transformer on this small dataset, we significantly improved part-level affordance segmentation, adapting the model's in-context segmentation capabilities to robot manipulation scenarios. Generating part-level affordance masks and pairing them with an impedance adaptation policy enables effective manipulation across simulated and real-world environments, effectively eliminating the need for complex datasets or perception systems.
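The pipeline sketched above can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the paper's implementation: the part-level affordance mask here is a hypothetical NumPy array standing in for the fine-tuned ViT's output, the contact point is taken as the mask's highest-scoring pixel, and `impedance_force` is a generic Cartesian impedance law (F = K(x_d − x) − D·ẋ), not the specific adaptation policy from the paper.

```python
import numpy as np

def select_contact_point(affordance_mask):
    """Pick the pixel with the highest affordance score as the contact point.

    In ManipGPT the mask would come from the fine-tuned ViT; here it is
    a stand-in array supplied by the caller.
    """
    row, col = np.unravel_index(np.argmax(affordance_mask), affordance_mask.shape)
    return int(row), int(col)

def impedance_force(x_desired, x, x_dot, stiffness=200.0, damping=20.0):
    """Generic impedance control law: F = K * (x_d - x) - D * x_dot.

    Stiffness/damping gains are illustrative placeholders, not values
    from the paper's impedance adaptation policy.
    """
    return stiffness * (x_desired - x) - damping * x_dot

# Toy affordance mask with one high-scoring pixel (e.g. a handle region).
mask = np.zeros((4, 4))
mask[1, 2] = 0.9
contact = select_contact_point(mask)   # (1, 2)

# Force toward a desired end-effector position, starting at rest.
force = impedance_force(x_desired=1.0, x=0.0, x_dot=0.0)  # 200.0
```

The split mirrors the paper's structure: perception reduces to a single mask prediction, and control reduces to a compliant tracking law, so no dense point-cloud processing or sampling loop is needed at deployment time.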