Visual actionable affordance has emerged as a transformative approach in robotics, focusing on perceiving interaction areas prior to manipulation. Traditional methods rely on pixel sampling to identify successful interaction samples or on processing point clouds for affordance mapping; however, these approaches are computationally intensive and struggle to adapt to diverse and dynamic environments. This paper introduces ManipGPT, a framework that predicts optimal interaction areas for articulated objects using a large pre-trained vision transformer (ViT). We created a dataset of 9.9k simulated and real images to bridge the sim-to-real gap and enhance real-world applicability. Fine-tuning the vision transformer on this small dataset significantly improved part-level affordance segmentation, adapting the model's in-context segmentation capabilities to robot manipulation scenarios. The resulting part-level affordance masks, paired with an impedance adaptation policy, enable effective manipulation in both simulated and real-world environments, largely eliminating the need for complex datasets or perception systems.
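To make the pipeline concrete, the following is a minimal, hypothetical sketch of the inference step described above: a ViT-style encoder turns an RGB image into patch features, and a lightweight head upsamples them into a part-level affordance mask. The architecture, layer sizes, and names here are illustrative assumptions, not the paper's actual ManipGPT model.

```python
import torch
import torch.nn as nn

class AffordanceViT(nn.Module):
    """Illustrative ViT-style affordance predictor (not the paper's model)."""

    def __init__(self, img_size=224, patch=16, dim=192, depth=2, heads=3):
        super().__init__()
        self.patch = patch
        self.grid = img_size // patch                        # 14x14 patch grid
        self.embed = nn.Conv2d(3, dim, patch, stride=patch)  # patchify image
        self.pos = nn.Parameter(torch.zeros(1, self.grid ** 2, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Conv2d(dim, 1, 1)                     # per-patch mask logit

    def forward(self, x):
        b = x.shape[0]
        t = self.embed(x).flatten(2).transpose(1, 2) + self.pos  # (B, N, D)
        t = self.encoder(t)
        f = t.transpose(1, 2).reshape(b, -1, self.grid, self.grid)
        logits = self.head(f)                                # (B, 1, 14, 14)
        # Upsample patch-level logits back to image resolution.
        return nn.functional.interpolate(
            logits, scale_factor=self.patch, mode="bilinear", align_corners=False
        )

model = AffordanceViT().eval()
with torch.no_grad():
    mask = torch.sigmoid(model(torch.randn(1, 3, 224, 224)))  # affordance mask
print(mask.shape)
```

In practice the pre-trained encoder weights would be kept and only lightly fine-tuned on the small sim-plus-real dataset, which is what lets a 9.9k-image corpus suffice.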
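The impedance adaptation policy mentioned above can be sketched as a standard Cartesian impedance law, F = K(x_d − x) + D(v_d − v), with stiffness softened when contact forces grow too large. All gains, thresholds, and function names below are invented for illustration and are not the paper's actual policy.

```python
def impedance_step(x, v, x_d, v_d, K, D):
    """Commanded force for one control step: F = K*(x_d - x) + D*(v_d - v)."""
    return K * (x_d - x) + D * (v_d - v)

def adapt_stiffness(K, contact_force, f_max=10.0, decay=0.9):
    """Soften the impedance when the measured contact force exceeds f_max."""
    return K * decay if abs(contact_force) > f_max else K

K, D = 200.0, 20.0            # illustrative stiffness and damping gains
x, v = 0.0, 0.0               # current position and velocity
x_d, v_d = 0.05, 0.0          # 5 cm target displacement, zero target velocity
f = impedance_step(x, v, x_d, v_d, K, D)
K = adapt_stiffness(K, f)     # |f| = 10.0, at but not above the threshold
print(f, K)                   # 10.0 200.0
```

Pairing such a compliant controller with the predicted affordance mask means the robot only needs to know *where* to push; the impedance loop absorbs uncertainty in *how hard*.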