We introduce UniGraspTransformer, a universal Transformer-based network for dexterous robotic grasping that simplifies training while enhancing scalability and performance. Unlike prior methods such as UniDexGrasp++, which require complex, multi-step training pipelines, UniGraspTransformer follows a streamlined process: first, dedicated policy networks are trained for individual objects using reinforcement learning to generate successful grasp trajectories; then, these trajectories are distilled into a single, universal network. Our approach enables UniGraspTransformer to scale effectively, incorporating up to 12 self-attention blocks to handle thousands of objects with diverse poses. Additionally, it generalizes well to both idealized and real-world inputs, as evaluated in state-based and vision-based settings. Notably, UniGraspTransformer generates a broader range of grasping poses for objects of various shapes and orientations, resulting in more diverse grasp strategies. Experimental results demonstrate significant improvements over the state-of-the-art method, UniDexGrasp++, across various object categories, with success rate gains of 3.5%, 7.7%, and 10.1% on seen objects, unseen objects within seen categories, and completely unseen objects, respectively, in the vision-based setting. Project page: https://dexhand.github.io/UniGraspTransformer.
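To make the described architecture concrete, below is a minimal sketch of a Transformer-style policy with a stack of self-attention blocks, of the kind the distilled universal network could use. This is not the paper's implementation: the token layout (hand state, object features, previous action), dimensions, single-head attention, and pre-norm residual structure are all illustrative assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token over its feature dimension.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(x, Wq, Wk, Wv):
    # Single-head scaled dot-product self-attention (illustrative;
    # the actual network may use multi-head attention).
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(-1, keepdims=True)
    return w @ v

def transformer_block(x, params):
    # Pre-norm block: attention + MLP, each with a residual connection.
    x = x + self_attention(layer_norm(x), *params["attn"])
    W1, W2 = params["mlp"]
    x = x + np.maximum(layer_norm(x) @ W1, 0.0) @ W2  # ReLU MLP
    return x

def universal_policy(tokens, blocks, W_out):
    # tokens: (num_tokens, d) -- hypothetically, embeddings of hand
    # state, object point-cloud features, and the previous action.
    for p in blocks:
        tokens = transformer_block(tokens, p)
    return tokens.mean(0) @ W_out  # pool tokens, map to an action

# Assumed dimensions for illustration only; the abstract specifies
# only the count of up to 12 self-attention blocks.
rng = np.random.default_rng(0)
d, d_ff, act_dim, n_blocks = 32, 64, 24, 12
blocks = [{"attn": [rng.normal(0, 0.1, (d, d)) for _ in range(3)],
           "mlp": [rng.normal(0, 0.1, (d, d_ff)),
                   rng.normal(0, 0.1, (d_ff, d))]}
          for _ in range(n_blocks)]
W_out = rng.normal(0, 0.1, (d, act_dim))

obs_tokens = rng.normal(0.0, 1.0, (5, d))  # 5 hypothetical input tokens
action = universal_policy(obs_tokens, blocks, W_out)
print(action.shape)
```

In the distillation stage described above, such a network would be trained by supervised regression of its output `action` onto the actions recorded in the per-object RL trajectories.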