Robot grasping, whether handling isolated objects, cluttered items, or stacked objects, plays a critical role in industrial and service applications. However, current visual grasp detection methods based on Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) struggle to adapt across diverse grasping scenarios due to the imbalance between local and global feature extraction. In this paper, we propose a novel hybrid Mamba-Transformer approach to address these challenges. Our method improves robotic visual grasping by effectively capturing both global and local information through the integration of Vision Mamba and parallel convolutional-transformer blocks. This hybrid architecture significantly enhances adaptability, precision, and flexibility across a range of robotic tasks. To ensure a fair evaluation, we conducted extensive experiments on the Cornell, Jacquard, and OCID-Grasp datasets, which span simple to complex scenarios. Additionally, we performed both simulated and real-world robotic experiments. The results demonstrate that our method not only surpasses state-of-the-art techniques on standard grasping datasets but also delivers strong performance in both simulation and real-world robot applications.
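To make the local/global fusion idea concrete, the following is a minimal numpy sketch, not the authors' implementation: `local_branch` stands in for an (untrained) convolutional branch with a small receptive field, `global_branch` stands in for an attention- or Mamba-style branch in which every token attends to all others, and `hybrid_block` fuses the two with a residual connection. All function names and the fusion-by-summation choice are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_branch(x, kernel=3):
    """Stand-in for a convolutional branch: a sliding-window mean
    over neighboring tokens captures local structure (hypothetical,
    untrained; a real model would use learned conv filters)."""
    pad = kernel // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    return np.stack([xp[i:i + kernel].mean(axis=0)
                     for i in range(x.shape[0])])

def global_branch(x):
    """Stand-in for an attention/state-space branch: each token
    aggregates information from the whole sequence, giving a
    global receptive field (identity Q/K/V projections assumed)."""
    scores = softmax(x @ x.T / np.sqrt(x.shape[1]), axis=-1)
    return scores @ x

def hybrid_block(x):
    """Fuse local and global features with a residual connection,
    mirroring the parallel-branch idea described in the abstract."""
    return x + local_branch(x) + global_branch(x)
```

Running `hybrid_block` on a `(tokens, channels)` feature map returns a tensor of the same shape, so such blocks can be stacked like ordinary CNN or transformer layers.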