Reliable robotic grasping, especially of deformable objects such as fruits, remains a challenging task due to underactuated contact interactions with the gripper and unknown object dynamics and geometries. In this study, we propose a Transformer-based robotic grasping framework for rigid grippers that leverages tactile and visual information for safe object grasping. Specifically, the Transformer models learn physical feature embeddings from sensor feedback by performing two predefined exploratory actions (pinching and sliding) and predict the grasping outcome for a given grasping strength through a multilayer perceptron (MLP). Using these predictions, the gripper infers a safe grasping strength. Compared with convolutional recurrent networks, the Transformer models can capture long-term dependencies across image sequences and process spatial-temporal features simultaneously. We first benchmark the Transformer models on a public dataset for slip detection. Following that, we show that the Transformer models outperform a CNN+LSTM model in terms of grasping accuracy and computational efficiency. We also collect a new fruit grasping dataset and conduct online grasping experiments using the proposed framework on both seen and unseen fruits. In addition, we extend our model to objects with different shapes and demonstrate the effectiveness of our model pre-trained on our large-scale fruit dataset. Our code and dataset are publicly available on GitHub.
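The inference step described in the abstract (predict a grasping outcome for a given strength, then choose a safe strength) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper or its repository: the three outcome labels, the `toy_outcome_head` stand-in for the Transformer and MLP, the use of the embedding mean as a stiffness proxy, and the grid of candidate strengths.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical outcome labels; the abstract only says "a grasping outcome".
OUTCOMES = ("slip", "safe", "damage")


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_outcome_head(embedding: Sequence[float], strength: float) -> List[float]:
    """Stand-in for the learned Transformer + MLP head (NOT the authors' model).

    It treats the embedding mean as a made-up "stiffness" value and returns
    outcome probabilities in the order given by OUTCOMES.
    """
    stiffness = sum(embedding) / len(embedding)
    logits = [
        4.0 * (stiffness - strength),            # too weak: slip more likely
        2.0 - 8.0 * abs(strength - stiffness),   # well matched: safe more likely
        4.0 * (strength - stiffness),            # too strong: damage more likely
    ]
    return softmax(logits)


def select_safe_strength(
    embedding: Sequence[float],
    candidates: Sequence[float],
    predict: Callable[[Sequence[float], float], List[float]] = toy_outcome_head,
) -> Tuple[float, float]:
    """Sweep candidate strengths and return the one with the highest
    predicted probability of the "safe" outcome, together with that probability."""
    safe_idx = OUTCOMES.index("safe")
    best = max(candidates, key=lambda s: predict(embedding, s)[safe_idx])
    return best, predict(embedding, best)[safe_idx]


if __name__ == "__main__":
    # Hypothetical embedding, as if obtained from pinching and sliding.
    emb = [0.5, 0.5]
    strength, p_safe = select_safe_strength(emb, [0.1, 0.3, 0.5, 0.7, 0.9])
    print(f"chosen strength={strength}, P(safe)={p_safe:.3f}")
```

In a real system the `predict` argument would wrap the trained network; the grid sweep is only one simple way to turn outcome predictions into a strength choice, and the paper may use a different selection rule.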