Reliable robotic grasping, especially of deformable objects such as fruits, remains challenging because of underactuated contact interactions with the gripper and unknown object dynamics and geometries. In this study, we propose a Transformer-based robotic grasping framework for rigid grippers that leverages tactile and visual information for safe object grasping. Specifically, the Transformer models learn physical feature embeddings from sensor feedback while the gripper performs two pre-defined exploratory actions (pinching and sliding), and a multilayer perceptron (MLP) then predicts the grasping outcome for a given grasping strength. At inference time, these predictions are used to select a safe grasping strength. Compared with convolutional recurrent networks, the Transformer models can capture long-term dependencies across image sequences and process spatial and temporal features simultaneously. We first benchmark the Transformer models on a public dataset for slip detection. We then show that the Transformer models outperform a CNN+LSTM model in both grasping accuracy and computational efficiency. We also collect a new fruit grasping dataset and conduct online grasping experiments with the proposed framework on both seen and unseen fruits. Our code and dataset are publicly available on GitHub.
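The inference step described in the abstract (predict a grasping outcome for a given strength, then choose a safe strength) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration and is not taken from the released code: the three outcome labels, the `select_safe_strength` helper, the probability threshold, and the `toy_predictor`, which is a hand-written rule standing in for the trained Transformer and MLP.

```python
from typing import Callable, Optional, Sequence

# Hypothetical outcome labels; the abstract does not name the classes.
SLIP, SAFE, DAMAGE = 0, 1, 2


def select_safe_strength(
    predict_outcome: Callable[[Sequence[float], float], Sequence[float]],
    embedding: Sequence[float],
    candidates: Sequence[float],
    min_safe_prob: float = 0.5,
) -> Optional[float]:
    """Return the weakest candidate strength predicted to be safe, or None."""
    for strength in sorted(candidates):
        probs = predict_outcome(embedding, strength)
        # Accept only if SAFE is the most likely class and clears the threshold.
        if probs[SAFE] == max(probs) and probs[SAFE] >= min_safe_prob:
            return strength
    return None


def toy_predictor(embedding: Sequence[float], strength: float) -> Sequence[float]:
    """Stand-in for the learned model: a fixed rule, not a trained network.

    Treats embedding[0] as a made-up "stiffness" value for the object.
    """
    stiffness = embedding[0]
    if strength < 0.5 * stiffness:
        return [0.8, 0.15, 0.05]  # mostly SLIP
    if strength > 1.5 * stiffness:
        return [0.05, 0.15, 0.8]  # mostly DAMAGE
    return [0.1, 0.8, 0.1]        # mostly SAFE


if __name__ == "__main__":
    chosen = select_safe_strength(toy_predictor, [2.0], [0.5, 1.0, 2.0, 4.0])
    print("selected strength:", chosen)
```

Scanning candidates from weakest to strongest is one possible design choice: for deformable objects it favors the gentlest grasp that is still predicted to hold. The framework itself may select the strength differently.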