Clothes Grasping and Unfolding Based on RGB-D Semantic Segmentation

Clothes grasping and unfolding is a core step in robot-assisted dressing. Most existing works train a deep learning model on depth images of clothes to recognize suitable grasping points. These methods often use physics engines to synthesize depth images, which reduces the cost of collecting labeled real data. However, the domain gap between synthetic and real images often leads to poor performance on real data. Furthermore, these approaches often struggle when grasping points are occluded by the clothing item itself. To address these challenges, we propose a novel Bi-directional Fractal Cross Fusion Network (BiFCNet) for semantic segmentation, which recognizes graspable regions rather than individual points and thus provides more possibilities for grasping. Instead of using depth images only, we also feed RGB images with rich color features into our network, in which the Fractal Cross Fusion (FCF) module fuses RGB and depth data by considering global complex features based on fractal geometry. To reduce the cost of real data collection, we further propose a data augmentation method based on an adversarial strategy, in which color and geometric transformations process RGB and depth data simultaneously while maintaining the label correspondence. Finally, we present a pipeline for clothes grasping and unfolding from the perspective of semantic segmentation, adding a strategy that selects grasp points from the segmented regions based on clothing flatness measures while taking the grasping direction into account. We evaluate BiFCNet on the public dataset NYUDv2 and obtain performance comparable to current state-of-the-art models. We also deploy our model on a Baxter robot and run extensive grasping and unfolding experiments as part of our ablation studies, achieving an 84% success rate.
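The requirement that augmentation keep RGB, depth, and labels in correspondence can be illustrated with a minimal sketch. This is not the adversarial method proposed here: it only samples one random geometric transform and applies it identically to all three arrays, then applies a color jitter. The function name `paired_augment`, the choice of flips and 90-degree rotations, the jitter ranges, and the decision to jitter RGB only are all assumptions made for illustration.

```python
import numpy as np


def paired_augment(rgb, depth, label, rng):
    """Hypothetical helper: jointly augment an RGB-D sample and its label map.

    rgb:   float array of shape (H, W, 3), values in [0, 1]
    depth: float array of shape (H, W)
    label: int array of shape (H, W)
    """
    # Geometric part: sample the parameters once, then reuse them for
    # every array so that pixels stay aligned across modalities.
    k = int(rng.integers(0, 4))        # number of 90-degree rotations
    flip = bool(rng.integers(0, 2))    # horizontal flip or not

    def geo(a):
        a = np.rot90(a, k, axes=(0, 1))
        return a[:, ::-1] if flip else a

    rgb_g, depth_g, label_g = geo(rgb), geo(depth), geo(label)

    # Color part (assumption: applied to RGB only): random gain and bias,
    # clipped back to the valid range. Labels are never touched here.
    gain = rng.uniform(0.8, 1.2)
    bias = rng.uniform(-0.1, 0.1)
    rgb_c = np.clip(rgb_g * gain + bias, 0.0, 1.0)

    return rgb_c, depth_g.copy(), label_g.copy()


rng = np.random.default_rng(0)
rgb = rng.random((4, 6, 3))
label = rng.integers(0, 3, size=(4, 6))
depth = label.astype(float)  # toy depth tied to the label, to check alignment

rgb_a, depth_a, label_a = paired_augment(rgb, depth, label, rng)
```

Because the toy depth map is built from the label map, the two should still match pixel for pixel after augmentation, which is a quick way to check that the geometric transform was shared. An adversarial variant would choose the transform parameters to maximize the segmentation loss instead of sampling them uniformly.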
