Clothes Grasping and Unfolding Based on RGB-D Semantic Segmentation

Clothes grasping and unfolding is a core step in robot-assisted dressing. Most existing works train a deep learning model on depth images of clothes to recognize suitable grasping points. These methods often use physics engines to synthesize depth images, which reduces the cost of collecting real labeled data. However, the inherent domain gap between synthetic and real images often leads to poor performance on real data. Furthermore, these approaches often struggle when grasping points are occluded by the clothing item itself. To address these challenges, we propose a novel Bi-directional Fractal Cross Fusion Network (BiFCNet) for semantic segmentation, which recognizes graspable regions rather than individual points and thereby provides more possibilities for grasping. Instead of using depth images only, we also feed RGB images with rich color features into our network, in which the Fractal Cross Fusion (FCF) module fuses RGB and depth data by considering global complex features based on fractal geometry. To reduce the cost of real data collection, we further propose a data augmentation method based on an adversarial strategy, in which color and geometric transformations process RGB and depth data simultaneously while maintaining the label correspondence. Finally, we present a pipeline for clothes grasping and unfolding from the perspective of semantic segmentation, adding a strategy that selects grasp points from the segmented regions based on clothing flatness measures while taking the grasping direction into account. We evaluate BiFCNet on the public dataset NYUDv2 and obtain performance comparable to current state-of-the-art models. We also deploy our model on a Baxter robot and run extensive grasping and unfolding experiments as part of our ablation studies, achieving an 84% success rate.
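To illustrate what "maintaining the label correspondence" means for paired RGB-D augmentation, the following is a minimal sketch and not the method proposed here. It samples transformations at random instead of using an adversarial strategy, and it assumes that the color jitter touches only the RGB image. The function name `paired_augment`, the choice of flips and 90-degree rotations, and the jitter ranges are all hypothetical.

```python
import numpy as np

def paired_augment(rgb, depth, label, rng):
    """Apply one random geometric and one random color transform to an RGB-D sample.

    rgb:   (H, W, 3) float array with values in [0, 1]
    depth: (H, W) float array
    label: (H, W) integer class map
    rng:   numpy.random.Generator
    """
    # Geometric part: the same flip and rotation are applied to all three
    # arrays, so every pixel keeps its own depth value and class label.
    if rng.random() < 0.5:
        rgb, depth, label = rgb[:, ::-1], depth[:, ::-1], label[:, ::-1]
    k = int(rng.integers(0, 4))  # number of 90-degree rotations
    rgb = np.rot90(rgb, k, axes=(0, 1))
    depth = np.rot90(depth, k)
    label = np.rot90(label, k)

    # Color part (assumption: RGB only): brightness and contrast jitter.
    gain = rng.uniform(0.8, 1.2)
    bias = rng.uniform(-0.1, 0.1)
    rgb = np.clip(gain * rgb + bias, 0.0, 1.0)

    # Return contiguous copies rather than views of the inputs.
    return rgb.copy(), depth.copy(), label.copy()
```

Because the geometric transform is shared, a segmentation label that was aligned with a pixel before augmentation is still aligned with it afterwards, so no re-annotation is needed for the augmented sample.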
