As robots increasingly collaborate with humans in everyday tasks, it is important to develop robotic systems capable of understanding their environment. This work focuses on scene understanding to detect pick-and-place tasks given initial and final images of a scene. To this end, a dataset is collected for object detection and pick-and-place task detection. A YOLOv5 network is then trained to detect the objects in the initial and final scenes. Given the detected objects and their bounding boxes, two methods are proposed to detect the pick-and-place tasks that transform the initial scene into the final scene. The first is a geometric method, which tracks objects' movements across the two scenes and reasons over the intersections of the bounding boxes of objects that moved. The second, a CNN-based method, uses a convolutional neural network to classify pairs of objects with intersecting bounding boxes into five classes describing the spatial relationship between the involved objects. The performed pick-and-place tasks are then derived from this analysis of the two scenes. Results show that the CNN-based method, using a VGG16 backbone, outperforms the geometric method by roughly 12 percentage points in certain scenarios, achieving an overall success rate of 84.3%.
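The core checks behind the geometric method can be sketched as follows. This is a minimal illustration, not the paper's implementation: the box format (x1, y1, x2, y2), the IoU threshold, and the function names are all assumptions.

```python
def intersection_area(a, b):
    """Overlap area of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, w) * max(0, h)


def has_moved(initial_box, final_box, iou_thresh=0.5):
    """Flag an object as moved when its initial and final bounding boxes
    overlap less than iou_thresh (IoU). The threshold is an assumption."""
    inter = intersection_area(initial_box, final_box)
    area_i = (initial_box[2] - initial_box[0]) * (initial_box[3] - initial_box[1])
    area_f = (final_box[2] - final_box[0]) * (final_box[3] - final_box[1])
    union = area_i + area_f - inter
    return inter / union < iou_thresh


# Example: a cup detected in both scenes, ending up on a plate.
cup_initial = (0, 0, 10, 10)
cup_final = (50, 50, 60, 60)
plate = (48, 48, 70, 70)

moved = has_moved(cup_initial, cup_final)          # the cup has moved
touches_plate = intersection_area(cup_final, plate) > 0  # its final box meets the plate
```

A moved object whose final bounding box intersects another object's box is the cue the geometric method would use to hypothesize a pick-and-place relation between the two objects; the CNN-based method instead classifies such intersecting pairs into one of five spatial-relationship classes.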