6D object pose estimation is a crucial prerequisite for autonomous robot manipulation. The state-of-the-art models for pose estimation are based on convolutional neural networks (CNNs). Lately, the Transformer, an architecture originally proposed for natural language processing, has been achieving state-of-the-art results in many computer vision tasks as well. Equipped with the multi-head self-attention mechanism, Transformers enable simple single-stage end-to-end architectures for learning object detection and 6D object pose estimation jointly. In this work, we propose YOLOPose (short for You Only Look Once Pose estimation), a Transformer-based multi-object 6D pose estimation method based on keypoint regression, as well as an improved variant of the YOLOPose model. Instead of predicting keypoints in an image with the standard heatmaps, we regress the keypoints directly. Additionally, we employ a learnable orientation estimation module to predict the orientation from the keypoints. Together with a separate translation estimation module, this makes our model end-to-end differentiable. Our method is suitable for real-time applications and achieves results comparable to state-of-the-art methods. We analyze the role of object queries in our architecture and find that they specialize in detecting objects in specific image regions. Furthermore, we quantify the accuracy trade-off of training our model on smaller datasets.
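To make the contrast between heatmap-based and directly regressed keypoints concrete, the following toy sketch may help. It is not the authors' implementation: the function names, the argmax decoding of the heatmap, and the use of normalized coordinates in [0, 1] for the regressed output are all assumptions chosen for illustration.

```python
# Illustrative only: names, shapes, and conventions are assumptions,
# not taken from the YOLOPose code.

def decode_heatmap(heatmap):
    """Heatmap-style decoding: return the (x, y) cell with the highest score."""
    best_score, best_xy = float("-inf"), (0, 0)
    for y, row in enumerate(heatmap):
        for x, score in enumerate(row):
            if score > best_score:
                best_score, best_xy = score, (x, y)
    return best_xy


def decode_regression(normalized_xy, width, height):
    """Regression-style decoding: scale normalized (x, y) in [0, 1] to pixels."""
    nx, ny = normalized_xy
    return (nx * width, ny * height)


# A 4x4 toy heatmap whose peak lies at column 2, row 1.
heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.3, 0.9, 0.2],
    [0.0, 0.2, 0.4, 0.1],
    [0.0, 0.0, 0.1, 0.0],
]
heatmap_xy = decode_heatmap(heatmap)

# A hypothetical regressed output for the same keypoint (made-up values).
regressed_xy = decode_regression((0.625, 0.375), width=4, height=4)
```

In this sketch the heatmap estimate is tied to the grid resolution and relies on an argmax, which is not differentiable, whereas the regressed keypoint is a continuous value that can be passed on to later modules. This is one way to read the end-to-end differentiability claimed in the abstract.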