In this paper, we propose a modular framework for 6D pose estimation based on keypoint heatmap regression. Our approach combines YOLOv10m for object detection with a ResNet18-based network that predicts 2D heatmaps from RGB images. Keypoints extracted from these heatmaps are then used to estimate the 6D object pose with a RANSAC-based PnP solver. We compare different keypoint selection strategies to assess their impact on pose accuracy. We also extend the baseline with depth data through a cross-fusion architecture, which lets RGB and depth features interact at multiple stages. Finally, we explore general training improvements, including alternative activation functions and learning rate scheduling strategies. On the LINEMOD dataset, our best RGB-only model achieves a mean ADD-based accuracy of 84.50%, while the RGB-D fusion model reaches 92.41%. The code is available at https://github.com/ameermasood/HeatNet.
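To make two steps of the pipeline concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It shows how a keypoint location could be read off a heatmap as its peak, and how an ADD-style error (the mean distance between model points transformed by the ground-truth pose and by the predicted pose) could be computed. The function names, the toy data, and the acceptance threshold of 10% of the object diameter are assumptions made for illustration; the threshold is a common convention for ADD-based accuracy and is not stated in the abstract.

```python
import math


def heatmap_peak(heatmap):
    """Return (x, y) of the largest value in a 2D heatmap given as a list of rows."""
    best_val, best_xy = -math.inf, (0, 0)
    for y, row in enumerate(heatmap):
        for x, val in enumerate(row):
            if val > best_val:
                best_val, best_xy = val, (x, y)
    return best_xy


def transform(R, t, p):
    """Apply a 3x3 rotation R (nested lists) and a translation t to a 3D point p."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))


def add_error(model_points, R_gt, t_gt, R_pred, t_pred):
    """Mean Euclidean distance between model points under the two poses."""
    total = 0.0
    for p in model_points:
        total += math.dist(transform(R_gt, t_gt, p), transform(R_pred, t_pred, p))
    return total / len(model_points)


def pose_is_correct(error, diameter, ratio=0.1):
    """Assumed acceptance rule: error below `ratio` times the object diameter."""
    return error < ratio * diameter


# Toy heatmap with a single clear peak at column 1, row 1.
toy_heatmap = [
    [0.0, 0.1, 0.0],
    [0.2, 0.9, 0.3],
    [0.0, 0.1, 0.0],
]
peak = heatmap_peak(toy_heatmap)

# Toy object: three model points (metres); the predicted pose is the
# ground-truth pose shifted by 3 mm along the x axis.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
points = [(0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (0.0, 0.0, 0.1)]
err = add_error(points, identity, (0.0, 0.0, 0.5), identity, (0.003, 0.0, 0.5))
accepted = pose_is_correct(err, diameter=0.1)
```

In a full pipeline, the 2D peaks for all keypoints would be paired with their known 3D model coordinates and passed to a PnP solver inside a RANSAC loop to recover the pose; that step is omitted here because it depends on camera intrinsics and a solver library.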