State-of-the-art 6D pose estimation methods currently depend on complex intermediate correspondences, specialized architectures, and non-end-to-end algorithms. In contrast, we reframe the problem as a straightforward regression task and explore the capabilities of Vision Transformers for direct 6D pose estimation through a tailored use of classification tokens. We also introduce a simple method for determining pose confidence, which can be readily integrated into most 6D pose estimation frameworks. This involves modifying the transformer architecture by decreasing the number of query elements based on the network's assessment of the scene complexity. Our method, which we call Pose Vision Transformer (PViT-6D), is simple to implement and end-to-end learnable, and it outperforms current state-of-the-art methods by +0.3% ADD(-S) on Linemod-Occlusion and +2.7% ADD(-S) on the YCB-V dataset. Moreover, our method enhances both the model's interpretability and the reliability of its performance during inference.
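For readers unfamiliar with the ADD(-S) metric quoted above, the following minimal sketch shows how it is commonly computed: ADD averages the distance between corresponding model points under the ground-truth and predicted poses, while ADD-S (used for symmetric objects) averages the distance to the nearest predicted point. The function names, the toy four-point model, and the 10% diameter threshold are assumptions chosen for illustration and are not taken from the PViT-6D evaluation code.

```python
import math

def transform(R, t, points):
    """Apply the rigid transform x -> R x + t to a list of 3D points."""
    return [
        tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
        for p in points
    ]

def add_error(R_gt, t_gt, R_pred, t_pred, points):
    """ADD: mean distance between corresponding points under both poses."""
    gt = transform(R_gt, t_gt, points)
    pred = transform(R_pred, t_pred, points)
    return sum(math.dist(a, b) for a, b in zip(gt, pred)) / len(points)

def adds_error(R_gt, t_gt, R_pred, t_pred, points):
    """ADD-S: mean distance from each ground-truth point to its nearest predicted point."""
    gt = transform(R_gt, t_gt, points)
    pred = transform(R_pred, t_pred, points)
    return sum(min(math.dist(a, b) for b in pred) for a in gt) / len(points)

def is_correct(error, diameter, threshold=0.1):
    """A pose counts as correct if the error is below a fraction of the object
    diameter (10% is a common choice, assumed here)."""
    return error < threshold * diameter

# Hypothetical toy data: a four-point model that is symmetric under a
# 180-degree rotation about the z-axis.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ROT_Z_180 = [[-1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 1.0]]
ZERO = [0.0, 0.0, 0.0]
MODEL = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]

add_val = add_error(IDENTITY, ZERO, ROT_Z_180, ZERO, MODEL)
adds_val = adds_error(IDENTITY, ZERO, ROT_Z_180, ZERO, MODEL)
print(add_val, adds_val)
```

In this toy case the predicted pose differs from the ground truth only by a rotation that maps the model onto itself, so ADD penalizes the prediction while ADD-S does not. This is why benchmarks typically report ADD for asymmetric objects and ADD-S for symmetric ones, written jointly as ADD(-S).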