Typical template-based object pose pipelines estimate the pose by retrieving the closest matching template and aligning it with the observed image. However, failure to retrieve the correct template often leads to inaccurate pose predictions. To address this, we reformulate template-based object pose estimation as a ray alignment problem, in which viewing directions from multiple posed template images are learned and aligned with an unposed query image. Inspired by recent progress in diffusion-based camera pose estimation, we embed this formulation into a diffusion transformer architecture that aligns a query image with a set of posed templates. We reparameterize object rotation using object-centered camera rays and model object translation by extending scale-invariant translation estimation to dense translation offsets. Our model leverages geometric priors from the templates to guide accurate query pose inference. A coarse-to-fine training strategy based on narrowed template sampling improves performance without modifying the network architecture. Extensive experiments across multiple benchmark datasets show that our method achieves results competitive with state-of-the-art approaches in unseen object pose estimation.
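As a rough illustration of the ray-based rotation reparameterization mentioned above, the sketch below shows one common way (in the spirit of ray-bundle camera parameterizations such as RayDiffusion) of mapping a rotation to per-pixel viewing directions and recovering it by solving an orthogonal Procrustes problem. The function names, the NumPy implementation, the per-pixel (rather than per-patch) granularity, and the intrinsics `K` are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def rotation_to_rays(R, K, pixels):
    # Assumed parameterization: each pixel u_i gets a unit ray direction
    # d_i ∝ R^T K^{-1} [u_i, v_i, 1]^T expressed in the object-centered frame.
    homo = np.concatenate([pixels, np.ones((len(pixels), 1))], axis=1)
    dirs_cam = (np.linalg.inv(K) @ homo.T).T      # rays in the camera frame
    dirs_obj = dirs_cam @ R                        # row-wise R^T * d
    return dirs_obj / np.linalg.norm(dirs_obj, axis=1, keepdims=True)

def rays_to_rotation(rays_pred, K, pixels):
    # Recover R from (possibly noisy) predicted ray directions via
    # orthogonal Procrustes: minimize ||dirs_cam @ R - rays_pred||_F over SO(3).
    homo = np.concatenate([pixels, np.ones((len(pixels), 1))], axis=1)
    dirs_cam = (np.linalg.inv(K) @ homo.T).T
    dirs_cam /= np.linalg.norm(dirs_cam, axis=1, keepdims=True)
    rays = rays_pred / np.linalg.norm(rays_pred, axis=1, keepdims=True)
    M = dirs_cam.T @ rays
    U, _, Vt = np.linalg.svd(M)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # keep det(R) = +1
    return U @ S @ Vt

# Sanity check: a rotation should be recovered exactly from its own rays.
rng = np.random.default_rng(0)
R_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(R_true) < 0:
    R_true[:, 0] *= -1
K = np.array([[500.0, 0.0, 64.0], [0.0, 500.0, 64.0], [0.0, 0.0, 1.0]])
uv = np.stack(np.meshgrid(np.arange(0, 128, 16),
                          np.arange(0, 128, 16)), -1).reshape(-1, 2).astype(float)
rays = rotation_to_rays(R_true, K, uv)
assert np.allclose(rays_to_rotation(rays, K, uv), R_true, atol=1e-6)
```

The sketch only covers the rotation component; the dense scale-invariant translation offsets and the diffusion transformer that denoises the query rays conditioned on the posed templates are beyond this minimal example.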