This paper addresses the problem of category-level pose estimation. Current state-of-the-art methods for this task struggle with symmetric objects and with generalizing to new environments when trained only on synthetic data. We tackle both issues with a probabilistic model that relies on diffusion to estimate dense canonical maps, which are crucial for recovering partial object shapes and for establishing the correspondences needed for pose estimation. We further introduce components that improve performance by combining the strengths of diffusion models with multi-modal input representations. We demonstrate the effectiveness of our method by testing it on a range of real datasets. Despite being trained solely on our generated synthetic data, our approach achieves state-of-the-art performance and unprecedented generalization, outperforming baselines, including those trained specifically on the target domain.
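To make the role of the correspondences concrete, the following is a minimal sketch of how a pose could be recovered once a dense canonical map is available. It assumes that each observed 3D point (in the camera frame) has been paired with a predicted canonical coordinate, and it uses a closed-form Umeyama similarity alignment, a common choice in category-level pipelines. The abstract does not state which solver the method uses, so the function name `umeyama_similarity`, its arguments, and the choice of solver are illustrative assumptions, not the authors' implementation. The sketch also does not handle symmetric objects, for which several canonical maps may be equally valid.

```python
import numpy as np


def umeyama_similarity(src, dst):
    """Least-squares similarity transform such that dst ~= s * R @ src + t.

    Hypothetical helper for illustration only.
    src: (N, 3) canonical coordinates predicted for each observed point.
    dst: (N, 3) corresponding observed points in the camera frame.
    Returns scale s (float), rotation R (3, 3), translation t (3,).
    """
    n = src.shape[0]
    mu_src = src.mean(axis=0)
    mu_dst = dst.mean(axis=0)
    src_c = src - mu_src
    dst_c = dst - mu_dst

    # Cross-covariance between the centered target and source point sets.
    cov = dst_c.T @ src_c / n
    U, D, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / n
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t


if __name__ == "__main__":
    # Synthetic check: apply a known similarity transform and recover it.
    rng = np.random.default_rng(0)
    canonical = rng.uniform(-0.5, 0.5, size=(200, 3))
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1.0  # make Q a proper rotation
    observed = 0.3 * canonical @ Q.T + np.array([0.1, -0.2, 0.8])
    s, R, t = umeyama_similarity(canonical, observed)
    print("scale:", s)
    print("translation:", t)
```

In practice the predicted correspondences are noisy, so a solver of this kind is usually wrapped in an outlier-rejection loop such as RANSAC; that step is omitted here to keep the sketch short.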