6D object pose estimation, which predicts the rigid transformation (3D rotation and 3D translation) of an object relative to the camera, remains challenging for unseen objects. Existing approaches typically rely on explicitly constructing feature correspondences between the query image and either the object model or template images. In this work, we propose PoseGAM, a geometry-aware multi-view framework that predicts object pose directly from a query image and multiple template images, eliminating the need for explicit matching. Built upon recent multi-view foundation model architectures, the method integrates object geometry through two complementary mechanisms: explicit point-based geometry and learned features from geometry representation networks. In addition, we construct a large-scale synthetic dataset containing more than 190k objects under diverse environmental conditions to improve robustness and generalization. Extensive evaluations across multiple benchmarks demonstrate state-of-the-art performance, with an average AR improvement of 5.1% over prior methods and gains of up to 17.6% on individual datasets, indicating strong generalization to unseen objects. Project page: https://windvchen.github.io/PoseGAM/