Transformer-based visual geometry frameworks, most recently the Visual Geometry Grounded Transformer (VGGT), have shown strong performance in camera pose estimation, 3D reconstruction, and scene understanding. However, these models typically rely on ground-truth labels for training, which poses challenges when adapting them to unlabeled and unseen scenes. In this paper, we propose a self-supervised framework that trains VGGT on unlabeled data, thereby enhancing its localization capability in large-scale environments. To this end, we extend conventional pair-wise relations to sequence-wise geometric constraints for self-supervised learning. Specifically, within each sequence, we sample multiple source frames and geometrically project them onto different target frames, which improves temporal feature consistency. We formulate photometric consistency and geometric constraints as a joint optimization loss, circumventing the need for hard labels. Trained with this objective, not only the local and global cross-view attention layers but also the camera and depth heads effectively capture the underlying multi-view geometry. Experiments demonstrate that the model converges within hundreds of iterations and achieves significant improvements in large-scale localization. Our code will be released at https://github.com/X-yangfan/GPA-VGGT.
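The core self-supervision signal described above is photometric consistency: pixels of a target frame are back-projected with the predicted depth, transformed by the predicted relative pose, and compared against the corresponding pixels of a source frame. The following is a minimal NumPy sketch of this pair-wise idea only, under assumed conventions (pinhole intrinsics `K`, a 4x4 target-to-source transform `T_ts`, nearest-neighbour sampling); all function names are illustrative, and the paper's actual method uses sequence-wise constraints and a differentiable warping inside the training loop rather than this simplified version.

```python
import numpy as np

def project_to_source(depth_t, K, T_ts):
    """Back-project target pixels using depth, move them into the source
    camera frame with T_ts, and project with intrinsics K.
    Returns source-image pixel coordinates of shape (H, W, 2)."""
    H, W = depth_t.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)  # (H, W, 3)
    rays = pix @ np.linalg.inv(K).T                  # normalized camera rays
    pts_t = rays * depth_t[..., None]                # 3D points in target frame
    pts_h = np.concatenate([pts_t, np.ones((H, W, 1))], axis=-1)
    pts_s = pts_h @ T_ts.T                           # 3D points in source frame
    proj = pts_s[..., :3] @ K.T                      # perspective projection
    return proj[..., :2] / proj[..., 2:3]

def photometric_loss(img_t, img_s, depth_t, K, T_ts):
    """Warp the source image to the target view (nearest-neighbour sampling
    for simplicity) and return the mean absolute error over valid pixels."""
    coords = project_to_source(depth_t, K, T_ts)
    H, W = depth_t.shape
    u = np.round(coords[..., 0]).astype(int)
    v = np.round(coords[..., 1]).astype(int)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)  # mask out-of-bounds reprojections
    warped = np.zeros_like(img_t)
    warped[valid] = img_s[v[valid], u[valid]]
    return np.abs(warped - img_t)[valid].mean()
```

With a perfect depth map and relative pose, the warped source image matches the target image and the loss vanishes; gradients of this residual with respect to the depth and camera heads are what replace ground-truth supervision.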