Visual Geometry Grounded Transformer (VGGT) has attracted considerable attention in a short time, not least because of its Best Paper Award at CVPR 2025. Like DUSt3R and MASt3R, VGGT aims to bring about a paradigm shift by replacing established methods such as bundle adjustment and feature matching with a simple, unified, feed-forward neural network that predicts camera poses, depth maps, and dense 3D structure directly from multiple images of a scene within a few seconds. A key aspect is its ability to process an arbitrary number of views consistently in a single forward pass, without any post-processing or iterative optimization. For photogrammetry, this opens new possibilities for real-time, scalable, and accessible 3D reconstruction. In this context, high reconstruction accuracy alone is not enough; high-quality uncertainty estimates are equally crucial, because they foster trust and enable robust quality assurance. This paper therefore investigates the quality of VGGT's uncertainty predictions. The analysis identifies an effective confidence threshold for filtering VGGT's raw output and demonstrates that improving the quality of the uncertainty estimates holds strong potential for improving the accuracy of its 3D reconstructions.
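As an illustration of the confidence-based filtering mentioned above, the following is a minimal sketch and not the procedure used in the paper. It assumes the network output is already available as a NumPy array of 3D points with one confidence value per point; the function name `filter_by_confidence` and the threshold value in the usage lines are hypothetical, and the effective threshold identified by the analysis is not stated in this abstract.

```python
import numpy as np


def filter_by_confidence(points, confidence, threshold):
    """Keep only the 3D points whose confidence is at least `threshold`.

    points:     array-like of shape (..., 3), e.g. a per-pixel point map
    confidence: array-like with one value per point
    threshold:  scalar cut-off (hypothetical; to be chosen by the user)
    Returns the kept points as an (M, 3) array and the boolean mask.
    """
    # Flatten both inputs so every point lines up with its confidence value.
    points = np.asarray(points, dtype=float).reshape(-1, 3)
    confidence = np.asarray(confidence, dtype=float).reshape(-1)
    if points.shape[0] != confidence.shape[0]:
        raise ValueError("points and confidence must have the same number of entries")
    mask = confidence >= threshold
    return points[mask], mask


# Usage with made-up values: three points, threshold chosen arbitrarily.
example_points = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
example_conf = [0.5, 2.0, 3.5]
kept, mask = filter_by_confidence(example_points, example_conf, threshold=2.0)
```

Returning the mask together with the filtered points allows the same selection to be applied to other per-point quantities such as colors or depth values.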