Recent feed-forward reconstruction models, such as VGGT, have proven competitive with traditional optimization-based reconstructors while also providing geometry-aware features useful for other tasks. Here, we show that the quality of these models scales predictably with model and data size. We do so by introducing VGGT-$Ω$, which substantially improves reconstruction accuracy, efficiency, and capabilities for both static and dynamic scenes. To enable training this model at an unprecedented scale, we introduce architectural changes that improve training efficiency, a high-quality data annotation pipeline that supports dynamic scenes, and a self-supervised learning protocol. We simplify VGGT's architecture by using a single dense prediction head with multi-task supervision and removing the expensive high-resolution convolutional layers. We also use registers to aggregate scene information into a compact representation and introduce register attention, which restricts inter-frame information exchange to these registers, in part replacing global attention. In this way, during training, VGGT-$Ω$ uses only about 30% of the GPU memory of its predecessor, allowing us to train with 15x more supervised data than prior work and to leverage vast amounts of unlabeled video data. VGGT-$Ω$ achieves strong results for reconstruction of static and dynamic scenes across multiple benchmarks, for example, improving over the previous best camera estimation accuracy on Sintel by 77%. We also show that the learned registers can improve vision-language-action models and support alignment with language, suggesting that reconstruction can be a powerful and scalable proxy task for spatial understanding. Project Page: http://vggt-omega.github.io/
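The abstract describes register attention only at a high level: inter-frame information exchange is restricted to the registers. One possible reading can be sketched as an attention mask. This is an illustrative assumption, not the authors' implementation. The token layout (registers first within each frame), the rule that patch tokens attend only within their own frame, and the names `register_attention_mask` and `allowed_fraction` are all hypothetical.

```python
def register_attention_mask(num_frames, num_registers, num_patches):
    """Return mask[i][j] == True if token i may attend to token j.

    Assumed layout: each frame contributes `num_registers` register tokens
    followed by `num_patches` patch tokens, and frames are concatenated.
    """
    tokens_per_frame = num_registers + num_patches
    total = num_frames * tokens_per_frame

    def frame_of(i):
        return i // tokens_per_frame

    def is_register(i):
        return i % tokens_per_frame < num_registers

    # Allowed pairs: any two tokens of the same frame, plus
    # register-to-register pairs across frames (the only cross-frame path).
    return [
        [frame_of(i) == frame_of(j) or (is_register(i) and is_register(j))
         for j in range(total)]
        for i in range(total)
    ]


def allowed_fraction(mask):
    """Fraction of token pairs the mask keeps, relative to global attention."""
    return sum(sum(row) for row in mask) / (len(mask) ** 2)


# Toy sizes chosen only for illustration.
mask = register_attention_mask(num_frames=3, num_registers=2, num_patches=4)
```

With these toy sizes the mask keeps 132 of the 324 token pairs that global attention would compute. The count only illustrates how such a mask prunes cross-frame pairs and is unrelated to the memory figures reported above.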