VGGT-$Ω$

Jianyuan Wang, Minghao Chen, Shangzhan Zhang, Nikita Karaev, Johannes Schönberger, Patrick Labatut, Piotr Bojanowski, David Novotny, Andrea Vedaldi, Christian Rupprecht

From arXiv, CVPR 2026 (Oral)

Recent feed-forward reconstruction models, such as VGGT, have proven competitive with traditional optimization-based reconstructors while also providing geometry-aware features useful for other tasks. Here, we show that the quality of these models scales predictably with model and data size. We do so by introducing VGGT-$Ω$, which substantially improves reconstruction accuracy, efficiency, and capabilities for both static and dynamic scenes. To enable training this model at an unprecedented scale, we introduce architectural changes that improve training efficiency, a high-quality data annotation pipeline that supports dynamic scenes, and a self-supervised learning protocol. We simplify VGGT's architecture by using a single dense prediction head with multi-task supervision and removing the expensive high-resolution convolutional layers. We also use registers to aggregate scene information into a compact representation and introduce register attention, which restricts inter-frame information exchange to these registers, in part replacing global attention. In this way, during training, VGGT-$Ω$ uses only about 30% of the GPU memory of its predecessor, allowing us to train with 15x more supervised data than prior work and to leverage vast amounts of unlabeled video data. VGGT-$Ω$ achieves strong results for reconstruction of static and dynamic scenes across multiple benchmarks, for example, improving over the previous best camera estimation accuracy on Sintel by 77%. We also show that the learned registers can improve vision-language-action models and support alignment with language, suggesting that reconstruction can be a powerful and scalable proxy task for spatial understanding. Project Page: http://vggt-omega.github.io/
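The register attention idea in the abstract can be illustrated with a toy attention mask. The sketch below is only one possible reading of the sentence "restricts inter-frame information exchange to these registers", not the authors' implementation. The token layout (each frame's registers placed before its patch tokens), the function name `register_attention_mask`, and all sizes are assumptions made for illustration. The abstract also states that register attention replaces global attention only in part, so a real model would presumably still contain some global attention layers.

```python
from typing import List


def register_attention_mask(
    num_frames: int, patches_per_frame: int, registers_per_frame: int
) -> List[List[bool]]:
    """Return mask[q][k] == True if query token q may attend to key token k.

    Assumed layout (hypothetical): frames are concatenated, and each frame
    holds its register tokens first, followed by its patch tokens.
    """
    tokens_per_frame = registers_per_frame + patches_per_frame
    total = num_frames * tokens_per_frame

    def frame_of(i: int) -> int:
        return i // tokens_per_frame

    def is_register(i: int) -> bool:
        return i % tokens_per_frame < registers_per_frame

    mask = []
    for q in range(total):
        row = []
        for k in range(total):
            same_frame = frame_of(q) == frame_of(k)
            # Assumption: the only cross-frame links are register-to-register.
            both_registers = is_register(q) and is_register(k)
            row.append(same_frame or both_registers)
        mask.append(row)
    return mask


def count_allowed(mask: List[List[bool]]) -> int:
    """Number of (query, key) pairs the mask permits."""
    return sum(sum(row) for row in mask)


# Toy sizes: 3 frames, 4 patch tokens and 2 register tokens per frame.
mask = register_attention_mask(num_frames=3, patches_per_frame=4, registers_per_frame=2)
print(count_allowed(mask), "allowed pairs out of", len(mask) ** 2)
```

In this toy setting a patch token sees only its own frame, while register tokens additionally see the registers of every other frame, so information crosses frames only through the registers. The number of permitted pairs is smaller than under full global attention, but this count says nothing about the memory figure reported in the abstract.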
