High-resolution imagery is essential for accurate 3D reconstruction, as many geometric details only emerge at fine spatial scales. Recent feed-forward approaches, such as the Visual Geometry Grounded Transformer (VGGT), have demonstrated the ability to infer scene geometry from large collections of images in a single forward pass. However, scaling these models to high-resolution inputs remains challenging: the number of tokens in transformer architectures grows rapidly with both image resolution and the number of views, leading to prohibitive computational and memory costs. Moreover, we observe that visually ambiguous regions, such as repetitive patterns, weak textures, or specular surfaces, often produce unstable feature tokens that degrade geometric inference, especially at higher resolutions. We introduce HD-VGGT, a dual-branch architecture for efficient and robust high-resolution 3D reconstruction. A low-resolution branch predicts a coarse, globally consistent geometry, while a high-resolution branch refines details via a learned feature upsampling module. To handle unstable tokens, we propose Feature Modulation, which suppresses unreliable features early in the transformer. HD-VGGT leverages high-resolution images and supervision without full-resolution transformer costs, achieving state-of-the-art reconstruction quality.
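The token-growth argument above can be made concrete with a little arithmetic. The sketch below is illustrative only: the patch size of 14, the 518-pixel base resolution, and the view count of 8 are assumptions chosen for the example, not values stated in this abstract, and the helper names `token_count` and `attention_pairs` are hypothetical. It counts patch tokens for a multi-view input and the number of pairwise interactions a single global self-attention layer would compute over them.

```python
def token_count(height: int, width: int, num_views: int, patch: int = 14) -> int:
    """Number of patch tokens for a multi-view ViT-style input.

    Special tokens (camera, register, etc.) are ignored; `patch` is an
    assumed patch size, not a value taken from the abstract.
    """
    return num_views * (height // patch) * (width // patch)


def attention_pairs(tokens: int) -> int:
    """Pairwise token interactions in one global self-attention layer."""
    return tokens * tokens


# Assumed example setting: 8 views, doubling the image side length.
low = token_count(518, 518, num_views=8)
high = token_count(1036, 1036, num_views=8)

print(low, high)                                        # token counts
print(high // low)                                      # growth in tokens
print(attention_pairs(high) // attention_pairs(low))    # growth in attention pairs
```

Under these assumptions, doubling the side length quadruples the token count and multiplies the pairwise attention work by sixteen. This is the kind of cost that running the transformer only at low resolution and upsampling features afterwards is meant to avoid.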