Joint estimation of surface normals and depth is essential for holistic 3D scene understanding, yet high-resolution prediction remains difficult due to the trade-off between preserving fine local detail and maintaining global consistency. To address this challenge, we propose the Ultra Resolution Geometry Transformer (URGT), which adapts the Visual Geometry Grounded Transformer (VGGT) into a unified multi-patch transformer for monocular high-resolution depth-normal estimation. A single high-resolution image is partitioned into patches that are augmented with coarse depth and normal priors from pre-trained models, and jointly processed in a single forward pass to predict refined geometric outputs. Global coherence is enforced through cross-patch attention, which enables long-range geometric reasoning and seamless propagation of information across patches within a shared backbone. To further enhance spatial robustness, we introduce a GridMix patch sampling strategy that probabilistically samples grid configurations during training, improving inter-patch consistency and generalization. Our method achieves state-of-the-art results on UnrealStereo4K, jointly improving depth and normal estimation: it reduces AbsRel from 0.0582 to 0.0291, RMSE from 2.17 to 1.31, and mean angular error from 23.36 degrees to 18.51 degrees, while producing sharper and more stable geometry. The proposed multi-patch framework also demonstrates strong zero-shot and cross-domain generalization and scales effectively to very high resolutions, offering an efficient and extensible solution for high-quality geometry refinement.
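The GridMix sampling and multi-patch partitioning described above can be illustrated with a minimal sketch. The grid options, their sampling probabilities, and the channel layout (RGB plus coarse depth and normal priors) are illustrative assumptions, not the paper's actual configuration:

```python
import random
import numpy as np

# Illustrative GridMix-style sampling: at each training step a grid
# configuration is drawn probabilistically, and the high-resolution input
# (image stacked with coarse depth/normal prior channels) is split into
# that grid of patches. Grid choices and weights here are hypothetical.
GRID_CHOICES = [(1, 1), (2, 2), (2, 4), (4, 4)]
GRID_PROBS = [0.1, 0.4, 0.2, 0.3]

def gridmix_patches(x: np.ndarray, rng: random.Random) -> list:
    """Split an (H, W, C) array into a randomly sampled rows x cols grid."""
    rows, cols = rng.choices(GRID_CHOICES, weights=GRID_PROBS, k=1)[0]
    h, w = x.shape[0] // rows, x.shape[1] // cols
    return [x[r * h:(r + 1) * h, c * w:(c + 1) * w]
            for r in range(rows) for c in range(cols)]

rng = random.Random(0)
# 3 RGB channels + 1 coarse-depth channel + 3 coarse-normal channels = 7.
inputs = np.zeros((512, 512, 7), dtype=np.float32)
patches = gridmix_patches(inputs, rng)
print(len(patches), patches[0].shape)
```

All patches from one draw would then be processed jointly in a single forward pass, with cross-patch attention in the shared backbone providing the global coherence the abstract refers to.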