Reconstructing coherent 3D geometry and appearance from unposed multi-view images is a fundamental yet challenging problem in computer vision. Most existing visual geometry foundation models predict explicit geometry by regressing pixel-aligned pointmaps, which often suffer from redundancy and limited geometric continuity. We propose IVGT, an Implicit Visual Geometry Transformer that implicitly models continuous and coherent geometry from pose-free multi-view images. This formulation learns a continuous neural scene representation in a canonical coordinate system and supports continuous spatial queries at any 3D position, retrieving local features to predict signed distance function (SDF) values and colors using lightweight decoders. It allows direct extraction of continuous and coherent surface geometry, enabling rendering of RGB images, depth maps, and surface normal maps from arbitrary viewpoints. We train IVGT via multi-dataset joint optimization with 2D supervision and 3D geometric regularization. IVGT generalizes across scenes and achieves strong performance on various tasks, including mesh and point cloud reconstruction, novel view synthesis, depth and surface normal estimation, and camera pose estimation.
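
To make the idea of querying an SDF at arbitrary 3D positions concrete, the following is a minimal sketch of how a depth value and a surface normal could be read off a signed distance field along one camera ray. It is an illustration under stated assumptions, not IVGT's renderer: the analytic `sdf_sphere` function is a hypothetical stand-in for the learned feature retrieval and SDF decoder, and sphere tracing with finite-difference normals is one common way to render an SDF, which may differ from what the paper uses. All function names and parameters here are invented for the example.

```python
import math


def sdf_sphere(p, center=(0.0, 0.0, 3.0), radius=1.0):
    """Hypothetical stand-in for a learned SDF decoder.

    Returns the signed distance from point p to a sphere:
    negative inside, zero on the surface, positive outside.
    """
    return math.dist(p, center) - radius


def sphere_trace(sdf, origin, direction, max_steps=64, eps=1e-4, far=20.0):
    """March along a unit-length ray, stepping by the queried SDF value.

    Returns the depth t of the first surface hit, or None on a miss.
    """
    t = 0.0
    for _ in range(max_steps):
        p = tuple(o + t * d for o, d in zip(origin, direction))
        dist = sdf(p)
        if dist < eps:
            return t  # close enough to the zero level set
        t += dist  # safe step: no surface is nearer than dist
        if t > far:
            break
    return None


def sdf_normal(sdf, p, h=1e-4):
    """Surface normal as the normalized central-difference SDF gradient."""
    grad = []
    for axis in range(3):
        hi = list(p)
        lo = list(p)
        hi[axis] += h
        lo[axis] -= h
        grad.append((sdf(tuple(hi)) - sdf(tuple(lo))) / (2.0 * h))
    norm = math.sqrt(sum(g * g for g in grad))
    return tuple(g / norm for g in grad)


# One ray from the camera center along +z, and one ray that misses.
origin = (0.0, 0.0, 0.0)
depth = sphere_trace(sdf_sphere, origin, (0.0, 0.0, 1.0))
hit_point = (0.0, 0.0, depth)
normal = sdf_normal(sdf_sphere, hit_point)
miss = sphere_trace(sdf_sphere, origin, (0.0, 1.0, 0.0))
```

Repeating this per pixel would give a depth map and a normal map for a chosen viewpoint; in a learned model, a color decoder queried at `hit_point` would supply the RGB value in the same way.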