InstantSfM: Towards GPU-Native SfM for the Deep Learning Era

Structure-from-Motion (SfM) is a fundamental technique for recovering camera poses and scene structure from multi-view imagery. It serves as a critical upstream component for applications ranging from 3D reconstruction to modern neural scene representations such as 3D Gaussian Splatting. However, most mature SfM systems remain CPU-centric and are built on traditional optimization toolchains. This creates a growing mismatch with modern GPU-based, learning-driven pipelines and limits their scalability to large-scale scenes. Recent advances in GPU-accelerated bundle adjustment (BA) have demonstrated the potential of parallel sparse optimization. Extending these techniques into a complete global SfM system nevertheless remains challenging because of unresolved issues in metric scale recovery and numerical robustness. In this paper, we present InstantSfM, a fully GPU-based, PyTorch-compatible global SfM system designed to integrate seamlessly with modern learning pipelines. InstantSfM embeds metric depth priors directly into both global positioning and BA through a depth-constrained Jacobian structure, thereby resolving scale ambiguity within the optimization framework. To ensure numerical stability, we explicitly filter under-constrained variables out of the Jacobian matrix in an optimized, GPU-friendly manner. Extensive experiments on diverse datasets demonstrate that InstantSfM achieves state-of-the-art efficiency while maintaining reconstruction accuracy comparable to both established classical pipelines and recent learning-based methods, with up to a ${\sim40\times}$ speedup over COLMAP on large-scale scenes.
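The abstract does not spell out either mechanism, so the following is only a rough illustration under stated assumptions, not InstantSfM's actual formulation. It shows two ideas. First, a metric depth prior can contribute one extra residual per observation, and the Jacobian row of that residual ties the point and the camera translation to an absolute scale. Second, under-constrained cameras and points can be pruned from the Jacobian's sparsity pattern before a solve. The function names `depth_residual` and `filter_underconstrained`, the z-depth residual form, and the thresholds (two views per point, three observations per camera) are all hypothetical choices. A GPU version would presumably count observations over index tensors (for example with `torch.bincount`) rather than with `collections.Counter`.

```python
from collections import Counter

# Hypothetical helper: one metric-depth-prior residual and its Jacobian rows.
def depth_residual(R, t, X, d_prior, weight=1.0):
    """Return (r, dr_dX, dr_dt) for r = weight * (z_cam - d_prior).

    R is a 3x3 rotation given as a list of rows, t the camera translation,
    X the world point. z_cam = R[2] . X + t[2] is the point's depth along the
    camera's optical axis (assumed to match the prior's depth convention).
    The derivative with respect to the rotation parameters is omitted here.
    """
    z_cam = sum(R[2][k] * X[k] for k in range(3)) + t[2]
    r = weight * (z_cam - d_prior)
    dr_dX = [weight * R[2][k] for k in range(3)]  # d z_cam / dX = third row of R
    dr_dt = [0.0, 0.0, weight]                    # d z_cam / dt = unit z vector
    return r, dr_dX, dr_dt


# Hypothetical helper: prune under-constrained variables from the sparsity pattern.
def filter_underconstrained(observations, min_views_per_point=2, min_obs_per_camera=3):
    """observations: list of (camera_id, point_id) pairs, one per 2D measurement.

    Returns the surviving observations re-indexed to contiguous column ids,
    plus the camera and point id maps. Dropping a point can leave a camera
    with too few observations (and vice versa), so filter to a fixed point.
    """
    obs = list(observations)
    while True:
        point_counts = Counter(p for _, p in obs)
        kept = [(c, p) for c, p in obs if point_counts[p] >= min_views_per_point]
        camera_counts = Counter(c for c, _ in kept)
        kept = [(c, p) for c, p in kept if camera_counts[c] >= min_obs_per_camera]
        if len(kept) == len(obs):  # nothing removed: pattern is stable
            break
        obs = kept
    cam_map = {c: i for i, c in enumerate(sorted({c for c, _ in obs}))}
    pt_map = {p: i for i, p in enumerate(sorted({p for _, p in obs}))}
    compact = [(cam_map[c], pt_map[p]) for c, p in obs]
    return compact, cam_map, pt_map
```

For example, consider an identity rotation with t = (0, 0, 1), X = (1, 2, 3) and a depth prior of 3.5. In that case `depth_residual` returns a residual of 0.5, and both Jacobian rows equal (0, 0, 1). The filter iterates to a fixed point for a specific reason. Without a depth prior, a point seen in only one view contributes two equations for three unknowns. A camera with fewer than three observed points cannot fix its six degrees of freedom. Either case can leave the normal equations singular, and removing one such variable can expose another.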

翻译：运动恢复结构（SfM）是一种从多视角图像中恢复相机位姿与场景结构的基础技术，是从三维重建到现代神经场景表示（如3D高斯泼溅）等众多应用的关键上游组件。然而，大多数成熟的SfM系统仍以CPU为中心，并构建在传统的优化工具链之上，这与现代基于GPU、学习驱动的流程日益脱节，并限制了其在大规模场景中的可扩展性。尽管GPU加速的捆绑调整（BA）技术的最新进展已展示了并行稀疏优化的潜力，但由于在度量尺度恢复和数值鲁棒性方面仍存在未解决的问题，将这些技术扩展以构建一个完整的全局SfM系统仍然具有挑战性。本文实现了一个完全基于GPU且与PyTorch兼容的全局SfM系统，命名为InstantSfM，旨在与现代学习流程无缝集成。InstantSfM通过一种深度约束的雅可比矩阵结构，将度量深度先验直接嵌入到全局定位和BA中，从而在优化框架内解决了尺度模糊性问题。为确保数值稳定性，我们以优化的、GPU友好的方式对雅可比矩阵中约束不足的变量进行显式过滤。在多样化数据集上的大量实验表明，InstantSfM在保持与成熟的经典流程以及近期基于学习的方法相当的建图精度的同时，实现了最先进的效率，在大规模场景上相比COLMAP展现出高达${\sim40\times}$的加速比。