The recent paradigm shift in 3D vision has led to the rise of foundation models with remarkable capabilities in 3D perception from uncalibrated images. However, extending these models to 3D reconstruction from large-scale RGB streams remains challenging due to memory limitations. This work proposes S-MUSt3R, a simple and efficient pipeline that extends the limits of foundation models for monocular 3D reconstruction. Our approach addresses the scalability bottleneck of foundation models through a simple strategy of sequence segmentation followed by segment alignment and lightweight loop closure optimization. Without retraining the model, we leverage the remarkable 3D reconstruction capabilities of the MUSt3R model and achieve trajectory and reconstruction performance comparable to traditional methods with more complex architectures. We evaluate S-MUSt3R on TUM, 7-Scenes, and proprietary robot navigation datasets, and show that S-MUSt3R runs successfully on long RGB sequences and produces accurate and consistent 3D reconstructions. Our results highlight the potential of leveraging the MUSt3R model for scalable monocular 3D scene reconstruction in real-world settings, with the important advantage of making predictions directly in metric space.
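The segmentation-then-alignment strategy described above can be sketched in miniature as follows. This is an illustrative assumption of how such a pipeline could be structured, not the paper's actual implementation: the segment reconstruction step is stubbed out, the "pointmap" is reduced to a 1-D stand-in, and the alignment fits only a scale-and-shift rather than the Sim(3)/SE(3) transform a real system would estimate on 3-D pointmaps. All function names are hypothetical.

```python
# Hypothetical sketch of a segment-and-align pipeline: split a long RGB
# stream into overlapping segments, reconstruct each independently
# (stubbed), then align each segment to the previous one using the
# frames they share. Loop closure optimization is omitted for brevity.

def split_into_segments(n_frames, seg_len, overlap):
    """Yield (start, end) index pairs covering the stream with overlap."""
    segments, start = [], 0
    while start < n_frames:
        end = min(start + seg_len, n_frames)
        segments.append((start, end))
        if end == n_frames:
            break
        start = end - overlap  # next segment re-uses the last frames
    return segments

def align_segment(prev_vals, curr_vals, overlap):
    """Fit a least-squares scale + shift mapping the current segment's
    overlapping frames onto the previous segment's, and apply it to the
    whole current segment. (A real system would fit a rigid/similarity
    transform on 3-D pointmaps instead of 1-D values.)"""
    a, b = prev_vals[-overlap:], curr_vals[:overlap]
    n = overlap
    mb, ma = sum(b) / n, sum(a) / n
    var = sum((x - mb) ** 2 for x in b) or 1.0
    cov = sum((x - mb) * (y - ma) for x, y in zip(b, a))
    s = cov / var
    t = ma - s * mb
    return [s * x + t for x in curr_vals]
```

The overlap between consecutive segments is what makes the per-segment reconstructions, each expressed in its own frame, stitchable into one consistent trajectory without retraining the underlying model.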