Internet photo collections exhibit an extremely long-tailed distribution: a few famous landmarks are densely photographed and easily reconstructed in 3D, while most real-world sites are represented only by sparse, noisy, unevenly distributed imagery that is beyond the capabilities of both classical and learned 3D methods. We believe that tackling this long-tail regime is one of the next frontiers for 3D foundation models. Although reliable ground-truth 3D supervision for sparse scenes is difficult to acquire, we observe that it can be effectively simulated by sampling sparse subsets of images from well-reconstructed Internet landmarks. To this end, we introduce MegaDepth-X, a large dataset of 3D reconstructions with clean, dense depth, together with a strategy for sampling sets of training images that mimic the camera distributions of long-tail scenes. Finetuning 3D foundation models with these components yields robust reconstructions under extreme sparsity and more reliable reconstructions of symmetric and repetitive scenes, while preserving generalization to standard, dense 3D benchmark datasets.