We propose scaling up 3D scene reconstruction by training with synthesized data. At the core of our work is MegaSynth, a procedurally generated 3D dataset comprising 700K scenes, over 50 times larger than the prior real dataset DL3DV, dramatically scaling the training data. To enable scalable data generation, our key idea is to eliminate semantic information, removing the need to model complex semantic priors such as object affordances and scene composition. Instead, we model scenes with basic spatial structures and geometry primitives, which offers scalability. In addition, we control data complexity to facilitate training while loosely aligning it with the real-world data distribution to benefit real-world generalization. We explore training LRMs with both MegaSynth and available real data. Experimental results show that joint training or pre-training with MegaSynth improves reconstruction quality by 1.2 to 1.8 dB PSNR across diverse image domains. Moreover, models trained solely on MegaSynth perform comparably to those trained on real data, underscoring the low-level nature of 3D reconstruction. Finally, we provide an in-depth analysis of MegaSynth's properties for enhancing model capability, training stability, and generalization.
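To make the semantics-free generation idea concrete, here is a minimal illustrative sketch of procedural scene sampling with geometry primitives: shapes, positions, scales, and rotations are drawn at random within a bounding volume, with no object-level or compositional priors. The function name, primitive set, and parameter ranges are illustrative assumptions, not the paper's actual generation pipeline.

```python
import random

def sample_scene(num_primitives=20, room_size=10.0, seed=None):
    """Sample a semantics-free scene as a list of randomly placed primitives.

    Illustrative sketch only: the primitive types and parameter ranges are
    assumptions, not the actual MegaSynth generation procedure.
    """
    rng = random.Random(seed)
    primitives = []
    for _ in range(num_primitives):
        primitives.append({
            # Basic geometry primitives; no semantic labels or affordances.
            "shape": rng.choice(["box", "sphere", "cylinder"]),
            # Uniform placement inside a cubic bounding volume.
            "position": [rng.uniform(0.0, room_size) for _ in range(3)],
            # Per-axis scale controls scene complexity.
            "scale": [rng.uniform(0.2, 2.0) for _ in range(3)],
            # Random yaw rotation in degrees.
            "rotation_deg": rng.uniform(0.0, 360.0),
        })
    return primitives

scene = sample_scene(num_primitives=5, seed=0)
```

Because each scene is an independent draw from simple distributions, generation parallelizes trivially, which is what makes scaling to hundreds of thousands of scenes feasible; complexity knobs such as `num_primitives` and the scale range can be tuned to loosely match real-world data statistics.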