Acquiring detailed 3D scenes typically demands costly equipment, multi-view data, or labor-intensive modeling. A lightweight alternative, generating complex 3D scenes from a single top-down image, is therefore valuable in real-world applications. While recent 3D generative models have achieved remarkable results at the object level, extending them to full-scene generation often leads to inconsistent geometry, layout hallucinations, and low-quality meshes. In this work, we introduce 3DTown, a training-free framework designed to synthesize realistic and coherent 3D scenes from a single top-down view. Our method is grounded in two principles: region-based generation, which improves image-to-3D alignment and resolution, and spatial-aware 3D inpainting, which ensures global scene coherence and high-quality geometry. Specifically, we decompose the input image into overlapping regions and generate each region with a pretrained 3D object generator, then apply a masked rectified flow inpainting process that fills in missing geometry while maintaining structural continuity. This modular design overcomes resolution bottlenecks and preserves spatial structure without requiring 3D supervision or fine-tuning. Extensive experiments across diverse scenes show that 3DTown outperforms state-of-the-art baselines, including Trellis, Hunyuan3D-2, and TripoSG, in geometry quality, spatial coherence, and texture fidelity. Our results demonstrate that high-quality 3D town generation is achievable from a single image using a principled, training-free approach.
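The region decomposition step can be illustrated with a minimal sketch. The abstract does not specify tile sizes, overlap widths, or how regions are chosen, so the function name `overlapping_regions`, its parameters, and the fixed sliding-window grid are all assumptions made for illustration. This is not the authors' implementation. The sketch only computes pixel boxes that cover a top-down image with a guaranteed minimum overlap between neighbors.

```python
from typing import List, Tuple

# (top, left, bottom, right) in pixels, end-exclusive. Hypothetical type alias.
Box = Tuple[int, int, int, int]


def _starts(length: int, tile: int, stride: int) -> List[int]:
    """Start offsets of windows of size `tile` that cover [0, length)."""
    if tile >= length:
        return [0]
    starts = list(range(0, length - tile + 1, stride))
    # Add a final window flush with the far edge so no pixels are left uncovered.
    if starts[-1] != length - tile:
        starts.append(length - tile)
    return starts


def overlapping_regions(height: int, width: int, tile: int, overlap: int) -> List[Box]:
    """Cover a height x width image with square tiles sharing at least `overlap` pixels.

    Assumed scheme: a regular sliding window. The paper's actual region
    selection may differ.
    """
    if tile <= 0 or not 0 <= overlap < tile:
        raise ValueError("need tile > 0 and 0 <= overlap < tile")
    stride = tile - overlap
    return [
        (top, left, min(top + tile, height), min(left + tile, width))
        for top in _starts(height, tile, stride)
        for left in _starts(width, tile, stride)
    ]


if __name__ == "__main__":
    for box in overlapping_regions(10, 10, tile=6, overlap=2):
        print(box)
```

In a pipeline of the kind the abstract describes, each box would be cropped from the input image and passed to the pretrained 3D object generator, and the shared strips between neighboring boxes would give the inpainting stage known geometry to condition on.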