UrbanWorld: An Urban World Model for 3D City Generation

Cities, as the essential environment of human life, encompass diverse physical elements such as buildings, roads, and vegetation, which continuously interact with dynamic entities such as people and vehicles. Crafting realistic, interactive 3D urban environments is essential for nurturing AGI systems and for building AI agents that can perceive, decide, and act like humans in real-world environments. However, creating high-fidelity 3D urban environments usually demands extensive manual labor from designers, who must detail and represent complex urban elements by hand. Automating this process remains a longstanding challenge. To address this problem, we propose UrbanWorld, the first generative urban world model that can automatically create a customized, realistic, and interactive 3D urban world under flexible control conditions. The UrbanWorld generation pipeline has four key stages: flexible 3D layout generation from OSM data or from an urban layout with semantic and height maps, urban scene design with Urban MLLM, controllable urban asset rendering via progressive 3D diffusion, and MLLM-assisted scene refinement. We conduct extensive quantitative analysis on five visual metrics, demonstrating that UrbanWorld achieves state-of-the-art generation realism. Next, we provide qualitative results on the controllable generation capabilities of UrbanWorld using both textual and image-based prompts. Lastly, we verify the interactive nature of these environments by showcasing agent perception and navigation within the created environments. We contribute UrbanWorld as an open-source tool available at https://github.com/Urban-World/UrbanWorld.
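The ordering of the four stages named in the abstract can be illustrated with a minimal sketch. It is not UrbanWorld's actual code or API: the `UrbanScene` container, the stage function names, and the placeholder strings are all hypothetical, and each stage body is a stub standing in for the real component (OSM parsing, Urban MLLM, 3D diffusion rendering, MLLM refinement).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class UrbanScene:
    """Hypothetical container passed between stages (not UrbanWorld's data model)."""
    layout: Dict[str, str] = field(default_factory=dict)   # asset name -> geometry kind
    design: Dict[str, str] = field(default_factory=dict)   # asset name -> description
    assets: List[str] = field(default_factory=list)        # rendered asset placeholders
    history: List[str] = field(default_factory=list)       # stages applied so far


def generate_layout(scene: UrbanScene, prompt: str) -> UrbanScene:
    # Stage 1 stub: a real system would read OSM data or semantic and height maps.
    scene.layout = {"building_0": "footprint", "road_0": "centerline"}
    scene.history.append("layout")
    return scene


def design_scene(scene: UrbanScene, prompt: str) -> UrbanScene:
    # Stage 2 stub: an MLLM would write a description for each asset.
    scene.design = {name: f"{prompt}: {name}" for name in scene.layout}
    scene.history.append("design")
    return scene


def render_assets(scene: UrbanScene, prompt: str) -> UrbanScene:
    # Stage 3 stub: a diffusion-based renderer would texture each asset.
    scene.assets = [f"textured<{desc}>" for desc in scene.design.values()]
    scene.history.append("render")
    return scene


def refine_scene(scene: UrbanScene, prompt: str) -> UrbanScene:
    # Stage 4 stub: an MLLM would inspect the rendered scene and adjust it.
    scene.history.append("refine")
    return scene


# Assumed stage order, taken from the order in which the abstract lists the stages.
PIPELINE: List[Callable[[UrbanScene, str], UrbanScene]] = [
    generate_layout,
    design_scene,
    render_assets,
    refine_scene,
]


def run_pipeline(prompt: str) -> UrbanScene:
    """Apply each stage in turn to a fresh scene."""
    scene = UrbanScene()
    for stage in PIPELINE:
        scene = stage(scene, prompt)
    return scene
```

Calling `run_pipeline("modern residential district")` returns a scene whose `history` records the four stages in order. Keeping the stages as plain functions over a shared scene object is only one possible design; it makes it easy to swap a stub for a real component.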
