Despite increasingly realistic image quality, recent 3D image generative models often operate on 3D volumes of fixed extent with limited camera motion. We investigate the task of unconditionally synthesizing unbounded nature scenes, enabling arbitrarily large camera motion while maintaining a persistent 3D world model. Our scene representation consists of a panoramic skydome and an extendable, planar scene layout grid, which can be rendered from arbitrary camera poses via a 3D decoder and volume rendering. Based on this representation, we learn a generative world model solely from single-view internet photos. Our method enables simulating long flights through 3D landscapes while maintaining global scene consistency; for instance, returning to the starting point yields the same view of the scene. Our approach enables scene extrapolation beyond the fixed bounds of current 3D generative models, while also supporting a persistent, camera-independent world representation that stands in contrast to auto-regressive 3D prediction models. Project page: https://chail.github.io/persistent-nature/.
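To make the idea of a persistent, extendable layout grid concrete, the following is a minimal toy sketch and not the authors' implementation. It assumes a hypothetical `LayoutGrid` class whose tiles are filled by a seeded random number generator (standing in for a learned layout generator), a hand-written density rule (standing in for the 3D decoder), and a hypothetical `render_ray` helper that applies the standard volume-rendering quadrature. The tile size, density value, and sampling parameters are all illustrative assumptions.

```python
import math
import random

TILE = 8  # hypothetical tile size, in grid cells


class LayoutGrid:
    """Toy extendable 2D layout grid; tiles are created lazily and cached."""

    def __init__(self, seed=0):
        self.seed = seed
        self.tiles = {}  # (tile_x, tile_y) -> TILE x TILE list of heights

    def _tile(self, tx, ty):
        key = (tx, ty)
        if key not in self.tiles:
            # Assumption: a seeded RNG stands in for a learned layout generator.
            rng = random.Random(f"{self.seed}:{tx}:{ty}")
            self.tiles[key] = [[rng.random() for _ in range(TILE)]
                               for _ in range(TILE)]
        return self.tiles[key]

    def density(self, x, y, z):
        """Toy 'decoder': a point is solid if it lies below the terrain height."""
        cx, cy = math.floor(x), math.floor(y)
        height = self._tile(cx // TILE, cy // TILE)[cy % TILE][cx % TILE]
        return 5.0 if z < height else 0.0


def render_ray(grid, origin, direction, n_samples=64, step=0.1):
    """Accumulate opacity along a ray with the usual alpha-compositing rule."""
    transmittance = 1.0
    opacity = 0.0
    for i in range(n_samples):
        t = (i + 0.5) * step
        point = [o + t * d for o, d in zip(origin, direction)]
        alpha = 1.0 - math.exp(-grid.density(*point) * step)
        opacity += transmittance * alpha
        transmittance *= 1.0 - alpha
    return opacity


grid = LayoutGrid(seed=42)
start_cam = ((0.5, 0.5, 0.9), (1.0, 0.0, -0.1))   # (origin, direction)
first = render_ray(grid, *start_cam)
render_ray(grid, (500.5, -300.5, 0.9), (1.0, 0.0, -0.1))  # fly far away
again = render_ray(grid, *start_cam)              # return to the start
```

Because the scene content is stored in the grid rather than regenerated from the previous frame, `first` and `again` are identical in this sketch, and the far-away ray extends the grid by creating new tiles on demand. The skydome and the learned components described in the abstract are deliberately left out.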