Previous works that leverage video models for image-to-3D scene generation often suffer from geometric distortions and blurry content. In this paper, we renovate the image-to-3D scene generation pipeline by unlocking the potential of geometry models and present GeoWorld. Instead of exploiting geometric information extracted from a single input frame, we propose to first generate consecutive video frames and then leverage a geometry model to provide full-frame geometry features, which carry richer information than the single-frame depth maps or camera embeddings used in previous methods; these features serve as geometric conditions for the video generation model. To enhance the consistency of geometric structures, we further propose a geometry alignment loss, which imposes real-world geometric constraints on the model, and a geometry adaptation module, which ensures the effective utilization of the geometry features. Extensive experiments show that GeoWorld generates high-fidelity 3D scenes from a single image and a given camera trajectory, outperforming prior methods both qualitatively and quantitatively. Project Page: https://peaes.github.io/GeoWorld/.