Diffusion-based methods have achieved remarkable success in 2D image and 3D object generation. However, the generation of 3D scenes, and even $360^{\circ}$ images, remains constrained by the limited number of scene datasets, the complexity of 3D scenes themselves, and the difficulty of generating consistent multi-view images. To address these issues, we first establish a large-scale panoramic video-text dataset containing millions of consecutive panoramic keyframes with corresponding panoramic depths, camera poses, and text descriptions. We then propose a novel text-driven panoramic generation framework, termed DiffPano, to achieve scalable, consistent, and diverse panoramic scene generation. Specifically, benefiting from the powerful generative capabilities of Stable Diffusion, we fine-tune a single-view text-to-panorama diffusion model with LoRA on the established panoramic video-text dataset. We further design a spherical epipolar-aware multi-view diffusion model to ensure multi-view consistency among the generated panoramic images. Extensive experiments demonstrate that DiffPano can generate scalable, consistent, and diverse panoramic images from unseen text descriptions and camera poses.