We study the problem of generating novel views of indoor scenes from sparse input views. The challenge is to achieve both photorealism and view consistency. We present SparseGNV, a learning framework that combines 3D structure with image generative models and generates novel views using three modules. The first module builds a neural point cloud as the underlying geometry, providing contextual information and guidance for the target novel view. The second module uses a transformer-based network to map the scene context and the guidance into a shared latent space, then autoregressively decodes the target view as a sequence of discrete image tokens. The third module reconstructs the image of the target view from the tokens. SparseGNV is trained on a large indoor scene dataset to learn generalizable priors. Once trained, it can efficiently generate novel views of an unseen indoor scene in a feed-forward manner. We evaluate SparseGNV on both real-world and synthetic indoor scenes and demonstrate that it outperforms state-of-the-art methods based on either neural radiance fields or conditional image generation.
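The autoregressive decoding step of the second module can be illustrated with a minimal sketch. It shows only the control flow: each discrete token is chosen conditioned on the scene context and on the tokens decoded so far. Everything named here (`decode_tokens`, `toy_logits`, `tokens_to_grid`, the vocabulary size of 8, and greedy selection) is a hypothetical stand-in chosen for illustration, not the architecture or decoding strategy of SparseGNV itself.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical interface: given the conditioning context and the tokens
# decoded so far, return one score per codebook entry. In the paper this
# role is played by a transformer; here it is only a placeholder.
LogitFn = Callable[[Sequence[float], Sequence[int]], List[float]]


def decode_tokens(context: Sequence[float],
                  next_logits: LogitFn,
                  num_tokens: int) -> List[int]:
    """Greedy autoregressive decoding: pick the best-scoring token at each
    step, then feed the growing prefix back into the scorer."""
    tokens: List[int] = []
    for _ in range(num_tokens):
        logits = next_logits(context, tokens)
        best = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(best)
    return tokens


def toy_logits(context: Sequence[float],
               prefix: Sequence[int],
               vocab_size: int = 8) -> List[float]:
    """Deterministic toy scorer (an assumption, not a trained network).
    Scores depend on both the context and the decoded prefix."""
    seed = int(sum(context) * 1000) + 31 * len(prefix) + sum(prefix)
    rng = random.Random(seed)
    return [rng.random() for _ in range(vocab_size)]


def tokens_to_grid(tokens: Sequence[int], width: int) -> List[List[int]]:
    """Arrange the flat token sequence in raster order. Turning this grid
    into pixels would be the job of the third module, not modeled here."""
    return [list(tokens[i:i + width]) for i in range(0, len(tokens), width)]


if __name__ == "__main__":
    scene_context = [0.1, 0.2]  # placeholder for context and guidance features
    tokens = decode_tokens(scene_context, toy_logits, num_tokens=16)
    for row in tokens_to_grid(tokens, width=4):
        print(row)
```

Greedy selection is used only to keep the sketch deterministic; a sampling-based decoder would follow the same loop with a different choice rule at each step.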