We introduce a novel approach that takes a single semantic mask as input and synthesizes multi-view consistent color images of natural scenes. The model is trained on a collection of single images from the Internet. Prior works on 3D-aware image synthesis either require multi-view supervision or learn category-level priors for specific classes of objects, which makes them hard to apply to natural scenes. Our key idea for solving this challenging problem is to use a semantic field as the intermediate representation: it is easier to reconstruct from an input semantic mask, and it can then be translated into a radiance field with the assistance of off-the-shelf semantic image synthesis models. Experiments show that our method outperforms baseline methods and produces photorealistic, multi-view consistent videos of a variety of natural scenes.