Recent work has shown that generative models of 3D content can be trained from 2D image collections on small datasets corresponding to a single object class, such as human faces, animal faces, or cars. However, these models struggle on larger, more complex datasets. To model diverse and unconstrained image collections such as ImageNet, we present VQ3D, which introduces a NeRF-based decoder into a two-stage vector-quantized autoencoder. Stage 1 reconstructs an input image and allows the camera position to be changed around the depicted scene, and Stage 2 generates new 3D scenes. VQ3D is capable of generating and reconstructing 3D-aware images from the 1000-class ImageNet dataset of 1.2 million training images. We achieve an ImageNet generation FID score of 16.8 (lower is better), compared to 69.8 for the next best baseline method.
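To make the term "vector-quantized" concrete, the following is a minimal sketch of the nearest-codebook lookup that vector-quantized autoencoders generally use to turn continuous encoder outputs into discrete tokens. It is an illustration under stated assumptions, not the VQ3D implementation: the function names, the toy codebook, and the toy latents are all hypothetical, and a real model would learn the codebook and operate on much higher-dimensional features.

```python
# Illustrative sketch only: nearest-codebook lookup, the core step of
# vector quantization. Not the VQ3D implementation; all names and values
# below are hypothetical.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry.

    Returns (indices, quantized), where quantized[i] is codebook[indices[i]].
    """
    indices = []
    for z in latents:
        distances = [squared_distance(z, entry) for entry in codebook]
        indices.append(distances.index(min(distances)))
    quantized = [codebook[i] for i in indices]
    return indices, quantized


# Toy codebook with three 2-D entries (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
# Toy encoder outputs, one vector per spatial location (hypothetical values).
latents = [[0.1, -0.2], [0.9, 1.3], [-0.8, 1.7]]

indices, quantized = quantize(latents, codebook)
```

In a two-stage setup of this kind, the discrete `indices` are typically what a second-stage generative model is trained on, while a decoder (NeRF-based in VQ3D, per the abstract) consumes the quantized vectors.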