Existing 3D-from-2D generators are typically designed for well-curated single-category datasets, where all the objects have (approximately) the same scale, 3D location, and orientation, and the camera always points to the center of the scene. This makes them inapplicable to diverse, in-the-wild datasets of non-alignable scenes rendered from arbitrary camera poses. In this work, we develop a 3D generator with Generic Priors (3DGP): a 3D synthesis framework with more general assumptions about the training data, and show that it scales to very challenging datasets, like ImageNet. Our model is based on three new ideas. First, we incorporate an inaccurate off-the-shelf depth estimator into 3D GAN training via a depth adaptation module that handles the estimator's imprecision. Second, we create a flexible camera model, together with a regularization strategy, so that the parameters of the camera distribution are learned during training. Finally, we extend recent ideas of transferring knowledge from pre-trained classifiers into GANs to patch-wise trained models by employing a simple distillation-based technique on top of the discriminator. This technique achieves more stable training than existing methods and speeds up convergence by at least 40%. We explore our model on four datasets: SDIP Dogs 256x256, SDIP Elephants 256x256, LSUN Horses 256x256, and ImageNet 256x256, and demonstrate that 3DGP outperforms recent state-of-the-art methods in terms of both texture and geometry quality. Code and visualizations: https://snap-research.github.io/3dgp.