Generating accurate 3D models is a challenging problem that has traditionally required supervised learning on explicit 3D datasets. Although recent methods have shown promise in learning 3D models from 2D images, they often rely on well-structured datasets that provide multi-view images of each instance or camera pose information. Furthermore, these datasets usually contain simple shapes on clean backgrounds, which makes the data expensive to acquire and the resulting models hard to generalize, limiting the applicability of these methods. To overcome these limitations, we propose a method for reconstructing 3D geometry from the diverse and unstructured ImageNet dataset without camera pose information. We use an efficient triplane representation to learn 3D models from 2D images and modify the StyleGAN2-based generator backbone to adapt to the highly diverse dataset. To prevent mode collapse and improve training stability on diverse data, we propose multi-view discrimination. The trained generator can produce class-conditional 3D models as well as renderings from arbitrary viewpoints. The class-conditional generation results demonstrate significant improvement over the current state-of-the-art method. Additionally, using pivotal tuning inversion (PTI), we can efficiently reconstruct the full 3D geometry from single-view images.