Generating accurate 3D models is a challenging problem that traditionally requires supervised learning from 3D datasets. Although recent advances have shown promise in learning 3D models from 2D images, these methods often rely on well-structured datasets with multi-view images of each instance or camera pose information. Furthermore, these datasets usually feature simple shapes on clean backgrounds; they are expensive to acquire, and models trained on them generalize poorly, which limits the applicability of these methods. To overcome these limitations, we propose a method for reconstructing 3D geometry from the diverse and unstructured ImageNet dataset without camera pose information. We use an efficient triplane representation to learn 3D models from 2D images and modify the StyleGAN2-based generator backbone to adapt to the highly diverse dataset. To prevent mode collapse and improve training stability on diverse data, we propose to use multi-view discrimination. The trained generator can produce class-conditional 3D models as well as renderings from arbitrary viewpoints. The class-conditional generation results demonstrate significant improvement over the current state-of-the-art method. Additionally, using pivotal tuning inversion (PTI), we can efficiently reconstruct the complete 3D geometry from single-view images.
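For readers unfamiliar with the triplane representation mentioned in the abstract, the following is a minimal NumPy sketch of how a 3D point can be turned into a feature vector by projecting it onto three axis-aligned feature planes and interpolating bilinearly. It is an illustration under stated assumptions, not the method's actual implementation: the function names `bilinear_sample` and `sample_triplane`, the plane ordering (XY, XZ, YZ), the coordinate range of [-1, 1], and aggregation by summation are all choices made here for the example.

```python
import numpy as np

def bilinear_sample(plane, uv):
    """Bilinearly sample a (C, H, W) feature plane at (N, 2) coords in [-1, 1]."""
    C, H, W = plane.shape
    # Map normalized coordinates to continuous pixel indices.
    x = np.clip((uv[:, 0] + 1.0) * 0.5 * (W - 1), 0, W - 1)
    y = np.clip((uv[:, 1] + 1.0) * 0.5 * (H - 1), 0, H - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx = x - x0  # horizontal interpolation weight
    wy = y - y0  # vertical interpolation weight
    top = plane[:, y0, x0] * (1 - wx) + plane[:, y0, x1] * wx
    bottom = plane[:, y1, x0] * (1 - wx) + plane[:, y1, x1] * wx
    return (top * (1 - wy) + bottom * wy).T  # shape (N, C)

def sample_triplane(planes, points):
    """Look up features for (N, 3) points from a (3, C, H, W) triplane.

    Assumption for this sketch: planes are ordered XY, XZ, YZ and the
    three sampled features are aggregated by summation.
    """
    f_xy = bilinear_sample(planes[0], points[:, [0, 1]])
    f_xz = bilinear_sample(planes[1], points[:, [0, 2]])
    f_yz = bilinear_sample(planes[2], points[:, [1, 2]])
    return f_xy + f_xz + f_yz

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    planes = rng.normal(size=(3, 32, 64, 64))     # 3 planes, 32 channels each
    points = rng.uniform(-1.0, 1.0, size=(5, 3))  # 5 query points in the cube
    print(sample_triplane(planes, points).shape)
```

In a full pipeline, the per-point features returned by such a lookup would typically be passed to a small decoder that predicts density and color for volume rendering; that stage is omitted here.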