Latent diffusion models for image generation have crossed a quality threshold that enabled their mass adoption. Recently, a series of works has advanced toward replicating this success in the 3D domain, introducing techniques such as point cloud VAEs, triplane representations, neural implicit surfaces, and differentiable-rendering-based training. We take another step in this direction, combining these developments in a two-step pipeline consisting of 1) a triplane VAE that learns latent representations of textured meshes and 2) a conditional diffusion model that generates the triplane features. For the first time, this architecture allows conditional and unconditional generation of high-quality textured or untextured 3D meshes across multiple diverse categories in a few seconds on a single GPU. It substantially outperforms previous work on image-conditioned and unconditional generation in terms of both mesh quality and texture generation. Furthermore, we demonstrate that our model scales to large datasets, yielding increased quality and diversity. We will release our code and trained models.
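For readers unfamiliar with triplane representations, the following is a minimal sketch of how a feature vector for a 3D point is commonly read from three axis-aligned feature planes. It is not the implementation described in this work: the function names (`bilinear_sample`, `query_triplane`), the nested-list plane layout, the coordinate range of [-1, 1], and the choice to sum the three per-plane features are all assumptions made for illustration.

```python
# Illustrative sketch only: a generic triplane feature lookup.
# All names, the [-1, 1] coordinate convention, and sum aggregation are assumptions.
from typing import List, Sequence

Plane = List[List[List[float]]]  # indexed as plane[row][col][channel]


def bilinear_sample(plane: Plane, u: float, v: float) -> List[float]:
    """Sample a feature plane at (u, v) in [-1, 1]^2 using bilinear interpolation."""
    height, width = len(plane), len(plane[0])
    # Map [-1, 1] to continuous pixel coordinates, clamping out-of-range input.
    x = min(max((u + 1.0) / 2.0, 0.0), 1.0) * (width - 1)
    y = min(max((v + 1.0) / 2.0, 0.0), 1.0) * (height - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    tx, ty = x - x0, y - y0
    channels = len(plane[0][0])
    # Blend the four neighbouring texels, channel by channel.
    return [
        (1 - tx) * (1 - ty) * plane[y0][x0][c]
        + tx * (1 - ty) * plane[y0][x1][c]
        + (1 - tx) * ty * plane[y1][x0][c]
        + tx * ty * plane[y1][x1][c]
        for c in range(channels)
    ]


def query_triplane(
    xy_plane: Plane, xz_plane: Plane, yz_plane: Plane, point: Sequence[float]
) -> List[float]:
    """Project a 3D point onto the three planes and sum the sampled features."""
    x, y, z = point
    f_xy = bilinear_sample(xy_plane, x, y)
    f_xz = bilinear_sample(xz_plane, x, z)
    f_yz = bilinear_sample(yz_plane, y, z)
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]


# Toy usage: three identical 2x2 planes with a single feature channel.
toy_plane: Plane = [[[0.0], [1.0]], [[2.0], [3.0]]]
center_feature = query_triplane(toy_plane, toy_plane, toy_plane, (0.0, 0.0, 0.0))
```

In a full pipeline of the kind outlined above, a feature obtained this way would typically be passed to a small decoder network that predicts geometry and colour at the queried point. That decoder is outside the scope of this sketch.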