Latent diffusion models for image generation have crossed a quality threshold that enabled their mass adoption. Recently, a series of works has made progress towards replicating this success in the 3D domain, introducing techniques such as point cloud VAEs, triplane representations, neural implicit surfaces, and differentiable-rendering-based training. We take another step in this direction, combining these developments in a two-step pipeline consisting of 1) a triplane VAE that learns latent representations of textured meshes and 2) a conditional diffusion model that generates the triplane features. For the first time, this architecture allows conditional and unconditional generation of high-quality textured or untextured 3D meshes across multiple diverse categories in a few seconds on a single GPU. It substantially outperforms previous work on both image-conditioned and unconditional generation, in mesh quality as well as texture generation. Furthermore, we demonstrate that our model scales to large datasets, yielding increased quality and diversity. We will release our code and trained models.