Semantic-driven 3D shape generation aims to generate 3D objects conditioned on text. Previous works are limited to single-category generation, capture only low-frequency 3D details, and require large amounts of paired data for training. To tackle these challenges, we propose a multi-category conditional diffusion model. Specifically, 1) to alleviate the lack of large-scale paired data, we bridge text, 2D images, and 3D shapes through the pre-trained CLIP model; 2) to obtain multi-category 3D shape features, we apply a conditional flow model that generates a 3D shape vector conditioned on the CLIP embedding; and 3) to generate multi-category 3D shapes, we employ a hidden-layer diffusion model conditioned on the multi-category shape vector, which greatly reduces training time and memory consumption.
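The three stages named in the abstract (text to CLIP embedding, CLIP embedding to shape vector via a conditional flow, shape vector to hidden-layer sample via conditional diffusion) can be illustrated with a toy data-flow sketch. This is not the authors' implementation: every name, dimension, and weight below is hypothetical, `encode_text` is a random-projection stand-in for a real pre-trained CLIP encoder, the flow is a single conditional affine layer, the noise predictor is untrained, and the decoder from the hidden-layer sample to an actual 3D shape is omitted. It only shows how the conditioning signals might be chained.

```python
import numpy as np

# All sizes are illustrative assumptions, not values from the paper.
EMB_DIM, SHAPE_DIM, LATENT_DIM, STEPS = 8, 4, 16, 10
_init_rng = np.random.default_rng(0)

# Stage 1 (assumption): stand-in for a frozen, pre-trained CLIP text encoder.
# Here it is only a fixed random projection of byte counts.
W_clip = _init_rng.normal(size=(256, EMB_DIM))

def encode_text(prompt: str) -> np.ndarray:
    counts = np.zeros(256)
    for byte in prompt.encode("utf-8"):
        counts[byte] += 1.0
    emb = counts @ W_clip
    return emb / (np.linalg.norm(emb) + 1e-8)  # unit-normalised embedding

# Stage 2 (assumption): one conditional affine flow layer. Scale and shift
# depend on the embedding, so the map noise -> shape vector is invertible.
W_scale = _init_rng.normal(scale=0.1, size=(EMB_DIM, SHAPE_DIM))
W_shift = _init_rng.normal(scale=0.1, size=(EMB_DIM, SHAPE_DIM))

def flow_forward(z: np.ndarray, cond: np.ndarray) -> np.ndarray:
    return z * np.exp(cond @ W_scale) + cond @ W_shift

def flow_inverse(x: np.ndarray, cond: np.ndarray) -> np.ndarray:
    return (x - cond @ W_shift) * np.exp(-(cond @ W_scale))

# Stage 3 (assumption): DDPM-style ancestral sampling in a hidden space,
# conditioned on the shape vector. The noise predictor is an untrained
# placeholder; a real model would be a learned network.
betas = np.linspace(1e-4, 0.2, STEPS)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
W_eps = _init_rng.normal(scale=0.1, size=(LATENT_DIM + SHAPE_DIM + 1, LATENT_DIM))

def predict_noise(x_t: np.ndarray, shape_vec: np.ndarray, t: int) -> np.ndarray:
    features = np.concatenate([x_t, shape_vec, [t / STEPS]])
    return np.tanh(features @ W_eps)

def sample_latent(shape_vec: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    x = rng.normal(size=LATENT_DIM)  # start from pure Gaussian noise
    for t in reversed(range(STEPS)):
        eps = predict_noise(x, shape_vec, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.normal(size=LATENT_DIM) if t > 0 else np.zeros(LATENT_DIM)
        x = mean + np.sqrt(betas[t]) * noise
    return x

def generate(prompt: str, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    cond = encode_text(prompt)                 # text -> embedding
    z = rng.normal(size=SHAPE_DIM)             # base noise for the flow
    shape_vec = flow_forward(z, cond)          # embedding -> shape vector
    return sample_latent(shape_vec, rng)       # shape vector -> hidden sample

if __name__ == "__main__":
    print(generate("a wooden chair").shape)
```

Running the diffusion in a compact hidden space rather than directly on a dense 3D representation is what would plausibly account for the reduced training time and memory the abstract reports. In this sketch that shows up only as the small `LATENT_DIM`.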