Text-guided 3D shape generation remains challenging due to the absence of large paired text-shape datasets, the substantial semantic gap between the two modalities, and the structural complexity of 3D shapes. This paper presents a new framework for the task, Image as Stepping Stone (ISS), which introduces 2D images as a stepping stone to connect the two modalities and to eliminate the need for paired text-shape data. Our key contribution is a two-stage feature-space-alignment approach that maps CLIP features to shapes by harnessing a pre-trained single-view reconstruction (SVR) model with multi-view supervision: we first map the CLIP image feature to the detail-rich shape space of the SVR model, then map the CLIP text feature to the shape space and optimize the mapping by encouraging CLIP consistency between the input text and the rendered images. Further, we formulate a text-guided shape stylization module to dress up the output shapes with novel textures. Unlike existing work on 3D shape generation from text, our approach is general: it creates shapes in a broad range of categories without requiring paired text-shape data. Experimental results show that our approach outperforms state-of-the-art methods and our baselines in terms of fidelity and consistency with the text. Moreover, our approach can stylize the generated shapes with both realistic and fantasy structures and textures.
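The two-stage alignment described in the abstract can be illustrated with a toy numerical sketch. This is not the authors' implementation: every network is replaced by a hypothetical stand-in (random linear maps in NumPy), the feature sizes are arbitrary, and names such as `M`, `R`, and `consistency` are assumptions introduced only for illustration. Stage 1 fits a mapper from "CLIP image features" to "SVR shape features"; stage 2 passes a "CLIP text feature" through the same mapper and fine-tunes it so that the feature of the "rendered image" becomes more similar to the text feature.

```python
import numpy as np

rng = np.random.default_rng(0)
D_CLIP, D_SHAPE, N = 16, 8, 200  # toy sizes, chosen for illustration only


def normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical stand-ins for the frozen networks (random data, not real models).
clip_img = normalize(rng.normal(size=(N, D_CLIP)))          # "CLIP image features"
svr_shape = clip_img @ rng.normal(size=(D_CLIP, D_SHAPE))   # "SVR shape features" of the same images
R = rng.normal(size=(D_SHAPE, D_CLIP)) / np.sqrt(D_SHAPE)   # "decode, render, CLIP-encode" as one linear map

# Stage 1: align CLIP image features with the shape space (least-squares mapper M).
M, *_ = np.linalg.lstsq(clip_img, svr_shape, rcond=None)
stage1_err = float(np.mean((clip_img @ M - svr_shape) ** 2))

# Stage 2: map a "CLIP text feature" into the shape space and fine-tune M
# to increase the similarity between the text and the "rendered image" feature.
text = normalize(rng.normal(size=D_CLIP))


def consistency(mapper):
    """Toy CLIP-consistency score: cosine between text and rendered-image features."""
    return cosine(text @ mapper @ R, text)


init_cos = consistency(M)
lr = 0.1
for _ in range(200):
    r = text @ M @ R
    norm_r = np.linalg.norm(r)
    # Gradient of cos(r, text) with respect to r (text has unit length).
    g_r = text / norm_r - (r @ text) * r / norm_r**3
    # Chain rule back to the mapper, since r = (text @ M) @ R.
    g_M = np.outer(text, R @ g_r)
    candidate = M + lr * g_M
    # Toy safeguard (an assumption of this sketch): keep a step only if it helps.
    if consistency(candidate) > consistency(M):
        M = candidate
    else:
        lr *= 0.5
final_cos = consistency(M)
```

In this sketch, fine-tuning `M` for the text also moves it away from the stage-1 fit. A real system would presumably need some way to preserve the image-to-shape alignment while optimizing for the text, which the toy example does not attempt.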