This paper presents CLIPXPlore, a new framework that leverages a vision-language model to guide the exploration of the 3D shape space. Many recent methods encode 3D shapes into a learned latent shape space to enable generative design and modeling. Yet existing methods lack effective mechanisms for exploring this space, despite the rich information it encodes. To this end, we propose to leverage CLIP, a powerful pre-trained vision-language model, to aid shape-space exploration. Our idea is threefold. First, we couple the CLIP and shape spaces by generating paired CLIP and shape codes through sketch images and training a mapper network to connect the two spaces. Second, to explore the space around a given shape, we formulate a co-optimization strategy to search for the CLIP code that better matches the geometry of the shape. Third, we design three exploration modes (binary-attribute-guided, text-guided, and sketch-guided) to locate suitable exploration trajectories in the shape space and induce meaningful changes to the shape. We perform a series of experiments to compare CLIPXPlore quantitatively and visually with different baselines in each of the three exploration modes, showing that CLIPXPlore can produce many meaningful exploration results that existing solutions cannot achieve.