This paper propels creative control in generative visual AI by allowing users to "select". Departing from traditional text- or sketch-based methods, we for the first time allow users to choose visual concepts by parts for their creative endeavors. The outcome is fine-grained generation that precisely captures the selected visual concepts, ensuring a result that is holistically faithful and plausible. To achieve this, we first parse objects into parts through unsupervised feature clustering. We then encode parts into text tokens and introduce an entropy-based normalized attention loss that operates on them. This loss design enables our model to learn generic prior topological knowledge of objects' part composition, and to generalize to novel part compositions so that the generation looks holistically faithful. Lastly, we employ a bottleneck encoder to project the part tokens. This not only enhances fidelity but also accelerates learning, by leveraging shared knowledge and facilitating information exchange among instances. Visual results in the paper and supplementary material showcase the compelling power of PartCraft in crafting highly customized, innovative creations, exemplified by the "charming" and creative birds. Code is released at https://github.com/kamwoh/partcraft.
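To make the entropy-based normalized attention loss concrete, the following is a minimal illustrative sketch, not the paper's implementation: it assumes hypothetical per-part cross-attention maps (one row per part token over flattened spatial locations), normalizes each map into a probability distribution, and penalizes high Shannon entropy so that each part token attends to a localized region. The function name and tensor shapes are assumptions for illustration.

```python
import numpy as np

def part_attention_entropy_loss(attn: np.ndarray, eps: float = 1e-8) -> float:
    """Illustrative entropy loss over normalized part-token attention maps.

    attn: array of shape (num_part_tokens, num_spatial_locations),
          non-negative attention scores (shapes are assumptions).
    Returns the mean Shannon entropy across part tokens; minimizing it
    encourages each token's attention to concentrate on one region.
    """
    # Normalize each token's attention map into a distribution over space.
    p = attn / np.clip(attn.sum(axis=-1, keepdims=True), eps, None)
    # Shannon entropy per token; lower entropy = more localized attention.
    entropy = -(p * np.log(p + eps)).sum(axis=-1)
    return float(entropy.mean())

# Example: random attention maps for 4 part tokens over a 16x16 grid.
attn = np.random.rand(4, 256)
loss = part_attention_entropy_loss(attn)
```

In a diffusion-model training loop, such a term would be added to the denoising objective for the cross-attention maps of the part tokens; a one-hot (perfectly localized) map drives the loss toward zero, while a uniform map maximizes it.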