Generative models have enabled intuitive image creation and manipulation using natural language. In particular, diffusion models have recently shown remarkable results for natural image editing. In this work, we propose to apply diffusion techniques to edit textures, a specific class of images that are an essential part of 3D content creation pipelines. We analyze existing editing methods and show that they are not directly applicable to textures, since their common underlying approach, manipulating attention maps, is unsuitable for the texture domain. To address this, we propose a novel approach that instead manipulates CLIP image embeddings to condition the diffusion generation. We define editing directions using simple text prompts (e.g., "aged wood" to "new wood") and map these to CLIP image embedding space using a texture prior, with a sampling-based approach that gives us identity-preserving directions in CLIP space. To further improve identity preservation, we project these directions to a CLIP subspace that minimizes identity variations resulting from entangled texture attributes. Our editing pipeline facilitates the creation of arbitrary sliders using natural language prompts only, with no ground-truth annotated data necessary.
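The direction-and-slider idea described in the abstract can be illustrated with a small numerical sketch. This is not the authors' implementation: the texture prior and the CLIP encoder are replaced by synthetic Gaussian samples, the subspace basis is simply assumed to be given, and the names `edit_direction`, `project_to_subspace`, and `apply_slider` are hypothetical. It only shows the linear algebra one might use once embedding samples for a source prompt ("aged wood") and a target prompt ("new wood") are available.

```python
import numpy as np

def edit_direction(src_samples, tgt_samples):
    """Unit-norm difference between the mean target and mean source embeddings."""
    d = tgt_samples.mean(axis=0) - src_samples.mean(axis=0)
    return d / np.linalg.norm(d)

def project_to_subspace(direction, basis):
    """Project a direction onto the span of the orthonormal rows of `basis`."""
    p = basis.T @ (basis @ direction)
    return p / np.linalg.norm(p)

def apply_slider(embedding, direction, strength):
    """Move an embedding along the edit direction by a scalar slider value."""
    return embedding + strength * direction

# Synthetic stand-ins for embeddings sampled from a prior (assumption:
# a real pipeline would sample CLIP image embeddings conditioned on text).
rng = np.random.default_rng(0)
dim = 16
offset = np.zeros(dim)
offset[0] = 2.0  # hypothetical axis along which the attribute changes
src = rng.normal(size=(64, dim))           # e.g. samples for "aged wood"
tgt = rng.normal(size=(64, dim)) + offset  # e.g. samples for "new wood"

direction = edit_direction(src, tgt)

# Assumed subspace: the first four coordinate axes (orthonormal rows).
basis = np.eye(dim)[:4]
projected = project_to_subspace(direction, basis)

# Slide a texture embedding halfway along the projected direction.
texture = rng.normal(size=dim)
edited = apply_slider(texture, projected, strength=0.5)
```

In this toy setup the projection removes every component of the direction outside the assumed subspace, which stands in for the identity-preserving projection the abstract mentions. How that subspace is actually chosen is not specified here.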