An Image is Worth Multiple Words: Multi-attribute Inversion for Constrained Text-to-Image Synthesis

We consider the problem of constraining diffusion model outputs with a user-supplied reference image. Our key objective is to extract multiple attributes (e.g., color, object, layout, style) from this single reference image and then generate new samples with them. One line of existing work proposes to invert the reference image into a single textual conditioning vector, enabling generation of new samples with this learned token. These methods, however, do not learn the multiple tokens that are necessary to condition model outputs on the multiple attributes noted above. Another line of techniques expands the inversion space to learn multiple embeddings, but does so only along the layer dimension (e.g., one per layer of the DDPM model) or the timestep dimension (one for a set of timesteps in the denoising process), leading to suboptimal attribute disentanglement. To address these gaps, the first contribution of this paper is an extensive analysis to determine which attributes are captured in which dimension of the denoising process. As noted above, we consider both the timestep dimension (in reverse denoising) and the DDPM model layer dimension. We observe that a subset of these attributes is often captured in the same set of model layers and/or across the same denoising timesteps. For instance, color and style are captured across the same U-Net layers, whereas layout and color are captured across the same timestep stages. Consequently, an inversion process that is designed only for the timestep dimension or only for the layer dimension is insufficient to disentangle all attributes. This leads to our second contribution: a new multi-attribute inversion algorithm, MATTE, with associated disentanglement-enhancing regularization losses, that operates across both dimensions and explicitly yields four disentangled tokens (color, style, layout, and object).
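
The argument that neither dimension alone suffices can be illustrated with a small sketch. The grouping below is a hypothetical assumption chosen only to be consistent with the two observations stated above (color and style share layers; layout and color share timestep stages). The names `LAYER_GROUPS`, `TIMESTEP_STAGES`, `attributes_at`, and `separable`, the two-way split of each dimension, and the placement of the object attribute are all illustrative and are not taken from the MATTE analysis itself.

```python
from itertools import product

# Hypothetical grouping (an assumption for illustration, not the paper's
# reported assignment): which attributes are captured by which U-Net layer
# group and by which denoising timestep stage.
LAYER_GROUPS = {
    "coarse": {"layout", "object"},
    "fine": {"color", "style"},      # color and style share layers
}
TIMESTEP_STAGES = {
    "early": {"layout", "color"},    # layout and color share timestep stages
    "late": {"object", "style"},
}


def attributes_at(layer_group, stage):
    """Attributes captured jointly by one layer group and one timestep stage."""
    return LAYER_GROUPS[layer_group] & TIMESTEP_STAGES[stage]


def separable(partition):
    """True if every cell of the partition holds exactly one attribute."""
    return all(len(attrs) == 1 for attrs in partition)


# Conditioning along a single dimension leaves attributes entangled.
layer_only = list(LAYER_GROUPS.values())
timestep_only = list(TIMESTEP_STAGES.values())

# Conditioning along both dimensions isolates one attribute per cell.
joint = [attributes_at(l, s) for l, s in product(LAYER_GROUPS, TIMESTEP_STAGES)]

if __name__ == "__main__":
    print("layer only separable:   ", separable(layer_only))
    print("timestep only separable:", separable(timestep_only))
    print("joint separable:        ", separable(joint))
```

Under this assumed grouping, each (layer group, timestep stage) cell contains a single attribute, which is the intuition behind learning one token per attribute across both dimensions. The sketch says nothing about how the tokens are optimized or about the regularization losses.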
