Text-to-image diffusion models often make implicit assumptions about the world when generating images. While some assumptions are useful (e.g., the sky is blue), they can also be outdated, incorrect, or reflective of social biases present in the training data. Thus, there is a need to control these assumptions without requiring explicit user input or costly re-training. In this work, we aim to edit a given implicit assumption in a pre-trained diffusion model. Our Text-to-Image Model Editing method, TIME for short, receives a pair of inputs: a "source" under-specified prompt for which the model makes an implicit assumption (e.g., "a pack of roses"), and a "destination" prompt that describes the same setting, but with a specified desired attribute (e.g., "a pack of blue roses"). TIME then updates the model's cross-attention layers, as these layers assign visual meaning to textual tokens. We edit the projection matrices in these layers such that the source prompt is projected close to the destination prompt. Our method is highly efficient, as it modifies a mere 2.2% of the model's parameters in under one second. To evaluate model editing approaches, we introduce TIMED (TIME Dataset), containing 147 source and destination prompt pairs from various domains. Our experiments (using Stable Diffusion) show that TIME edits the model successfully, generalizes well to related prompts unseen during editing, and has minimal effect on unrelated generations.
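To make the projection-editing idea concrete, the following is a minimal sketch rather than the exact procedure. The abstract does not state the objective, so the sketch assumes a ridge-regularized least-squares formulation: find a new matrix that maps each source token embedding close to where the original matrix maps the corresponding destination token embedding, while staying near the original weights. The function name `edit_projection`, the regularization strength `lam`, and the one-to-one pairing of source and destination embeddings are illustrative assumptions.

```python
import numpy as np


def edit_projection(W, src_emb, dst_emb, lam=0.1):
    """Closed-form edit of one projection matrix (illustrative sketch).

    Minimizes  sum_i ||W_new c_i - W c*_i||^2 + lam * ||W_new - W||_F^2
    where c_i are source embeddings and c*_i are destination embeddings.

    W:       (d_out, d_in) original projection matrix
    src_emb: (n, d_in) source token embeddings, one row per token
    dst_emb: (n, d_in) destination token embeddings, paired row by row
    lam:     assumed regularization strength, must be > 0
    """
    # Targets: where the ORIGINAL matrix sends the destination embeddings.
    targets = dst_emb @ W.T                                   # (n, d_out)
    # Setting the gradient to zero gives W_new @ B = A with:
    A = lam * W + targets.T @ src_emb                         # (d_out, d_in)
    B = lam * np.eye(W.shape[1]) + src_emb.T @ src_emb        # (d_in, d_in)
    # B is symmetric positive definite, so A @ inv(B) == solve(B, A.T).T
    return np.linalg.solve(B, A.T).T


def fit_error(W_new, W, src_emb, dst_emb):
    """Squared distance between projected source and original projected destination."""
    return float(np.sum((src_emb @ W_new.T - dst_emb @ W.T) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 6))      # toy sizes, not the real model dimensions
    src = rng.normal(size=(3, 6))
    dst = rng.normal(size=(3, 6))
    W_new = edit_projection(W, src, dst, lam=0.1)
    print("error before edit:", fit_error(W, W, src, dst))
    print("error after edit: ", fit_error(W_new, W, src, dst))
```

Because the update under this assumed objective is a single linear solve per matrix, it involves no gradient steps, which is consistent with an edit that completes in under one second. A larger `lam` keeps the edited matrix closer to the original, trading edit strength for less impact on unrelated prompts.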