Text-driven image generation methods have recently shown impressive results, allowing casual users to generate high-quality images from textual descriptions. However, similar capabilities for editing existing images are still out of reach. Text-driven image editing methods usually need edit masks, struggle with edits that require significant visual changes, and cannot easily preserve specific details of the edited portion. In this paper, we observe that image-generation models can be converted to image-editing models simply by fine-tuning them on a single image. We also show that initializing the stochastic sampler with a noised version of the base image before sampling, and interpolating relevant details from the base image after sampling, further increase the quality of the edit operation. Combining these observations, we propose UniTune, a novel image editing method. UniTune takes as input an arbitrary image and a textual edit description, and carries out the edit while maintaining high fidelity to the input image. UniTune does not require additional inputs, such as masks or sketches, and can perform multiple edits on the same image without retraining. We test our method using the Imagen model in a range of different use cases. We demonstrate that it is broadly applicable and can perform a surprisingly wide range of expressive editing operations, including those requiring significant visual changes that were previously impossible.
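To make the two sampling-time ideas concrete, the following is a minimal toy sketch, not the UniTune implementation. It assumes a DDPM-style forward process for the noised initialization and a plain linear blend for the post-sampling interpolation; the abstract specifies neither. The function names `noised_init` and `blend_with_base`, the parameters `alpha_bar` and `weight`, and the flat-list image representation are all hypothetical. The fine-tuned model and its sampler are omitted.

```python
import math
import random


def noised_init(base_pixels, alpha_bar, rng):
    """Return a noised copy of the base image to start the sampler from.

    Assumption: a DDPM-style forward process,
    x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps, with eps ~ N(0, 1).
    alpha_bar = 1 returns the base image unchanged; alpha_bar = 0 is pure noise.
    """
    if not 0.0 <= alpha_bar <= 1.0:
        raise ValueError("alpha_bar must lie in [0, 1]")
    signal = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    return [signal * x + sigma * rng.gauss(0.0, 1.0) for x in base_pixels]


def blend_with_base(edited_pixels, base_pixels, weight):
    """Linearly interpolate the sampled edit toward the base image.

    Assumption: a simple per-pixel blend stands in for the paper's
    interpolation step. weight = 0 keeps the edit; weight = 1 restores the base.
    """
    if len(edited_pixels) != len(base_pixels):
        raise ValueError("images must have the same number of pixels")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [(1.0 - weight) * e + weight * b
            for e, b in zip(edited_pixels, base_pixels)]


if __name__ == "__main__":
    rng = random.Random(0)
    base = [0.0, 0.5, -1.0, 0.25]          # toy "image" with values in [-1, 1]
    start = noised_init(base, 0.5, rng)    # would seed the sampler
    edited = start                         # placeholder for the sampler output
    print(blend_with_base(edited, base, 0.3))
```

In this sketch, a lower `alpha_bar` gives the sampler more freedom to depart from the base image, while a higher `weight` pulls the final result back toward it; how the paper actually balances the two is not stated in the abstract.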