Given an original image, image editing aims to generate an image that aligns with a provided instruction. Two challenges stand out: accepting multimodal inputs as instructions, and the scarcity of high-quality training data, including the crucial triplets of source/target image pairs and multimodal (text and image) instructions. In this paper, we focus on image style editing and present StyleBooth, which comprises a comprehensive framework for image editing and a feasible strategy for building a high-quality style editing dataset. We integrate the encoded textual instruction and image exemplar into a unified condition for a diffusion model, enabling the original image to be edited following multimodal instructions. Furthermore, through iterative style-destyle tuning and editing combined with usability filtering, the StyleBooth dataset provides content-consistent stylized/plain image pairs across various categories of styles. To show the flexibility of StyleBooth, we conduct experiments on diverse tasks, including text-based style editing, exemplar-based style editing, and compositional style editing. The results demonstrate that the quality and variety of training data significantly enhance the model's ability to preserve content and improve the overall quality of generated images in editing tasks. The project page is available at https://ali-vilab.github.io/stylebooth-page/.
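
To make the data requirement concrete, the following is a minimal sketch of how one training triplet and a usability filter might be represented. It is an illustration only, not the StyleBooth implementation: the class `StyleEditTriplet`, the function `usability_filter`, the two score fields, and both thresholds are hypothetical names and values, since the abstract does not specify which metrics the filtering uses.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StyleEditTriplet:
    """One hypothetical training sample: source image, target image, instruction."""
    source_image: str              # path to the plain (de-styled) image
    target_image: str              # path to the stylized image
    text_instruction: str          # textual part of the instruction
    exemplar_image: Optional[str]  # optional style exemplar; None for text-only
    content_score: float           # assumed content-consistency score in [0, 1]
    style_score: float             # assumed style-strength score in [0, 1]


def usability_filter(
    samples: List[StyleEditTriplet],
    min_content: float = 0.8,  # assumed threshold, not from the paper
    min_style: float = 0.6,    # assumed threshold, not from the paper
) -> List[StyleEditTriplet]:
    """Keep only pairs that preserve content and show a clear style change."""
    return [
        s for s in samples
        if s.content_score >= min_content and s.style_score >= min_style
    ]


samples = [
    StyleEditTriplet("a_plain.png", "a_oil.png", "Make it an oil painting",
                     None, 0.9, 0.8),
    StyleEditTriplet("b_plain.png", "b_ink.png", "Make it an ink sketch",
                     "ink_ref.png", 0.5, 0.9),   # content drifted too far
    StyleEditTriplet("c_plain.png", "c_wc.png", "Make it a watercolor",
                     None, 0.9, 0.3),            # style barely applied
]
kept = usability_filter(samples)
```

In an iterative style-destyle scheme, the pairs that survive such a filter would be fed back to tune the stylizing and de-stylizing models for the next round; the scoring functions themselves are left out here because the source does not describe them.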