GalaxyEdit: Large-Scale Image Editing Dataset with Enhanced Diffusion Adapter

Training large-scale text-to-image and image-to-image models requires large amounts of annotated data. While text-to-image datasets are abundant, data for instruction-based image-to-image tasks such as object addition and removal is limited. The data generation process faces several challenges: significant human effort, limited automation, suboptimal end-to-end models, constraints on data diversity, and high cost. We propose an automated data generation pipeline aimed at alleviating these limitations, and introduce GalaxyEdit, a large-scale image editing dataset for add and remove operations. We fine-tune the SD v1.5 model on our dataset and find that the resulting model handles a broader range of objects and more complex editing instructions, outperforming state-of-the-art methods in FID score by 11.2% and 26.1% on the add and remove tasks, respectively. Furthermore, with on-device usage in mind, we extend our work to task-specific lightweight adapters built on the ControlNet-xs architecture. While ControlNet-xs excels at Canny- and depth-guided generation, we propose to improve the communication between the control network and the U-Net for the more intricate add and remove tasks. We achieve this by enhancing ControlNet-xs with non-linear interaction layers based on Volterra filters. Our approach outperforms ControlNet-xs in both add/remove and Canny-guided image generation, highlighting the effectiveness of the proposed enhancement.
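
The abstract does not spell out the form of the Volterra-based interaction layers, so the following is only a minimal sketch of the general idea behind a Volterra filter: a linear term plus a quadratic term that multiplies pairs of inputs. It uses a textbook second-order filter on a 1D signal and is not the authors' layer. The names `volterra2`, `h1`, and `h2` are hypothetical and chosen for illustration.

```python
import numpy as np

def volterra2(x, h1, h2):
    """Generic second-order Volterra filter with memory M = len(h1).

    y[n] = sum_i h1[i] * x[n-i] + sum_{i,j} h2[i, j] * x[n-i] * x[n-j]

    Samples before the start of the signal are treated as zero.
    Illustrative only: this is NOT the interaction layer from the paper.
    """
    x = np.asarray(x, dtype=float)
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    M = h1.shape[0]
    if h2.shape != (M, M):
        raise ValueError("h2 must have shape (len(h1), len(h1))")

    # Zero-pad the past so that every output sample sees M inputs.
    padded = np.concatenate([np.zeros(M - 1), x])
    y = np.empty_like(x)
    for n in range(len(x)):
        # window[i] holds x[n - i], i.e. the most recent sample first.
        window = padded[n:n + M][::-1]
        linear = h1 @ window               # first-order (linear) kernel
        quadratic = window @ h2 @ window   # second-order (pairwise) kernel
        y[n] = linear + quadratic
    return y

# Example: with only h2[0, 1] set, the filter multiplies neighbouring samples.
signal = [1.0, 2.0, 3.0]
cross = volterra2(signal, h1=[0.0, 0.0], h2=[[0.0, 1.0], [0.0, 0.0]])
```

With `h2` set to zero the filter reduces to an ordinary linear FIR filter. The quadratic kernel is what lets the output depend on products of inputs, which is the kind of non-linear interaction the abstract refers to.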
