We propose an unsupervised model for instruction-based image editing that eliminates the need for ground-truth edited images during training. Existing supervised methods depend on datasets of triplets containing an input image, an edited image, and an edit instruction. These triplets are generated either by existing editing methods or by human annotation, which introduces biases and limits generalization. Our method addresses these challenges through a novel editing mechanism, Cycle Edit Consistency (CEC), which applies forward and backward edits in a single training step and enforces consistency in both the image and attention spaces. This allows us to bypass the need for ground-truth edited images and, for the first time, to train on datasets comprising either real image-caption pairs or image-caption-edit triplets. We empirically show that our unsupervised technique performs better across a broader range of edits, with high fidelity and precision. By introducing CEC, eliminating the need for pre-existing triplet datasets, and reducing the biases associated with supervised methods, our work marks a significant step toward scaling instruction-based image editing.
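To make the idea of Cycle Edit Consistency concrete, the sketch below shows what a single forward/backward training step could look like in a PyTorch-style setup. It is only a minimal illustration under assumptions: the `edit_model` interface, the inverse instruction being available, and the specific L1/MSE consistency losses are hypothetical placeholders, not the paper's actual architecture or objective.

```python
# Minimal sketch of a Cycle Edit Consistency (CEC) training step.
# Assumptions (not from the paper): edit_model is a callable returning an
# edited image plus its cross-attention maps, and an inverse instruction
# that undoes the edit is provided with each sample.
import torch
import torch.nn.functional as F


def cec_training_step(edit_model, image, instruction, inverse_instruction):
    """One forward/backward edit cycle with consistency losses.

    Args:
        edit_model: callable (image, instruction) -> (edited_image, attn_maps).
        image: input image tensor of shape (B, C, H, W).
        instruction: forward edit instruction, e.g. "make the sky orange".
        inverse_instruction: instruction undoing the edit, e.g. "make the sky blue".
    """
    # Forward edit: apply the instruction to the real input image.
    edited, attn_fwd = edit_model(image, instruction)

    # Backward edit: apply the inverse instruction to the edited image.
    reconstructed, attn_bwd = edit_model(edited, inverse_instruction)

    # Image-space cycle consistency: the round trip should recover the
    # original input, so no ground-truth edited image is required.
    loss_image = F.l1_loss(reconstructed, image)

    # Attention-space consistency: forward and backward edits should attend
    # to the same spatial regions, keeping the edit localized.
    loss_attn = F.mse_loss(attn_fwd, attn_bwd)

    return loss_image + loss_attn
```

In this reading, supervision comes entirely from the cycle itself: the image-space term anchors reconstruction of the unedited input, while the attention-space term constrains where the edit is applied, which is how the method can train on image-caption pairs without ground-truth edited images.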