We study text-based image editing (TBIE) of a single image through counterfactual inference, an elegant formulation that precisely captures the core requirement: the edited image should retain the fidelity of the original. Through the lens of this formulation, we find that the crux of TBIE is that existing techniques hardly achieve a good trade-off between editability and fidelity, mainly due to overfitting during single-image fine-tuning. To this end, we propose a Doubly Abductive Counterfactual inference framework (DAC). We first parameterize an exogenous variable as a UNet LoRA, whose abduction encodes all the image details. Second, we abduct another exogenous variable, parameterized by a text encoder LoRA, which recovers the editability lost to the overfitted first abduction. Because the second abduction exclusively encodes the visual transition from post-edit to pre-edit, its inversion -- subtracting the LoRA -- effectively reverts pre-edit back to post-edit, thereby accomplishing the edit. Extensive experiments show that DAC achieves a good trade-off between editability and fidelity, supporting a wide spectrum of user editing intents, including addition, removal, manipulation, replacement, style transfer, and facial change, all validated both qualitatively and quantitatively. Code is available at https://github.com/xuesong39/DAC.
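To make the two abductions and the subtraction-based inversion concrete, below is a minimal sketch of the underlying LoRA mechanics. It is not the released implementation: the `LoRALinear` class, the `set_lora_scale` helper, and the training procedure described in the comments are illustrative assumptions about how a scaled low-rank residual can be added during abduction and negated at sampling time.

```python
# Conceptual sketch of the doubly abductive idea (assumed, simplified setup;
# the real DAC code lives at https://github.com/xuesong39/DAC).
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a scaled low-rank residual: y = Wx + s * B(Ax)."""

    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base.requires_grad_(False)  # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = scale  # +1.0 during abduction, -1.0 to "subtract" the LoRA

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())


def set_lora_scale(module: nn.Module, scale: float) -> None:
    """Flip the sign of every LoRA residual inside `module` (e.g. a text encoder)."""
    for m in module.modules():
        if isinstance(m, LoRALinear):
            m.scale = scale


# Intended usage, per the abstract (hypothetical training loops omitted):
#   1. Abduction 1: fit UNet LoRAs (scale=+1) so the diffusion model
#      reconstructs the single source image, encoding its details.
#   2. Abduction 2: with the UNet LoRAs frozen, fit text-encoder LoRAs
#      (scale=+1) under the *target* prompt so they map post-edit semantics
#      back to the pre-edit image.
#   3. Inversion: call set_lora_scale(text_encoder, -1.0) at sampling time,
#      reverting the pre-edit mapping and yielding the edited image.
```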