Diffusion models enable high-fidelity image editing but can also be misused for unauthorized style imitation and harmful content generation. To mitigate these risks, proactive image protection methods embed small, often imperceptible adversarial perturbations into images before release to disrupt downstream editing or fine-tuning. In realistic post-release scenarios, however, content owners cannot control downstream processing pipelines, and protections optimized against a surrogate model may fail when attackers use mismatched diffusion pipelines. Existing purification methods can weaken protections but often sacrifice image quality and rarely examine architectural mismatch. We introduce a unified post-release purification framework to evaluate protection survivability under model mismatch, and we propose two practical purifiers: VAE-Trans, which corrects protected images via latent-space projection, and EditorClean, which performs instruction-guided reconstruction with a Diffusion Transformer to exploit architectural heterogeneity. Both operate without knowledge of how the images were protected or of the defense's internals. Across 2,100 editing tasks and six representative protection methods, EditorClean consistently restores editability: relative to protected inputs, it improves PSNR by 3-6 dB and reduces FID by 50-70% on downstream edits, and it outperforms prior purification baselines by roughly 2 dB in PSNR and 30% in FID. Our results reveal a "purify once, edit freely" failure mode: once purification succeeds, the protective signal is largely removed and unrestricted editing becomes possible. This highlights the need to evaluate protections under model mismatch and to design defenses robust to heterogeneous attackers.
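To make the latent-space projection idea behind VAE-Trans concrete, below is a minimal sketch of a VAE encode-decode round trip, assuming a Stable Diffusion VAE loaded via the `diffusers` library. The model checkpoint, preprocessing, and any extra projection steps are assumptions for illustration; the paper's actual purifier may differ.

```python
# Sketch of a VAE round-trip purifier in the spirit of VAE-Trans.
# Assumption: the sd-vae-ft-mse checkpoint stands in for the surrogate VAE.
import torch
import numpy as np
from PIL import Image
from diffusers import AutoencoderKL

device = "cuda" if torch.cuda.is_available() else "cpu"
vae = AutoencoderKL.from_pretrained(
    "stabilityai/sd-vae-ft-mse", torch_dtype=torch.float32
).to(device)

def purify(image: Image.Image) -> Image.Image:
    """Encode the (possibly protected) image into the VAE latent space and
    decode it back; off-manifold adversarial perturbations are projected
    toward the natural-image manifold learned by the autoencoder."""
    x = torch.from_numpy(np.array(image.convert("RGB"))).float() / 127.5 - 1.0
    x = x.permute(2, 0, 1).unsqueeze(0).to(device)       # NCHW in [-1, 1]
    with torch.no_grad():
        latents = vae.encode(x).latent_dist.mean          # deterministic projection
        recon = vae.decode(latents).sample
    recon = ((recon.clamp(-1, 1) + 1) * 127.5).squeeze(0)
    return Image.fromarray(recon.permute(1, 2, 0).byte().cpu().numpy())

# Usage (hypothetical file names):
# purified = purify(Image.open("protected.png"))
# purified.save("purified.png")
```

Using the deterministic latent mean rather than a stochastic sample keeps the round trip reproducible, which matters when the purifier is evaluated across many editing tasks.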
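EditorClean's instruction-guided reconstruction can be approximated with a DiT-based image-to-image pipeline. The sketch below is an assumption-laden stand-in: it uses the FLUX.1 image-to-image pipeline from `diffusers` (a rectified-flow Diffusion Transformer) with a neutral reconstruction instruction and a hypothetical low strength; the paper's actual editor, instruction, and schedule are not specified here.

```python
# Sketch of DiT-based instruction-guided reconstruction in the spirit of
# EditorClean. Assumption: FLUX.1-dev stands in for the heterogeneous editor.
import torch
from PIL import Image
from diffusers import FluxImg2ImgPipeline

pipe = FluxImg2ImgPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")  # assumes a GPU is available

def editor_clean(image: Image.Image,
                 instruction: str = "reconstruct this photo faithfully") -> Image.Image:
    """Lightly re-generate the image under a neutral instruction. Because
    the DiT backbone is architecturally mismatched with the UNet surrogate
    the perturbations were crafted against, the protective signal is washed
    out while image content is largely preserved."""
    return pipe(
        prompt=instruction,
        image=image,
        strength=0.3,             # hypothetical: low strength to preserve content
        guidance_scale=3.5,
        num_inference_steps=28,
    ).images[0]

# purified = editor_clean(Image.open("protected.png"))  # hypothetical path
```

The key design point this illustrates is architectural heterogeneity: the purifier deliberately uses a backbone family the protection was never optimized against.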