Despite recent advances in UNet-based image editing, methods for shape-aware object editing in high-resolution images are still lacking. Compared to UNet, Diffusion Transformers (DiT) demonstrate a superior ability to capture long-range dependencies among image patches, leading to higher-quality image generation. In this paper, we propose DiT4Edit, the first Diffusion Transformer-based image editing framework. Specifically, DiT4Edit uses the DPM-Solver inversion algorithm to obtain the inverted latents, requiring fewer steps than the DDIM inversion algorithm commonly used in UNet-based frameworks. Additionally, we design a unified attention control mechanism and a patch-merging strategy tailored to transformer computation streams. Together, these components allow our framework to generate higher-quality edited images faster. Our design leverages the advantages of DiT, enabling it to surpass UNet-based structures in image editing, especially for high-resolution and arbitrary-sized images. Extensive experiments demonstrate the strong performance of DiT4Edit across various editing scenarios, highlighting the potential of Diffusion Transformers as a backbone for image editing.
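To make the invert-then-edit pipeline described above concrete, the sketch below outlines the overall structure under stated assumptions: a toy stand-in for the DiT denoiser, a simplified first-order inversion loop (standing in for DPM-Solver inversion, which uses far fewer steps than DDIM), and a dual-branch denoising loop where, in the real framework, the target branch would reuse the source branch's attention maps and merge redundant patch tokens. All names (ToyDiT, invert, edit) are hypothetical placeholders, not the authors' implementation.

```python
# Minimal sketch of a DiT4Edit-style editing flow, assuming a toy denoiser
# and a simplified first-order solver; illustrative only, not the paper's code.
import torch

class ToyDiT(torch.nn.Module):
    """Hypothetical stand-in for a Diffusion Transformer denoiser."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = torch.nn.Linear(dim, dim)

    def forward(self, x, t, prompt_emb):
        # A real DiT applies self-/cross-attention over patch tokens;
        # a linear layer stands in here so the sketch runs end to end.
        return self.net(x + prompt_emb) * t

def invert(model, latent, prompt_emb, steps=15):
    """Deterministic inversion: walk the latent from t=0 toward t=1.
    DPM-Solver-style inversion needs noticeably fewer steps than DDIM."""
    x = latent
    for i in range(steps):
        t = torch.tensor((i + 1) / steps)
        x = x + model(x, t, prompt_emb) / steps  # simplified update rule
    return x

def edit(model, inverted, src_emb, tgt_emb, steps=15):
    """Denoise two branches (source / target prompt). In the real framework,
    the target branch reuses the source branch's attention maps (attention
    control) and merges similar patch tokens to reduce computation."""
    x_src, x_tgt = inverted.clone(), inverted.clone()
    for i in reversed(range(steps)):
        t = torch.tensor((i + 1) / steps)
        x_src = x_src - model(x_src, t, src_emb) / steps
        x_tgt = x_tgt - model(x_tgt, t, tgt_emb) / steps
    return x_tgt

model = ToyDiT()
latent = torch.randn(1, 64)                # stand-in for VAE-encoded patch latents
src_emb, tgt_emb = torch.randn(1, 64), torch.randn(1, 64)
edited = edit(model, invert(model, latent, src_emb), src_emb, tgt_emb)
print(edited.shape)
```

The two-branch loop mirrors the usual prompt-to-prompt style of editing: the source branch reconstructs the original image from the inverted latent, while the target branch follows the edited prompt but inherits structure through shared attention.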