Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer

Text-guided color editing in images and videos is a fundamental yet unsolved problem. It requires fine-grained manipulation of color attributes, including albedo, light source color, and ambient lighting, while preserving physical consistency in geometry, material properties, and light-matter interactions. Existing training-free methods apply broadly across editing tasks, but they struggle with precise color control and often introduce visual inconsistency in both edited and non-edited regions. In this work, we present ColorCtrl, a training-free color editing method that leverages the attention mechanisms of modern Multi-Modal Diffusion Transformers (MM-DiT). By disentangling structure and color through targeted manipulation of attention maps and value tokens, our method enables accurate and consistent color editing, along with word-level control of attribute intensity. It modifies only the regions specified by the prompt and leaves unrelated areas untouched. Extensive experiments on both SD3 and FLUX.1-dev demonstrate that ColorCtrl outperforms existing training-free approaches and achieves state-of-the-art performance in both edit quality and consistency. Furthermore, our method surpasses strong commercial models such as FLUX.1 Kontext Max and GPT-4o Image Generation in terms of consistency. When extended to video models such as CogVideoX, its advantages become more pronounced, particularly in temporal coherence and editing stability. Finally, our method also generalizes to instruction-based editing diffusion models such as Step1X-Edit and FLUX.1 Kontext dev, further demonstrating its versatility.
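The abstract does not spell out how attention maps and value tokens are manipulated, so the following is only a generic illustration of the underlying idea, not ColorCtrl's actual procedure. It assumes a two-branch setup common in training-free editing: a source branch supplies the attention map (which token attends to which, a rough proxy for structure), while an edit branch supplies the value tokens (the content being mixed, here a stand-in for color). All function names, the `reuse_structure` flag, and the toy tensors are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    """Scaled dot-product attention weights, one row per query token."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        for q in queries
    ]


def apply_attention(attn, values):
    """Weighted sum of value tokens for each attention row."""
    dim = len(values[0])
    return [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
        for row in attn
    ]


def edit_step(src_q, src_k, edit_q, edit_k, edit_v, reuse_structure=True):
    """Hypothetical helper (an assumption, not the paper's algorithm).

    With reuse_structure=True the attention map is taken from the source
    branch, so the mixing pattern follows the source image, while the
    value tokens still come from the edit branch.
    """
    if reuse_structure:
        attn = attention_map(src_q, src_k)
    else:
        attn = attention_map(edit_q, edit_k)
    return apply_attention(attn, edit_v)


# Toy example: two tokens, 2-D queries/keys, 3-D "color-like" values.
src_q = [[1.0, 0.0], [0.0, 1.0]]
src_k = [[1.0, 0.0], [0.0, 1.0]]
edit_q = [[0.0, 1.0], [1.0, 0.0]]
edit_k = [[1.0, 0.0], [0.0, 1.0]]
edit_v = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

shared = edit_step(src_q, src_k, edit_q, edit_k, edit_v, reuse_structure=True)
free = edit_step(src_q, src_k, edit_q, edit_k, edit_v, reuse_structure=False)
print(shared[0], free[0])
```

In this toy setting, reusing the source attention map keeps the first token attending mainly to the first value token even though the edit-branch queries would have pointed it elsewhere. That is the sense in which an attention map can carry layout while value tokens carry appearance.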
