Unified multimodal understanding and generation models have improved image editing performance by incorporating fine-grained understanding into their Chain-of-Thought (CoT) process. However, a critical question remains underexplored: which forms of CoT and which training strategies can jointly enhance understanding granularity and generalization? To address this, we propose Meta-CoT, a paradigm that performs a two-level decomposition of any single-image editing operation and has two key properties. (1) Decomposability. We observe that any editing intention can be represented as a triplet: (task, target, required understanding ability). Inspired by this, Meta-CoT decomposes both the editing task and the target, generating task-specific CoT and traversing the editing operations on all targets. This decomposition sharpens the model's understanding granularity for editing operations and guides it to learn each element of the triplet during training, substantially improving its editing capability. (2) Generalizability. At the second decomposition level, we further break editing tasks down into five fundamental meta-tasks. We find that training on these five meta-tasks, together with the other two elements of the triplet, is sufficient to achieve strong generalization across diverse, unseen editing tasks. To further align the model's editing behavior with its CoT reasoning, we introduce the CoT-Editing Consistency Reward, which encourages more accurate and effective use of CoT information during editing. Experiments demonstrate that our method achieves an overall 15.8% improvement across 21 editing tasks and generalizes effectively to unseen editing tasks when trained on only a small set of meta-tasks. Our code, benchmark, and model are released at https://shiyi-zh0408.github.io/projectpages/Meta-CoT/
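As a rough illustration of the triplet view described above, the sketch below represents an editing intention as a (task, target, required understanding ability) record and enumerates one record per task and target pair. It is not the released Meta-CoT implementation: the names `EditIntent`, `decompose`, and `ability_for`, and the example labels "remove", "left cup", "right cup", and "spatial localization", are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class EditIntent:
    """One (task, target, required understanding ability) triplet."""
    task: str
    target: str
    ability: str


def decompose(tasks, targets, ability_for):
    """Enumerate one triplet per (task, target) pair.

    `ability_for` is an assumed lookup from a task label to the
    understanding ability it requires; the source does not specify
    how this pairing is actually made.
    """
    return [
        EditIntent(task, target, ability_for[task])
        for task, target in product(tasks, targets)
    ]


# Hypothetical instruction: "remove both cups" becomes two per-target steps.
steps = decompose(
    tasks=["remove"],
    targets=["left cup", "right cup"],
    ability_for={"remove": "spatial localization"},
)
for step in steps:
    print(step)
```

Traversing the targets one at a time in this way mirrors, under the stated assumptions, the idea of applying the editing operation to every target rather than treating the instruction as a single opaque request.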