CARE-Edit: Condition-Aware Routing of Experts for Contextual Image Editing

Unified diffusion editors often rely on a fixed, shared backbone for diverse tasks, which leads to task interference and poor adaptation to heterogeneous demands (e.g., local vs. global, semantic vs. photometric). In particular, prevalent ControlNet and OmniControl variants combine multiple conditioning signals (e.g., text, mask, reference) via static concatenation or additive adapters, which cannot dynamically prioritize or suppress conflicting modalities. The result is artifacts such as color bleeding across mask boundaries, identity or style drift, and unpredictable behavior under multi-condition inputs. To address this, we propose Condition-Aware Routing of Experts (CARE-Edit), which aligns model computation with specific editing competencies. At its core, a lightweight latent-attention router assigns encoded diffusion tokens to four specialized experts (Text, Mask, Reference, and Base) based on multi-modal conditions and diffusion timesteps. The method proceeds in three steps: (i) a Mask Repaint module first refines coarse user-defined masks for precise spatial guidance; (ii) the router applies sparse top-K selection to dynamically allocate computation to the most relevant experts; (iii) a Latent Mixture module then fuses the expert outputs, coherently integrating semantic, spatial, and stylistic information into the base image. Experiments validate CARE-Edit's strong performance on contextual editing tasks, including erasure, replacement, text-driven edits, and style transfer. Empirical analysis further reveals task-specific behavior of the specialized experts, underscoring the importance of dynamic, condition-aware processing for mitigating multi-condition conflicts.
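
The sparse top-K routing and expert fusion described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the abstract specifies a latent-attention router, whereas this sketch substitutes a plain linear scoring layer, and the experts are stand-in linear maps. The names `route_top_k`, `mix_experts`, `cond`, `router_w`, and `expert_ws`, the additive condition embedding, and the choice `k=2` are all assumptions made for illustration.

```python
import numpy as np

# The four experts named in the abstract; the order is an arbitrary choice here.
EXPERTS = ("text", "mask", "reference", "base")


def softmax(x, axis=-1):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def route_top_k(tokens, cond, router_w, k=2):
    """Return (num_tokens, num_experts) gates with k nonzero weights per token.

    Assumption: a linear router over tokens shifted by one condition/timestep
    embedding `cond`. The paper's router is attention-based; this is a stand-in.
    """
    logits = (tokens + cond) @ router_w                      # (N, E)
    # Indices of the k largest logits for each token.
    top_idx = np.argpartition(logits, -k, axis=-1)[:, -k:]
    # Keep only the selected logits; everything else is masked to -inf.
    masked = np.full_like(logits, -np.inf)
    selected = np.take_along_axis(logits, top_idx, axis=-1)
    np.put_along_axis(masked, top_idx, selected, axis=-1)
    return softmax(masked, axis=-1)


def mix_experts(tokens, gates, expert_ws):
    """Fuse expert outputs as a gate-weighted sum (stand-in for Latent Mixture)."""
    out = np.zeros_like(tokens)
    for e, w in enumerate(expert_ws):
        active = gates[:, e] > 0          # tokens routed to expert e
        if active.any():
            # Only routed tokens are processed, which is where sparsity saves compute.
            out[active] += gates[active, e:e + 1] * (tokens[active] @ w)
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(6, 8))      # 6 latent tokens, dimension 8
    cond = rng.normal(size=(8,))          # hypothetical condition + timestep embedding
    router_w = rng.normal(size=(8, len(EXPERTS)))
    expert_ws = [rng.normal(size=(8, 8)) for _ in EXPERTS]

    gates = route_top_k(tokens, cond, router_w, k=2)
    mixed = mix_experts(tokens, gates, expert_ws)
    print(gates.round(3))
    print(mixed.shape)
```

Each row of `gates` has exactly `k` nonzero entries that sum to 1, so every token is processed by only its selected experts. Renormalizing over the selected experts (rather than over all four) is one common convention for sparse mixtures; the abstract does not state which variant CARE-Edit uses.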
