CAMEO: A Conditional and Quality-Aware Multi-Agent Image Editing Orchestrator

Conditional image editing aims to modify a source image according to a textual prompt and optional reference guidance. It is crucial in scenarios that require strict structural control (e.g., anomaly insertion in driving scenes and complex human pose transformation). Despite recent advances in large-scale editing models (e.g., Seedream and Nano Banana), most approaches rely on single-step generation. This paradigm often lacks explicit quality control, may deviate excessively from the original image, and frequently produces structural artifacts or environment-inconsistent modifications, so acceptable results typically require manual prompt tuning. We propose \textbf{CAMEO}, a structured multi-agent framework that reformulates conditional editing as a quality-aware, feedback-driven process rather than a one-shot generation task. CAMEO decomposes editing into coordinated stages of planning, structured prompting, hypothesis generation, and adaptive reference grounding, in which external guidance is invoked only when task complexity requires it. To address the lack of intrinsic quality control in existing methods, CAMEO embeds evaluation directly within the editing loop: intermediate results are iteratively refined through structured feedback, forming a closed-loop process that progressively corrects structural and contextual inconsistencies. We evaluate CAMEO on anomaly insertion and human pose switching tasks. Across multiple strong editing backbones and independent evaluation models, CAMEO consistently achieves, on average, a 20\% higher win rate than multiple state-of-the-art models, demonstrating improved robustness, controllability, and structural reliability in conditional image editing.
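The closed-loop process described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the score scale in [0, 1], the acceptance threshold, the round limit, and the choice to re-edit from the source image each round are all assumptions made for illustration, and plain strings stand in for images and model calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Evaluation:
    score: float   # assumed quality score in [0, 1]
    feedback: str  # critique fed into the next round's prompt


@dataclass
class EditResult:
    image: str        # strings stand in for images in this sketch
    score: float
    round_found: int  # round that produced the best-scoring candidate


def run_editing_loop(
    source: str,
    instruction: str,
    plan: Callable[[str], List[str]],
    build_prompt: Callable[[List[str], str], str],
    edit: Callable[[str, str, Optional[str]], str],
    evaluate: Callable[[str, str, str], Evaluation],
    needs_reference: Callable[[str], bool],
    fetch_reference: Callable[[str], str],
    threshold: float = 0.8,  # hypothetical acceptance threshold
    max_rounds: int = 3,     # hypothetical round budget
) -> EditResult:
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    steps = plan(instruction)
    # Adaptive grounding: fetch a reference only when the task calls for it.
    reference = fetch_reference(instruction) if needs_reference(instruction) else None

    feedback = ""
    best: Optional[EditResult] = None
    for round_idx in range(1, max_rounds + 1):
        prompt = build_prompt(steps, feedback)
        candidate = edit(source, prompt, reference)  # assumption: always edit from the source
        verdict = evaluate(source, candidate, instruction)
        if best is None or verdict.score > best.score:
            best = EditResult(candidate, verdict.score, round_idx)
        if verdict.score >= threshold:
            break  # good enough: stop refining
        feedback = verdict.feedback  # close the loop with the critique
    assert best is not None
    return best


# Toy stand-ins for the planner, prompter, editor, and evaluator agents.
def toy_plan(instruction: str) -> List[str]:
    return [instruction]


def toy_prompt(steps: List[str], feedback: str) -> str:
    base = "; ".join(steps)
    return f"{base} [fix: {feedback}]" if feedback else base


def toy_edit(source: str, prompt: str, reference: Optional[str]) -> str:
    suffix = f" (ref: {reference})" if reference else ""
    return f"{source} <- {prompt}{suffix}"


def toy_evaluate(source: str, candidate: str, instruction: str) -> Evaluation:
    # Pretend that any candidate produced with feedback passes the check.
    if "[fix:" in candidate:
        return Evaluation(0.9, "")
    return Evaluation(0.5, "shadow direction inconsistent with scene")


def toy_needs_reference(instruction: str) -> bool:
    return "pose" in instruction


def toy_fetch_reference(instruction: str) -> str:
    return "pose_ref.png"


result = run_editing_loop(
    "street.png", "insert a fallen tree on the road",
    toy_plan, toy_prompt, toy_edit, toy_evaluate,
    toy_needs_reference, toy_fetch_reference,
)
print(result.round_found, result.score)
```

With these toy agents, the first candidate is rejected, its critique is folded into the second prompt, and the second candidate is accepted. A real system would replace each stub with a call to a planning, editing, or evaluation model.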
