Text-to-image (T2I) diffusion models such as SDXL and FLUX have achieved impressive photorealism, yet small-scale distortions remain pervasive in regions such as limbs, faces, and rendered text. Existing refinement approaches either perform costly iterative re-generation or rely on vision-language models (VLMs) with weak spatial grounding, leading to semantic drift and unreliable local edits. To close this gap, we propose Agentic Retoucher, a hierarchical decision-driven framework that reformulates post-generation correction as a human-like perception-reasoning-action loop. Specifically, we design (1) a perception agent that learns contextual saliency for fine-grained distortion localization under text-image consistency cues, (2) a reasoning agent that performs human-aligned inferential diagnosis via progressive preference alignment, and (3) an action agent that adaptively plans localized inpainting guided by user preferences. This design integrates perceptual evidence, linguistic reasoning, and controllable correction into a unified, self-corrective decision process. To enable fine-grained supervision and quantitative evaluation, we further construct GenBlemish-27K, a dataset of 6K T2I images with 27K annotated artifact regions spanning 12 categories. Extensive experiments demonstrate that Agentic Retoucher consistently outperforms state-of-the-art methods in perceptual quality, distortion localization, and human preference alignment, establishing a new paradigm for self-corrective and perceptually reliable T2I generation.
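To make the perception-reasoning-action loop concrete, the sketch below shows one plausible control flow for the three agents. It is a minimal illustration only: every class, method, and field name (`Artifact`, `localize`, `diagnose`, `inpaint`, `max_rounds`) is a hypothetical placeholder, not the paper's actual interface.

```python
# Minimal sketch of a hierarchical perception -> reasoning -> action loop.
# All identifiers are illustrative assumptions, not the authors' implementation.
from dataclasses import dataclass

@dataclass
class Artifact:
    bbox: tuple       # (x0, y0, x1, y1) region of a detected distortion
    category: str     # e.g. "limb", "face", "text" (one of 12 categories)
    severity: float   # confidence / severity score in [0, 1]

def retouch(image, prompt, perception, reasoning, action, max_rounds=3):
    """Iterate perceive -> reason -> act until no actionable artifacts remain."""
    for _ in range(max_rounds):
        # 1. Perception agent: localize fine-grained distortions,
        #    using text-image consistency cues from the prompt.
        artifacts = perception.localize(image, prompt)
        if not artifacts:
            break  # self-corrective loop has converged
        # 2. Reasoning agent: diagnose each region and plan edits
        #    aligned with human preference.
        plan = reasoning.diagnose(image, prompt, artifacts)
        # 3. Action agent: apply localized inpainting only inside the
        #    planned regions, leaving the rest of the image untouched.
        image = action.inpaint(image, plan)
    return image
```

Bounding the loop at a fixed round budget reflects the abstract's framing of correction as a self-corrective decision process rather than unbounded re-generation.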