CRAFT: Continuous Reasoning and Agentic Feedback Tuning for Multimodal Text-to-Image Generation

Recent work has shown that inference-time reasoning and reflection can improve text-to-image generation without retraining. However, existing approaches often rely on implicit, holistic critiques or unconstrained prompt rewrites, making their behavior difficult to interpret, control, or stop reliably. In contrast, large language models have benefited from explicit, structured forms of **thinking** based on verification, targeted correction, and early stopping. We introduce CRAFT (Continuous Reasoning and Agentic Feedback Tuning), a training-free and model-agnostic framework for multimodal image generation. CRAFT transforms a user prompt into a set of explicit, dependency-structured visual constraints, verifies generated images using a vision-language model, and performs targeted prompt updates only when specific constraints are violated. This iterative process includes an explicit stopping criterion, resulting in an interpretable and controllable inference-time refinement loop. Across multiple model families and challenging benchmarks, CRAFT consistently improves compositional accuracy, text rendering, and preference-based evaluations, with particularly strong gains for lightweight generators. Importantly, these improvements incur only a negligible inference-time overhead, allowing smaller or cheaper models to approach the quality of substantially more expensive systems. Our results suggest that explicitly structured, constraint-driven inference-time reasoning is a key ingredient for improving the reliability of multimodal generative models.
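
The refinement loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Constraint`, `violated_constraints`, `refine`, `toy_generate`, and `toy_verify` are hypothetical, the prompt-update rule is a simple string append, and the "generator" and "verifier" are toy stand-ins for a real text-to-image model and vision-language model. It assumes the constraint dependencies form an acyclic graph and that every dependency name refers to a constraint in the list.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Constraint:
    """One explicit visual constraint (hypothetical structure)."""
    name: str
    description: str
    depends_on: Tuple[str, ...] = ()  # names of prerequisite constraints


def violated_constraints(constraints: List[Constraint],
                         verdicts: Dict[str, bool]) -> List[Constraint]:
    """Return failed constraints whose prerequisites all passed.

    A constraint with a failed prerequisite is skipped for now, so the
    update targets the root cause first.
    """
    failed = []
    for c in constraints:
        if verdicts.get(c.name, False):
            continue  # constraint satisfied
        if all(verdicts.get(dep, False) for dep in c.depends_on):
            failed.append(c)
    return failed


def refine(prompt: str,
           constraints: List[Constraint],
           generate: Callable[[str], str],
           verify: Callable[[str, Constraint], bool],
           max_rounds: int = 4):
    """Generate, verify, and update the prompt only for violated constraints.

    Stops as soon as no constraint is violated, or after max_rounds.
    Returns the last verified image and the per-round violation history.
    """
    history: List[List[str]] = []
    image = None
    for _ in range(max_rounds):
        image = generate(prompt)
        verdicts = {c.name: verify(image, c) for c in constraints}
        failed = violated_constraints(constraints, verdicts)
        history.append([c.name for c in failed])
        if not failed:
            break  # explicit stopping criterion: all constraints hold
        # Targeted update (assumed rule): mention only the violated constraints.
        for c in failed:
            prompt = f"{prompt}. Make sure the image shows: {c.description}"
    return image, history


# Toy stand-ins: the "image" is just the prompt text, and the "verifier"
# checks whether the constraint description appears in it.
def toy_generate(prompt: str) -> str:
    return prompt


def toy_verify(image: str, constraint: Constraint) -> bool:
    return constraint.description in image


CONSTRAINTS = [
    Constraint("red_cube", "a red cube"),
    Constraint("blue_sphere", "a blue sphere"),
    Constraint("layout", "the cube to the left of the sphere",
               depends_on=("red_cube", "blue_sphere")),
]

if __name__ == "__main__":
    _, rounds = refine("a red cube and a sphere", CONSTRAINTS,
                       toy_generate, toy_verify)
    print(rounds)
```

In this toy run the layout constraint is not reported until the missing blue sphere has been fixed, which mirrors the idea of dependency-structured constraints: the update in each round addresses only the violations that can be meaningfully checked at that point.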
