Recent image generation models can produce high-fidelity, photorealistic images. However, they are fundamentally constrained by frozen internal knowledge and therefore often fail in real-world scenarios that are knowledge-intensive or require up-to-date information. In this paper, we present Gen-Searcher, the first attempt to train a search-augmented image generation agent, which performs multi-hop reasoning and search to collect the textual knowledge and reference images needed for grounded generation. To this end, we construct a tailored data pipeline and curate two high-quality datasets, Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k, containing diverse search-intensive prompts and their corresponding ground-truth synthesized images. We further introduce KnowGen, a comprehensive benchmark that explicitly requires search-grounded external knowledge for image generation and evaluates models along multiple dimensions. Based on these resources, we train Gen-Searcher with SFT followed by agentic reinforcement learning with dual reward feedback, which combines text-based and image-based rewards to provide more stable and informative learning signals for GRPO training. Experiments show that Gen-Searcher brings substantial gains, improving Qwen-Image by around 16 points on KnowGen and around 15 points on WISE. We hope this work can serve as an open foundation for search agents in image generation, and we fully open-source our data, models, and code.
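To make the dual reward feedback concrete, the following is a minimal sketch rather than the actual Gen-Searcher implementation. It assumes the text-based and image-based rewards are scalars in [0, 1] and are blended by a weighted sum; the abstract does not state how the two are combined. The names `dual_reward`, `group_relative_advantages`, and `text_weight` are hypothetical. The advantage step follows the usual GRPO formulation, in which each rollout's reward is normalized by the mean and standard deviation of its sampling group.

```python
from statistics import mean, pstdev


def dual_reward(text_score: float, image_score: float, text_weight: float = 0.5) -> float:
    """Blend a text-based and an image-based reward into one scalar.

    Assumption: a simple weighted sum. The weighting actually used by
    Gen-Searcher is not given in the abstract.
    """
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0, 1]")
    return text_weight * text_score + (1.0 - text_weight) * image_score


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of rollouts for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so rollouts
    are scored relative to their siblings rather than on an absolute scale.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical (text_score, image_score) pairs for four rollouts of one prompt.
rollouts = [(0.9, 0.7), (0.4, 0.6), (0.2, 0.1), (0.8, 0.8)]
rewards = [dual_reward(t, i) for t, i in rollouts]
advantages = group_relative_advantages(rewards)
```

In this sketch, a rollout that searches well but renders poorly (or the reverse) still receives partial credit, which is one plausible reading of why combining the two signals could be more informative than either alone.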