Large Multimodal Models (LMMs) have made remarkable progress in generating photorealistic, prompt-aligned images, yet their outputs often contradict verifiable knowledge, especially when prompts involve fine-grained attributes or time-sensitive events. Conventional retrieval-augmented approaches attempt to address this by introducing external information, but their reliance on static sources and shallow evidence integration leaves them unable to ground generation in accurate, evolving knowledge. To bridge this gap, we introduce ORIG, an agentic open multimodal retrieval-augmented framework for Factual Image Generation (FIG), a new task that requires both visual realism and factual grounding. ORIG iteratively retrieves and filters multimodal evidence from the web and incrementally integrates the refined knowledge into enriched prompts that guide generation. To support systematic evaluation, we build FIG-Eval, a benchmark spanning ten categories across perceptual, compositional, and temporal dimensions. Experiments show that ORIG substantially improves factual consistency and overall image quality over strong baselines, highlighting the potential of open multimodal retrieval for factual image generation.
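To make the iterative retrieve-filter-integrate process concrete, the following is a minimal sketch of such a loop, assuming hypothetical `search`, `filter_evidence`, `enrich`, and `generate` interfaces; these names and the stopping condition are illustrative assumptions, not the paper's actual implementation.

```python
from typing import Callable, List

def orig_loop(
    prompt: str,
    search: Callable[[str], List[str]],                # hypothetical web retriever
    filter_evidence: Callable[[str, List[str]], List[str]],  # hypothetical relevance filter
    enrich: Callable[[str, List[str]], str],           # folds evidence into the prompt
    generate: Callable[[str], object],                 # hypothetical image generator
    max_rounds: int = 3,
) -> object:
    """Sketch of an agentic retrieval-augmented generation loop:
    iteratively retrieve and filter evidence, integrate it into an
    enriched prompt, then generate the image from the final prompt."""
    enriched = prompt
    for _ in range(max_rounds):
        candidates = search(enriched)                  # retrieve candidate evidence from the web
        evidence = filter_evidence(enriched, candidates)  # keep only factually relevant items
        if not evidence:                               # no new usable evidence; stop early
            break
        enriched = enrich(enriched, evidence)          # incrementally refine the prompt
    return generate(enriched)                          # guide generation with the enriched prompt
```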