Anywhere: A Multi-Agent Framework for Reliable and Diverse Foreground-Conditioned Image Inpainting

Recent advances in image inpainting, particularly those based on diffusion models, have yielded promising results. However, in foreground-conditioned inpainting, where the rest of an image must be completed around given foreground objects, current end-to-end methods suffer from "over-imagination", inconsistency between foreground and background, and limited diversity. In response, we introduce Anywhere, a multi-agent framework designed to address these issues. Anywhere is a pipeline of several agents, including a Visual Language Model (VLM), a Large Language Model (LLM), and image generation models, organized into three principal components: the prompt generation module, the image generation module, and the outcome analyzer. The prompt generation module conducts a semantic analysis of the input foreground image, using the VLM to predict relevant language descriptions and the LLM to recommend optimal language prompts. In the image generation module, a text-guided Canny-to-image generation model creates a template image from the edge map of the foreground image and the language prompts, and an image refiner produces the final outcome by blending the input foreground with the template image. The outcome analyzer uses the VLM to evaluate the rationality of the image content, the aesthetic score, and foreground-background relevance, triggering prompt and image regeneration as needed. Extensive experiments demonstrate that Anywhere excels in foreground-conditioned image inpainting: it mitigates "over-imagination", resolves foreground-background discrepancies, and enhances diversity, producing more reliable and diverse results.
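
The control flow described above (prompt generation, image generation, then analysis with regeneration on failure) can be sketched as a simple loop. This is a minimal illustration only, not the authors' implementation: every name here (`Verdict`, `Outcome`, `inpaint_with_feedback`, the five callables, `max_rounds`, `min_aesthetic`) is hypothetical, the agents are passed in as plain functions, and the acceptance rule (all three checks must pass) is an assumption, since the abstract does not specify how the analyzer's scores are combined.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Verdict:
    """Hypothetical analyzer output: the three criteria named in the abstract."""
    rational: bool      # is the image content plausible?
    aesthetic: float    # aesthetic score (scale is an assumption)
    relevant: bool      # does the background suit the foreground?

    def passed(self, min_aesthetic: float) -> bool:
        # Assumed rule: all three checks must hold.
        return self.rational and self.relevant and self.aesthetic >= min_aesthetic


@dataclass
class Outcome:
    image: Any
    prompt: str
    rounds: int
    accepted: bool


def inpaint_with_feedback(
    foreground: Any,
    describe: Callable[[Any], str],                        # VLM: image -> description
    propose_prompt: Callable[[str, Optional[str]], str],   # LLM: description, feedback -> prompt
    generate_template: Callable[[Any, str], Any],          # canny-to-image model
    refine: Callable[[Any, Any], Any],                     # blends foreground and template
    analyze: Callable[[Any, str], Verdict],                # VLM-based outcome analyzer
    max_rounds: int = 3,
    min_aesthetic: float = 0.5,
) -> Outcome:
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")

    description = describe(foreground)
    feedback: Optional[str] = None
    image, prompt = None, ""

    for round_idx in range(1, max_rounds + 1):
        # Prompt generation module (sees analyzer feedback after a rejection).
        prompt = propose_prompt(description, feedback)
        # Image generation module: template image, then refinement.
        template = generate_template(foreground, prompt)
        image = refine(foreground, template)
        # Outcome analyzer: accept, or trigger another round.
        verdict = analyze(image, prompt)
        if verdict.passed(min_aesthetic):
            return Outcome(image, prompt, round_idx, accepted=True)
        feedback = f"previous prompt {prompt!r} was rejected: {verdict}"

    # Round budget exhausted: return the last attempt, flagged as not accepted.
    return Outcome(image, prompt, max_rounds, accepted=False)
```

Passing the agents in as callables keeps the sketch independent of any particular VLM, LLM, or diffusion library; in practice each callable would wrap a model call.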
