Diffusion models have revitalized the image generation domain, playing crucial roles in both academic research and artistic expression. With the emergence of new diffusion models, assessing the performance of text-to-image (T2I) models has become increasingly important. Current metrics focus on directly matching the input text with the generated image, but due to cross-modal information asymmetry, this leads to unreliable or incomplete assessment results. Motivated by this, we introduce the Image Regeneration task in this study to assess text-to-image models by tasking the T2I model with generating an image according to a reference image. We use GPT4V to bridge the gap between the reference image and the text input for the T2I model, allowing T2I models to understand image content. This design simplifies evaluation, since comparing the generated image with the reference image is straightforward. Two regeneration datasets, one content-diverse and one style-diverse, are introduced to evaluate the leading diffusion models currently available. Additionally, we present the ImageRepainter framework, which enhances the quality of generated images by improving content comprehension through MLLM-guided iterative generation and revision. Our comprehensive experiments showcase the effectiveness of this framework in assessing the generative capabilities of models. By leveraging MLLMs, we demonstrate that a robust T2I model can produce images that more closely resemble the reference image.
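The MLLM-guided iterative generation-and-revision loop described above can be sketched as follows. This is a minimal illustration under assumed interfaces, not the paper's actual implementation: the `MLLM` and `T2IModel` protocols and the helper methods `describe`, `compare_and_revise`, and `generate` are hypothetical stand-ins for calls to GPT4V and to the T2I model under evaluation.

```python
# Illustrative sketch of the MLLM-guided generation-and-revision loop;
# the interfaces below are hypothetical, not the paper's actual API.
from typing import Protocol, Tuple


class MLLM(Protocol):
    def describe(self, image: bytes) -> str: ...
    def compare_and_revise(
        self, reference: bytes, candidate: bytes, prompt: str
    ) -> Tuple[float, str]: ...


class T2IModel(Protocol):
    def generate(self, prompt: str) -> bytes: ...


def regenerate(reference: bytes, t2i: T2IModel, mllm: MLLM,
               max_rounds: int = 3) -> Tuple[bytes, float]:
    """Regenerate `reference` with `t2i`, letting `mllm` bridge the
    image-to-text gap and critique each intermediate result."""
    # Ask the MLLM for a prompt capturing the reference's content and style.
    prompt = mllm.describe(reference)

    best_image, best_score = b"", float("-inf")
    for _ in range(max_rounds):
        candidate = t2i.generate(prompt)
        # The MLLM scores similarity to the reference and revises the prompt.
        score, prompt = mllm.compare_and_revise(reference, candidate, prompt)
        if score > best_score:
            best_image, best_score = candidate, score

    # The best similarity score can serve as the evaluation signal
    # for the T2I model under test.
    return best_image, best_score
```

In this sketch the comparison step plays both roles described in the abstract: it yields a score for assessing the T2I model and a revised prompt for the next generation round.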