This paper introduces X-IQE, a novel explainable image quality evaluation approach that leverages visual large language models (LLMs) to evaluate text-to-image generation methods by generating textual explanations. X-IQE uses a hierarchical Chain of Thought (CoT) to enable MiniGPT-4 to produce self-consistent, unbiased texts that are highly correlated with human evaluation. It offers several advantages: it can distinguish between real and generated images, evaluate text-image alignment, and assess image aesthetics, all without requiring model training or fine-tuning. X-IQE is more cost-effective and efficient than human evaluation, and it significantly enhances the transparency and explainability of deep image quality evaluation models. We validate the effectiveness of our method as a benchmark using images generated by prevalent diffusion models. X-IQE performs similarly to state-of-the-art (SOTA) evaluation methods on COCO Caption, while overcoming the limitations of previous evaluation models on DrawBench, particularly in handling ambiguous generation prompts and recognizing text in generated images. Project website: https://github.com/Schuture/Benchmarking-Awesome-Diffusion-Models
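To make the idea of a hierarchical CoT concrete, the following is a minimal sketch rather than the X-IQE implementation. It assumes the three tasks named in the abstract (real-versus-generated, text-image alignment, aesthetics) are run as successive stages, with each stage's answer passed forward as context. That chaining, the stage names, the instruction wording, and the `ask` callable (a stand-in for a vision-language model such as MiniGPT-4 with the image already bound) are all illustrative assumptions.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stage names and instructions. The abstract names the three
# tasks but not the prompts, so this wording is an assumption.
STAGES: List[Tuple[str, str]] = [
    ("fidelity", "Does the image look real or AI-generated? Explain, then score."),
    ("alignment", "Does the image match the generation prompt? Explain, then score."),
    ("aesthetics", "How aesthetically pleasing is the image? Explain, then score."),
]


def hierarchical_cot(ask: Callable[[str], str], prompt: str) -> Dict[str, str]:
    """Run the stages in order, feeding earlier answers into later queries."""
    context = ""
    answers: Dict[str, str] = {}
    for name, instruction in STAGES:
        query = f"{context}Generation prompt: {prompt}\nTask ({name}): {instruction}"
        answers[name] = ask(query)
        # Assumed chaining: later stages see the earlier analyses.
        context += f"Earlier {name} analysis: {answers[name]}\n"
    return answers


def stub_model(query: str) -> str:
    # Stand-in for a real model call; reports how many earlier analyses it saw.
    return f"analysis#{query.count('Earlier')}"


result = hierarchical_cot(stub_model, "a red cube")
```

With the stub, each stage's answer records how many earlier analyses were visible to it, which makes the forwarding of context easy to check before swapping in a real model.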