This paper introduces X-IQE, a novel explainable image quality evaluation approach that leverages visual large language models (LLMs) to evaluate text-to-image generation methods by generating textual explanations. X-IQE uses a hierarchical Chain of Thought (CoT) to enable MiniGPT-4 to produce self-consistent, unbiased texts that are highly correlated with human evaluation. Its advantages include the ability to distinguish between real and generated images, evaluate text-image alignment, and assess image aesthetics, all without requiring model training or fine-tuning. X-IQE is more cost-effective and efficient than human evaluation, and it significantly enhances the transparency and explainability of deep image quality evaluation models. We validate the effectiveness of our method as a benchmark using images generated by prevalent diffusion models. X-IQE performs comparably to state-of-the-art (SOTA) evaluation methods on COCO Caption, while overcoming the limitations of previous evaluation models on DrawBench, particularly in handling ambiguous generation prompts and recognizing text in generated images. Project website: https://github.com/Schuture/Benchmarking-Awesome-Diffusion-Models