Despite the impressive success of text-to-image (TTI) generation models, existing studies overlook whether these models accurately convey factual information. In this paper, we focus on the problem of image hallucination, where images created by generation models fail to faithfully depict factual content. To address this, we introduce I-HallA (Image Hallucination evaluation with Question Answering), a novel automated evaluation metric that measures the factuality of generated images through visual question answering (VQA). We also introduce I-HallA v1.0, a curated benchmark dataset for this purpose. As part of this process, we develop a pipeline that generates high-quality question-answer pairs using multiple GPT-4 Omni-based agents, validated by human judges to ensure accuracy. Our evaluation protocol measures image hallucination by testing whether these questions can be answered correctly from images produced by existing text-to-image models. The I-HallA v1.0 dataset comprises 1,200 diverse image-text pairs across nine categories, together with 1,000 rigorously curated questions covering various compositional challenges. We evaluate five text-to-image models using I-HallA and reveal that these state-of-the-art models often fail to accurately convey factual information. Moreover, we validate the reliability of our metric by demonstrating a strong Spearman correlation (rho=0.95) with human judgments. We believe our benchmark dataset and metric can serve as a foundation for developing factually accurate text-to-image generation models.
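The protocol above boils down to two quantities: a per-model factuality score (the fraction of VQA questions answered correctly from its generated images) and a Spearman rank correlation between metric scores and human judgments. A minimal pure-Python sketch of this kind of scoring is shown below; the function names `i_halla_score` and `spearman_rho` are illustrative placeholders, not APIs from the paper, and the tie-handling uses standard average ranks.

```python
def i_halla_score(predicted_answers, gold_answers):
    """Illustrative metric: fraction of VQA answers (obtained from a
    model's generated images) that match the curated gold answers."""
    assert len(predicted_answers) == len(gold_answers)
    correct = sum(p == g for p, g in zip(predicted_answers, gold_answers))
    return correct / len(gold_answers)

def _average_ranks(values):
    """1-based ranks, with tied values assigned their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on ranks.
    Used here to compare metric scores against human judgments."""
    rx, ry = _average_ranks(x), _average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

For example, perfectly agreeing rankings give `spearman_rho` of 1.0, and reversed rankings give -1.0; a reported rho of 0.95 indicates near-identical orderings of models by the metric and by humans.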