Commonsense reasoning, the ability to make logical assumptions about everyday scenes, is a core component of human intelligence. In this work, we present PAINTaboo, a novel task and dataset for evaluating the ability of text-to-image generative models to perform commonsense reasoning. Given a description of an object that contains few visual clues, the goal is to generate images that depict the object correctly. The dataset is carefully hand-curated and covers diverse object categories, allowing a comprehensive analysis of model performance. As anticipated, our investigation of several prevalent text-to-image generative models reveals that these models are not proficient in commonsense reasoning. We believe that PAINTaboo can improve our understanding of the reasoning abilities of text-to-image generative models.