Text-to-image generative models are capable of producing high-quality images that often faithfully depict concepts described using natural language. In this work, we comprehensively evaluate a range of text-to-image models on numerical reasoning tasks of varying difficulty, and show that even the most advanced models have only rudimentary numerical skills. Specifically, their ability to generate an exact number of objects in an image is limited to small numbers, depends heavily on the context in which the number term appears, and deteriorates quickly with each successive number. We also demonstrate that models have a poor understanding of linguistic quantifiers (such as "a few" or "as many as") and of the concept of zero, and that they struggle with more advanced concepts such as partial quantities and fractional representations. We bundle prompts, generated images, and human annotations into GeckoNum, a novel benchmark for the evaluation of numerical reasoning.