This paper explores the feasibility of using text-to-image models in a zero-shot setup to generate images for taxonomy concepts. While text-based methods for taxonomy enrichment are well established, the potential of the visual dimension remains unexplored. To address this gap, we propose a comprehensive benchmark for Taxonomy Image Generation that assesses models' abilities to understand taxonomy concepts and generate relevant, high-quality images. The benchmark includes common-sense and randomly sampled WordNet concepts, alongside LLM-generated predictions. We evaluate 12 models using 9 novel taxonomy-related text-to-image metrics and human feedback. Moreover, we pioneer the use of pairwise evaluation with GPT-4 feedback for image generation. Experimental results show that model rankings differ significantly from those on standard T2I tasks: Playground-v2 and FLUX consistently outperform other models across metrics and subsets, while the retrieval-based approach performs poorly. These findings highlight the potential for automating the curation of structured data resources.
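To make the pairwise GPT-4 evaluation concrete, the sketch below shows one plausible way such a judge could be implemented with the OpenAI Python SDK. This is an illustration under stated assumptions, not the paper's exact protocol: the model name ("gpt-4o"), the prompt wording, and the helper names are all assumptions introduced here.

```python
# Minimal sketch of pairwise image evaluation with GPT-4 vision feedback.
# Assumptions: OpenAI Python SDK (>= 1.0), a vision-capable model ("gpt-4o"),
# local PNG images; the prompt wording is illustrative only.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def encode_image(path: str) -> str:
    """Read a local image file and return it as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()


def pairwise_judge(concept: str, image_a: str, image_b: str) -> str:
    """Ask the model which of two generated images better depicts a taxonomy concept."""
    prompt = (
        f"Two images were generated for the taxonomy concept '{concept}'. "
        "Answer 'A' or 'B' for the image that depicts the concept more "
        "accurately, or 'tie' if they are equally good."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": encode_image(image_a)}},
                {"type": "image_url", "image_url": {"url": encode_image(image_b)}},
            ],
        }],
        max_tokens=5,
    )
    return response.choices[0].message.content.strip()


# Example usage (hypothetical file names):
# verdict = pairwise_judge("hammerhead shark", "playground_v2.png", "flux.png")
```

Aggregating such verdicts over many concept pairs would yield a preference-based ranking of the generators, analogous to the human pairwise comparisons the benchmark collects.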