Time alters the visual appearance of entities in our world, such as objects, places, and animals. Thus, knowledge and reasoning about time can be crucial for accurately generating contextually relevant images (e.g., a landscape in spring vs. in winter). Yet, although substantial work exists on understanding and improving temporal knowledge in natural language processing, research on how temporal phenomena appear and are handled in text-to-image (T2I) models remains scarce. We address this gap with TempViz, the first dataset to holistically evaluate temporal knowledge in image generation, consisting of 7.9k prompts and more than 600 reference images. Using TempViz, we study the capabilities of five T2I models across five temporal knowledge categories. Human evaluation shows that temporal competence is generally weak, with no model exceeding 75% accuracy across categories. Towards larger-scale studies, we also examine automated evaluation methods, comparing several established approaches against human judgments. However, none of these approaches provides a reliable assessment of temporal cues, further underscoring the pressing need for future research on temporal knowledge in T2I.