Recent developments in large language models (LLMs) and generative AI have enabled text-to-image generation systems to synthesize high-quality images that are faithful to a given reference text, known as a "prompt". These systems have quickly attracted considerable attention from researchers, creators, and everyday users. Despite extensive efforts to improve the generative models, there is limited work on understanding the information needs of the users of these systems at scale. We conduct the first comprehensive analysis of large-scale prompt logs collected from multiple text-to-image generation systems. Our work is analogous to analyzing the query logs of Web search engines, a line of work that has made critical contributions to the success of the Web search industry and research. Compared with Web search queries, text-to-image prompts are significantly longer, are often organized into special structures that consist of the subject, form, and intent of the generation task, and present unique categories of information needs. Users also make more edits within creation sessions, and these sessions exhibit pronounced exploratory patterns. There is also a considerable gap between user-input prompts and the captions of the images included in the open training data of the generative models. Our findings provide concrete implications for improving text-to-image generation systems for creation purposes.