Recent advances in diffusion models have enabled text-to-image generation systems to synthesize high-quality images that are faithful to a given reference text, known as a "prompt." Once released to the public, these systems immediately attracted wide attention from researchers, creators, and everyday users. Despite considerable effort to improve the underlying generative models, there is limited work on understanding the information needs of the real users of these systems, e.g., by investigating at scale the prompts that users input. In this paper, we take the initiative to conduct a comprehensive analysis of large-scale prompt logs collected from multiple text-to-image generation systems. Our work is analogous to analyzing the query logs of Web search engines, a line of work that has made critical contributions to the success of the Web search industry and research. We analyze over two million user-input prompts submitted to three popular text-to-image systems. Compared to Web search queries, text-to-image prompts are significantly longer, are often organized into unique structures, and present different categories of information needs. Users tend to make more edits within creation sessions, showing remarkable exploratory patterns. Our findings provide concrete implications for how to improve text-to-image generation systems for creation purposes.