As generative AI becomes more prevalent, it is important to study how human users interact with such models. In this work, we investigate how people use text-to-image models to generate desired target images. To study this interaction, we created ArtWhisperer, an online game where users are given a target image and are tasked with iteratively finding a prompt that generates an image similar to the target. Through this game, we recorded over 50,000 human-AI interactions; each interaction corresponds to one text prompt created by a user and the corresponding generated image. The majority of these are repeated interactions where a user iterates to find the best prompt for their target image, making this a unique sequential dataset for studying human-AI collaborations. In an initial analysis of this dataset, we identify several characteristics of prompt interactions and user strategies. People submit diverse prompts and are able to discover a variety of text descriptions that generate similar images. Interestingly, prompt diversity does not decrease as users find better prompts. We further propose a new metric to study the steerability of AI using our dataset. We define steerability as the expected number of interactions required to adequately complete a task. We estimate this value by fitting a Markov chain for each target task and calculating the expected time to reach an adequate score in the Markov chain. We quantify and compare AI steerability across different types of target images and two different models, finding that images of cities and of the natural world are more steerable than artistic and fantasy images. These findings provide insights into human-AI interaction behavior, present a concrete method of assessing AI steerability, and demonstrate the general utility of the ArtWhisperer dataset.
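The steerability estimate described above can be illustrated with a minimal sketch. The abstract does not specify how scores are discretized into states or how the chain is fitted, so the following assumes (hypothetically) that each user's score sequence has already been mapped to integer state indices, that transitions are estimated by counting with additive smoothing, and that one designated state represents an "adequate" score. The function names `fit_transition_matrix` and `expected_hitting_times` are illustrative and are not taken from the paper.

```python
import numpy as np


def fit_transition_matrix(trajectories, n_states, smoothing=1.0):
    """Estimate a row-stochastic transition matrix from state sequences.

    Assumption: `trajectories` is a list of integer state sequences, one per
    user session. `smoothing` is an additive pseudo-count (an illustrative
    choice) that keeps every row well defined even for unvisited states.
    """
    counts = np.full((n_states, n_states), smoothing, dtype=float)
    for traj in trajectories:
        for a, b in zip(traj[:-1], traj[1:]):
            counts[a, b] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)


def expected_hitting_times(P, target):
    """Expected number of transitions to first reach `target` from each state.

    Solves (I - Q) t = 1, where Q is P restricted to the non-target states.
    The entry for `target` itself is 0.
    """
    n = P.shape[0]
    others = [s for s in range(n) if s != target]
    Q = P[np.ix_(others, others)]
    t = np.linalg.solve(np.eye(len(others)) - Q, np.ones(len(others)))
    times = np.zeros(n)
    times[others] = t
    return times


if __name__ == "__main__":
    # Toy data: state 0 = low score, 1 = medium, 2 = adequate (hypothetical bins).
    sessions = [[0, 0, 1, 2], [0, 1, 1, 2], [1, 0, 1, 2]]
    P = fit_transition_matrix(sessions, n_states=3)
    print(expected_hitting_times(P, target=2))
```

Under these assumptions, a lower expected hitting time from the starting state means fewer prompts are needed on average to reach an adequate score, which corresponds to a more steerable task. Note that the value counts transitions after the starting state, so the first prompt would need to be added separately if it is counted as an interaction.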