Generating images with a text-to-image model often requires multiple trials, in which human users iteratively update their prompt based on feedback, namely the output image. Taking inspiration from cognitive work on reference games and dialogue alignment, this paper analyzes the dynamics of user prompts along such iterations. We compile a dataset of iterative interactions of human users with Midjourney. Our analysis reveals that prompts predictably converge toward specific traits along these iterations. We further study whether this convergence is due to human users realizing they missed important details, or due to adaptation to the model's ``preferences'', that is, the model producing better images for a specific language style. We show initial evidence that both possibilities are at play. The possibility that users adapt to the model's preferences raises concerns about reusing user data for further training: the prompts may be biased toward the preferences of a specific model rather than aligned with human intentions and natural manner of expression.