Proactive Agents for Multi-Turn Text-to-Image Generation Under Uncertainty

User prompts for generative AI models are often underspecified, leading to sub-optimal responses. This problem is particularly evident in text-to-image (T2I) generation, where users commonly struggle to articulate their precise intent. The resulting gap between the user's vision and the model's interpretation often forces users to refine their prompts repeatedly and painstakingly. To address this, we propose a design for proactive T2I agents equipped with an interface to (1) actively ask clarification questions when uncertain, and (2) present their understanding of user intent as an interpretable belief graph that the user can edit. We build simple prototypes of such agents and verify their effectiveness through both human studies and automated evaluation. We observed that at least 90% of human subjects found these agents and their belief graphs helpful for their T2I workflow. Moreover, we develop a scalable automated evaluation approach using two agents: one holds a ground-truth image, and the other tries to align with that ground truth while asking as few questions as possible. On DesignBench, a benchmark we created for artists and designers, the COCO dataset (Lin et al., 2014), and ImageInWords (Garg et al., 2024), we observed that these T2I agents were able to ask informative questions and elicit crucial information, achieving successful alignment with a VQAScore (Lin et al., 2024) at least 2 times higher than that of standard single-turn T2I generation. Demo: https://github.com/google-deepmind/proactive_t2i_agents.
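The two-agent evaluation described above can be illustrated with a toy sketch. This is not the authors' implementation: the names `BeliefGraph`, `SimulatedUser`, and `run_dialogue` are hypothetical, the belief graph is reduced to a flat attribute dictionary, the ground-truth image is stood in for by a dictionary of attributes, and image generation and VQAScore computation are omitted entirely. The sketch only shows the loop in which one agent asks about what it is uncertain of and the other answers from the ground truth, under a fixed question budget.

```python
from dataclasses import dataclass, field


@dataclass
class BeliefGraph:
    """Toy stand-in (assumption): attribute -> value, or None when uncertain."""
    attributes: dict = field(default_factory=dict)

    def uncertain(self):
        # Attributes the agent still has no value for, in insertion order.
        return [k for k, v in self.attributes.items() if v is None]

    def to_prompt(self, base_prompt):
        # Append every known attribute to the original prompt.
        known = [f"{k}: {v}" for k, v in self.attributes.items() if v is not None]
        return base_prompt if not known else f"{base_prompt} ({', '.join(known)})"


class SimulatedUser:
    """Hypothetical answering agent that holds the ground truth."""

    def __init__(self, ground_truth):
        self.ground_truth = ground_truth

    def answer(self, attribute):
        return self.ground_truth.get(attribute, "unspecified")


def run_dialogue(base_prompt, belief, user, max_turns):
    """Ask about one uncertain attribute per turn until the budget runs out."""
    turns = 0
    while turns < max_turns and belief.uncertain():
        attribute = belief.uncertain()[0]
        belief.attributes[attribute] = user.answer(attribute)
        turns += 1
    return belief.to_prompt(base_prompt), turns


user = SimulatedUser({"animal": "corgi", "setting": "beach", "style": "watercolor"})
belief = BeliefGraph({"animal": None, "setting": None, "style": None})
final_prompt, turns_used = run_dialogue("a dog", belief, user, max_turns=2)
print(final_prompt)  # a dog (animal: corgi, setting: beach)
print(turns_used)    # 2
```

In a fuller setup, the refined prompt would be passed to a T2I model and the generated image scored against the ground truth; picking the first uncertain attribute is a placeholder for whatever question-selection strategy an agent actually uses.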
