The practical use of text-to-image generation has evolved from simple, monolithic models to complex workflows that combine multiple specialized components. While workflow-based approaches can lead to improved image quality, crafting effective workflows requires significant expertise, owing to the large number of available components, their complex inter-dependence, and their dependence on the generation prompt. Here, we introduce the novel task of prompt-adaptive workflow generation, where the goal is to automatically tailor a workflow to each user prompt. We propose two LLM-based approaches to tackle this task: a tuning-based method that learns from user-preference data, and a training-free method that uses the LLM to select existing flows. Both approaches lead to improved image quality when compared to monolithic models or generic, prompt-independent workflows. Our work shows that prompt-dependent flow prediction offers a new pathway to improving text-to-image generation quality, complementing existing research directions in the field.