This paper presents instruct-imagen, a model that tackles heterogeneous image generation tasks and generalizes to unseen tasks. We introduce *multi-modal instruction* for image generation, a task representation that articulates a range of generation intents with precision. It uses natural language to combine disparate modalities (e.g., text, edge, style, subject), so that a wide variety of generation intents can be expressed in a uniform format. We then build instruct-imagen by fine-tuning a pre-trained text-to-image diffusion model with a two-stage framework. First, we adapt the model using retrieval-augmented training to enhance its ability to ground generation on external multimodal context. Subsequently, we fine-tune the adapted model on diverse image generation tasks that require vision-language understanding (e.g., subject-driven generation), each paired with a multi-modal instruction encapsulating the task's essence. Human evaluation on various image generation datasets reveals that instruct-imagen matches or surpasses prior task-specific models in-domain and demonstrates promising generalization to unseen and more complex tasks.
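To make the idea of a multi-modal instruction concrete, the following is a minimal sketch of how such a representation could be structured: a natural-language instruction that refers to a list of non-text contexts by index. Everything here is an illustrative assumption rather than the format used by instruct-imagen: the `[ref#N]` marker syntax, the `Context` and `MultiModalInstruction` names, and the use of file names as stand-ins for image data are all hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Hypothetical marker syntax: "[ref#1]" points at the first context, and so on.
_REF_PATTERN = re.compile(r"\[ref#(\d+)\]")


@dataclass
class Context:
    """One non-text condition (assumed fields, for illustration only)."""
    modality: str  # e.g. "edge", "style", "subject"
    payload: str   # stand-in for image data; here just a file name


@dataclass
class MultiModalInstruction:
    """Natural-language instruction plus the contexts it refers to."""
    text: str
    contexts: List[Context] = field(default_factory=list)

    def referenced_indices(self) -> List[int]:
        """Return the 1-based context indices mentioned in the text, in order."""
        return [int(n) for n in _REF_PATTERN.findall(self.text)]

    def validate(self) -> None:
        """Raise ValueError if the text refers to a context that is missing."""
        for i in self.referenced_indices():
            if not 1 <= i <= len(self.contexts):
                raise ValueError(f"[ref#{i}] has no matching context")

    def render(self) -> str:
        """Replace each marker with a readable tag, for inspection only."""
        def _substitute(match):
            ctx = self.contexts[int(match.group(1)) - 1]
            return f"<{ctx.modality}:{ctx.payload}>"
        self.validate()
        return _REF_PATTERN.sub(_substitute, self.text)


if __name__ == "__main__":
    instruction = MultiModalInstruction(
        text="Generate an image of [ref#1] in the style of [ref#2].",
        contexts=[Context("subject", "dog.png"), Context("style", "sketch.png")],
    )
    print(instruction.render())
```

Under these assumptions, a subject-driven task and a style-transfer task differ only in the instruction text and the attached contexts, which is the sense in which different generation intents share one uniform format.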