The ability to provide fine-grained control for generating and editing visual imagery has profound implications for computer vision and its applications. Previous works have explored extending controllability in two directions: instruction tuning with text-based prompts and multi-modal conditioning. However, these works make one or more unnatural assumptions about the number and/or type of modality inputs used to express controllability. We propose InstructAny2Pix, a flexible multi-modal instruction-following system that enables users to edit an input image using instructions involving audio, images, and text. InstructAny2Pix consists of three building blocks that facilitate this capability: a multi-modal encoder that encodes different modalities, such as images and audio, into a unified latent space; a diffusion model that learns to decode representations in this latent space into images; and a multi-modal LLM that can understand instructions involving multiple images and audio pieces and generate a conditional embedding of the desired output, which the diffusion decoder can use. To improve training efficiency and generation quality, we also include a refinement prior module that enhances the visual quality of LLM outputs. These designs are critical to the performance of our system. We demonstrate that our system can perform a series of novel instruction-guided editing tasks. The code is available at https://github.com/jacklishufan/InstructAny2Pix.git
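The building blocks described in the abstract (encoder, multi-modal LLM, refinement prior, diffusion decoder) can be read as a pipeline. The following Python sketch illustrates that data flow only. Every name in it (`Item`, `encode`, `llm_condition`, `refine`, `decode`, `edit`), the latent size `DIM`, and the hash-based and averaging stand-ins are hypothetical placeholders chosen for illustration; none of them is taken from the released code, and no learned model is involved.

```python
# Toy sketch of the data flow only. All names and computations here are
# hypothetical stand-ins, not the InstructAny2Pix implementation.
from __future__ import annotations

import hashlib
import math
from dataclasses import dataclass

DIM = 8  # assumed size of the shared latent space


@dataclass
class Item:
    modality: str   # "image", "audio", or "text"
    payload: bytes  # stands in for pixels, a waveform, or tokens


def encode(item: Item) -> list[float]:
    """Stand-in for the multi-modal encoder: one latent space for all modalities."""
    digest = hashlib.sha256(item.modality.encode() + b":" + item.payload).digest()
    return [b / 255.0 - 0.5 for b in digest[:DIM]]


def llm_condition(instruction: str, refs: list[list[float]]) -> list[float]:
    """Stand-in for the multi-modal LLM: emit one conditional embedding.

    A real model would reason over the instruction and the referenced items;
    this toy version averages the references and adds a text-derived offset.
    """
    text = encode(Item("text", instruction.encode()))
    mean = [sum(col) / len(refs) for col in zip(*refs)]
    return [m + 0.1 * t for m, t in zip(mean, text)]


def refine(embedding: list[float]) -> list[float]:
    """Stand-in for the refinement prior: rescale the embedding to unit length."""
    norm = math.sqrt(sum(x * x for x in embedding)) or 1.0
    return [x / norm for x in embedding]


def decode(embedding: list[float], size: int = 4) -> list[list[float]]:
    """Stand-in for the diffusion decoder: embedding in, size x size grid out."""
    n = len(embedding)
    return [[embedding[(r * size + c) % n] for c in range(size)] for r in range(size)]


def edit(instruction: str, inputs: list[Item]) -> list[list[float]]:
    """Encode inputs, predict a conditional embedding, refine it, then decode."""
    if not inputs:
        raise ValueError("at least one image or audio input is required")
    refs = [encode(item) for item in inputs]
    return decode(refine(llm_condition(instruction, refs)))


if __name__ == "__main__":
    image = Item("image", b"photo-bytes")
    audio = Item("audio", b"rain-bytes")
    result = edit("make the image match the mood of the audio", [image, audio])
    print(len(result), len(result[0]))
```

The point of the sketch is the ordering of the stages: inputs of any modality enter the same latent space, the LLM stage produces a single embedding for the desired output, and the decoder consumes that embedding only after the refinement step.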