Subject-driven image generation has made significant strides recently. However, current methods still fall short in diverse application scenarios, as they require test-time tuning and cannot accept interleaved multi-image and text input. These limitations keep them far from the ultimate goal of "image as a foreign language in image generation." This paper presents Kosmos-G, a model that leverages the advanced multimodal perception capabilities of Multimodal Large Language Models (MLLMs) to address these limitations. Our approach aligns the output space of the MLLM with CLIP, using the textual modality as an anchor, and performs compositional instruction tuning on curated data. Kosmos-G demonstrates zero-shot subject-driven generation with interleaved multi-image and text input. Notably, the score distillation instruction tuning requires no modifications to the image decoder. This allows Kosmos-G to substitute seamlessly for CLIP and to integrate with a wide range of U-Net techniques, from fine-grained controls to personalized image decoder variants. We posit Kosmos-G as an initial attempt towards the goal of "image as a foreign language in image generation." The code can be found at https://aka.ms/Kosmos-G
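The text-anchored alignment mentioned in the abstract can be illustrated with a toy sketch. The abstract does not state the loss or the architecture, so everything here is an assumption made for illustration: the mean squared error objective, the single linear map standing in for the trainable aligner, and the random vectors standing in for MLLM outputs and frozen CLIP text embeddings of the same captions. All names (`matvec`, `alignment_loss`, `gradient_step`, `mllm_feats`, `clip_targets`) are hypothetical and are not taken from the Kosmos-G code.

```python
import random

random.seed(0)

def matvec(W, x):
    """Apply a linear map W (a list of rows) to a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mse(pred, target):
    """Mean squared error between two vectors of equal length."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def alignment_loss(W, pairs):
    """Average MSE between projected source features and frozen targets."""
    return sum(mse(matvec(W, s), t) for s, t in pairs) / len(pairs)

def gradient_step(W, pairs, lr):
    """One full-batch gradient descent step on alignment_loss w.r.t. W."""
    n, d_out, d_in = len(pairs), len(W), len(W[0])
    grad = [[0.0] * d_in for _ in range(d_out)]
    for s, t in pairs:
        pred = matvec(W, s)
        for i in range(d_out):
            err = pred[i] - t[i]
            for j in range(d_in):
                grad[i][j] += 2.0 * err * s[j] / (n * d_out)
    return [[W[i][j] - lr * grad[i][j] for j in range(d_in)]
            for i in range(d_out)]

# Toy stand-ins (assumptions): 16 captions, 4-dim "MLLM" features,
# 3-dim "CLIP text" embeddings produced by a hidden fixed map.
D_IN, D_OUT, N = 4, 3, 16
hidden_map = [[random.uniform(-1, 1) for _ in range(D_IN)] for _ in range(D_OUT)]
mllm_feats = [[random.uniform(-1, 1) for _ in range(D_IN)] for _ in range(N)]
clip_targets = [matvec(hidden_map, s) for s in mllm_feats]  # frozen targets
pairs = list(zip(mllm_feats, clip_targets))

# Only the aligner W is trained; the target side is never updated.
W = [[0.0] * D_IN for _ in range(D_OUT)]
initial_loss = alignment_loss(W, pairs)
for _ in range(300):
    W = gradient_step(W, pairs, lr=0.1)
final_loss = alignment_loss(W, pairs)

print("alignment loss before:", initial_loss)
print("alignment loss after: ", final_loss)
```

In this sketch the targets stay fixed while only the aligner changes, which mirrors the claim that the image decoder needs no modification: a decoder that already consumes CLIP-space embeddings could, under these assumptions, consume the aligned outputs instead.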