Subject-driven image generation has made significant strides in recent years. However, current methods still fall short in diverse application scenarios: they require test-time tuning and cannot accept interleaved multi-image and text input. These limitations keep them far from the ultimate goal of "image as a foreign language in image generation." This paper presents Kosmos-G, a model that leverages the advanced multimodal perception capabilities of Multimodal Large Language Models (MLLMs) to address these limitations. Our approach aligns the output space of the MLLM with CLIP, using the textual modality as an anchor, and then performs compositional instruction tuning on curated data. Kosmos-G demonstrates an impressive capability for zero-shot subject-driven generation with interleaved multi-image and text input. Notably, the score distillation instruction tuning requires no modifications to the image decoder. As a result, Kosmos-G can seamlessly substitute for CLIP and integrates effortlessly with a myriad of U-Net techniques, ranging from fine-grained controls to personalized image decoder variants. We posit Kosmos-G as an initial attempt towards the goal of "image as a foreign language in image generation." The code can be found at https://aka.ms/Kosmos-G
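To make the alignment step concrete, the following is a minimal sketch of aligning one embedding space to another using paired text as the anchor. It is an illustration only, not the Kosmos-G implementation: the abstract does not specify the aligner, so a plain least-squares linear map is swapped in here, and the embeddings are synthetic stand-ins rather than real MLLM or CLIP outputs. All names (`fit_linear_aligner`, `mllm_out`, `clip_text`) and the dimensions are hypothetical.

```python
import numpy as np


def fit_linear_aligner(source_emb, target_emb):
    """Return W minimizing ||source_emb @ W - target_emb||^2 (least squares)."""
    W, *_ = np.linalg.lstsq(source_emb, target_emb, rcond=None)
    return W


def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((a - b) ** 2))


rng = np.random.default_rng(0)
n_captions, d_mllm, d_clip = 256, 32, 16  # assumed toy sizes

# Synthetic stand-ins: both "encoders" embed the SAME captions, so text
# acts as the anchor that pairs rows of the two embedding matrices.
mllm_out = rng.normal(size=(n_captions, d_mllm))
hidden_map = rng.normal(size=(d_mllm, d_clip))
clip_text = mllm_out @ hidden_map + 0.01 * rng.normal(size=(n_captions, d_clip))

# Fit the aligner on text-only pairs, then compare against a zero baseline.
W = fit_linear_aligner(mllm_out, clip_text)
loss_before = mse(np.zeros_like(clip_text), clip_text)
loss_after = mse(mllm_out @ W, clip_text)
print(f"MSE before alignment: {loss_before:.4f}, after: {loss_after:.6f}")
```

The point of the sketch is the pairing: because both spaces are fit on the same captions, a decoder that already consumes the target space could in principle consume the aligned outputs without being retrained, which is the property the abstract attributes to leaving the image decoder unmodified.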