Text-to-image (T2I) and vision-language-to-image (VL2I) generation have advanced significantly in recent years. However, generation from generalized vision-language inputs, especially those involving multiple images, remains under-explored. This paper presents Kosmos-G, a model that leverages the advanced perception capabilities of Multimodal Large Language Models (MLLMs) to address this challenge. Our approach aligns the output space of the MLLM with CLIP, using the textual modality as an anchor, and performs compositional instruction tuning on curated data. Kosmos-G demonstrates a unique capability for zero-shot multi-entity subject-driven generation. Notably, the score distillation instruction tuning requires no modifications to the image decoder. This allows Kosmos-G to substitute seamlessly for CLIP and to integrate readily with a wide range of U-Net techniques, from fine-grained controls to personalized image decoder variants. We posit Kosmos-G as an initial attempt towards the goal of "image as a foreign language in image generation."