In this paper, we propose UniLIP, a unified framework that adapts CLIP for multimodal understanding, generation, and editing. Although CLIP excels at understanding, it lacks the reconstruction ability required of a unified visual encoder, and previous CLIP-based unified methods fail to balance understanding and reconstruction, suffering from semantic degradation or inconsistent reconstructions. In contrast, we introduce a novel two-stage training scheme with a self-distillation strategy that progressively endows CLIP with high-fidelity reconstruction ability while preserving its original comprehension performance. To enhance reasoning and consistency in generation and editing, we further develop a dual-condition architecture built upon the MetaQuery framework: it jointly exploits multimodal hidden states for rich contextual detail and learnable query embeddings that harness the strong reasoning abilities of Multimodal Large Language Models (MLLMs). Leveraging this image representation and architectural design, UniLIP demonstrates superior instruction following and edit fidelity. With only 1B and 3B parameters, UniLIP outperforms larger unified models such as BAGEL (7B) and Uniworld-V1 (12B), achieving state-of-the-art scores of 0.90 on GenEval, 0.63 on WISE, and 3.94 on ImgEdit. These results show that UniLIP successfully broadens the applicability of CLIP: its continuous features not only remain the optimal choice for understanding tasks but also achieve highly competitive performance in generation and editing. Code and models are available at https://github.com/nnnth/UniLIP.
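The dual-condition idea described above can be sketched in a few lines: the decoder is conditioned on the concatenation of the MLLM's multimodal hidden states (rich contextual detail) and a set of learnable query embeddings (a reasoning summary). This is a minimal illustrative sketch, not the paper's implementation; all names and dimensions (`build_dual_condition`, `HIDDEN_DIM`, `NUM_QUERIES`) are hypothetical.

```python
import numpy as np

# Hypothetical dimensions for illustration only -- not from the paper.
HIDDEN_DIM = 64   # shared width of hidden states and query embeddings
NUM_QUERIES = 8   # number of learnable query tokens

def build_dual_condition(mm_hidden: np.ndarray, query_embeddings: np.ndarray) -> np.ndarray:
    """Concatenate the MLLM's multimodal hidden states with learnable
    query embeddings along the token axis, forming a single conditioning
    sequence for the generation/editing decoder (dual-condition sketch)."""
    assert mm_hidden.shape[-1] == query_embeddings.shape[-1], "widths must match"
    return np.concatenate([mm_hidden, query_embeddings], axis=0)

# Toy usage: 20 multimodal tokens plus 8 learnable queries.
mm_hidden = np.random.randn(20, HIDDEN_DIM)
queries = np.random.randn(NUM_QUERIES, HIDDEN_DIM)
cond = build_dual_condition(mm_hidden, queries)
print(cond.shape)  # (28, 64)
```

In this toy form the two condition streams are simply concatenated along the token axis; the decoder's attention can then draw fine-grained detail from the hidden-state tokens and high-level intent from the query tokens.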