UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning

Recent advances in vision-language pre-training have improved machine performance in multimodal object discrimination (e.g., image-text semantic alignment) and image synthesis (e.g., text-to-image generation). In addition, fine-tuning pre-trained discriminative or generative models such as CLIP and Stable Diffusion on domain-specific datasets has been shown to be effective across a variety of tasks. However, few studies have explored learning discriminative and generative capabilities jointly during fine-tuning and leveraging their synergy to build a powerful, personalized multimodal model. This paper presents UniDiff, a unified multimodal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC). By applying RSC to visual features from CLIP and the diffusion model, UniDiff learns aligned semantics and mitigates semantic collapse when fine-tuning on small datasets, without altering the basic architecture of the pre-trained models. UniDiff is versatile in both multimodal understanding and generation tasks. Experimental results on three datasets (Fashion-man, Fashion-woman, and E-commercial Product) show substantial improvements in vision-language retrieval and text-to-image generation, illustrating the advantages of combining discriminative and generative fine-tuning. The proposed UniDiff model establishes a robust pipeline for personalized modeling and serves as a benchmark for future comparisons in the field.
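
To make the ITC component concrete, the following is a minimal sketch of a symmetric image-text contrastive (InfoNCE-style) loss, the form commonly used with CLIP-like models. It is not taken from the paper: the function names, the temperature of 0.07, and the weighted-sum `combined_loss` are illustrative assumptions, and the abstract does not specify how RSC is defined or how the three objectives are weighted.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_embs[k] and text_embs[k] are assumed to be a matching pair;
    every other combination in the batch is treated as a negative.
    The temperature value is an assumption, not taken from the paper.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # Cosine-similarity logits: rows index images, columns index texts.
    logits = [
        [sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
        for i in imgs
    ]
    n = len(logits)
    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: the same, over columns.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)


def combined_loss(l_itc, l_is, l_rsc, w_is=1.0, w_rsc=1.0):
    """Hypothetical weighted sum of the three objectives (an assumption)."""
    return l_itc + w_is * l_is + w_rsc * l_rsc


# Toy batch: two orthogonal image embeddings and their matching texts.
aligned = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# Same batch with the text order swapped, so every pair is mismatched.
mismatched = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch, `aligned` comes out far smaller than `mismatched`, which is the behavior a contrastive objective is meant to reward. In a real fine-tuning setup, `l_is` would be the diffusion model's training loss and `l_rsc` whatever consistency term the method defines; both are left as plain scalars here.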
