Generative models can now synthesize highly realistic images, potentially providing an abundant data source for training machine learning models. Despite the appeal of such synthetic data, indiscriminately treating generated images as real ones during training can cause mode collapse due to modality discrepancies between the real and synthetic domains. In this paper, we propose GMAIL, a novel framework for the discriminative use of generated images that explicitly treats them as a modality separate from real images. Instead of indiscriminately substituting generated images for real ones in pixel space, our approach bridges the two distinct modalities in a shared latent space through multi-modal learning. Specifically, we first fine-tune a model exclusively on generated images using a cross-modality alignment loss, and then employ this aligned model to further train various vision-language models on generated images. By aligning the two modalities, our approach effectively leverages recent advances in generative models, boosting the effectiveness of learning from generated images across a range of vision-language tasks. The framework integrates easily with various vision-language models, and we demonstrate its efficacy through extensive experiments. For example, it significantly improves performance on image captioning, zero-shot image retrieval, zero-shot image classification, and long caption retrieval. It also shows positive scaling trends with more generated data and notable gains in the captioning performance of the large multimodal model LLaVA.
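The abstract names a cross-modality alignment loss but does not specify its form. As a purely illustrative sketch (the function name, shapes, and the choice of a symmetric InfoNCE-style objective are assumptions, not the paper's definition), aligning paired real and generated embeddings in a shared latent space might look like:

```python
import numpy as np

def alignment_loss(real_feats, gen_feats, temperature=0.07):
    """Hypothetical symmetric InfoNCE-style cross-modality alignment loss.

    real_feats, gen_feats: (N, D) arrays of paired embeddings, where row i
    of each array encodes the same underlying content in the two modalities.
    """
    # L2-normalize so inner products become cosine similarities.
    r = real_feats / np.linalg.norm(real_feats, axis=1, keepdims=True)
    g = gen_feats / np.linalg.norm(gen_feats, axis=1, keepdims=True)
    logits = r @ g.T / temperature       # (N, N) similarity matrix
    labels = np.arange(len(r))           # matching pairs sit on the diagonal

    def ce(lg):
        # Cross-entropy that pushes each row's diagonal entry to dominate.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the real->generated and generated->real directions.
    return 0.5 * (ce(logits) + ce(logits.T))

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
# Perfectly aligned pairs should yield a much lower loss than random pairs.
low = alignment_loss(feats, feats)
high = alignment_loss(feats, rng.normal(size=(8, 16)))
```

Such a loss pulls each generated image's embedding toward its real counterpart while pushing it away from non-matching samples, which is one standard way to bridge two modalities in a shared latent space.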