Vision-Language Models (VLMs) have shown strong performance on zero-shot image classification tasks. However, existing methods, including Contrastive Language-Image Pre-training (CLIP), all rely on annotated text-image pairs to align the visual and textual modalities. This dependency introduces substantial cost and accuracy requirements when preparing high-quality datasets. Moreover, processing data from two modalities requires dual-tower encoders in most models, which further hinders their lightweight deployment. To address these limitations, we introduce a ``Contrastive Language-Image Pre-training via Large-Language-Model-based Generation (LGCLIP)'' framework. LGCLIP leverages a Large Language Model (LLM) to generate class-specific prompts that guide a diffusion model in synthesizing reference images. These generated images then serve as visual prototypes: the visual features of real images are extracted and compared with those of the prototypes to produce predictions. By optimizing prompt generation through the LLM and employing only a visual encoder, LGCLIP remains lightweight and efficient. Crucially, our framework requires only class labels as input throughout the entire pipeline, eliminating the need for manually annotated image-text pairs and extra pre-processing. Experimental results validate the feasibility and efficiency of LGCLIP, demonstrating strong performance on zero-shot classification tasks and establishing a novel paradigm for classification.
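The prototype-comparison step described above can be sketched as follows. This is a minimal illustration, assuming features have already been extracted by the visual encoder from the LLM-prompted, diffusion-generated reference images; all function names and the cosine-similarity choice are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Unit-normalize feature vectors so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def build_prototypes(features_per_class):
    """Form one visual prototype per class by averaging the (normalized)
    features of that class's generated reference images.
    features_per_class: list of (n_i, d) arrays, one per class."""
    return l2_normalize(np.stack([
        l2_normalize(f).mean(axis=0) for f in features_per_class
    ]))

def classify(query_features, prototypes):
    """Predict by comparing real-image features with class prototypes:
    pick the class whose prototype has the highest cosine similarity."""
    sims = l2_normalize(query_features) @ prototypes.T  # (n_queries, n_classes)
    return sims.argmax(axis=-1)

# Hypothetical usage with synthetic features standing in for encoder outputs.
rng = np.random.default_rng(0)
dim = 8
class_dirs = np.eye(dim)[:2]  # two well-separated class directions
ref_feats = [c + 0.05 * rng.normal(size=(5, dim)) for c in class_dirs]
protos = build_prototypes(ref_feats)
queries = class_dirs + 0.05 * rng.normal(size=(2, dim))
preds = classify(queries, protos)
```

Averaging several generated images per class before comparison is one plausible way to smooth out artifacts of any single synthesized image; a single-encoder pipeline like this avoids the text tower entirely at inference time.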