Contrastive Language-Image Pre-training (CLIP) provides a foundation model by integrating natural language into visual concepts, enabling zero-shot recognition on downstream tasks. It is usually expected that satisfactory overall accuracy can be achieved across numerous domains through well-designed textual prompts. However, we find that the performance of CLIP models on their worst-performing categories is significantly inferior to their overall performance. For example, on ImageNet, 10 categories have a class-wise accuracy of 0\%, even though the overall accuracy reaches 64.1\%. This phenomenon reveals the potential risks of using CLIP models, particularly in risk-sensitive applications where specific categories hold significant importance. To address this issue, we investigate the alignment between the two modalities in the CLIP model and propose the Class-wise Matching Margin (\cmm) to measure the inference confusion. \cmm\ can effectively identify the worst-performing categories and estimate the potential performance of candidate prompts. We further query large language models to enrich the descriptions of the worst-performing categories and build a weighted ensemble to highlight the efficient prompts. Experimental results verify the effectiveness of our proposal: the accuracy on the worst-10 categories of ImageNet is boosted to 5.2\%, without manual prompt engineering, laborious optimization, or access to labeled validation data.
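The gap between overall accuracy and worst-category accuracy described above can be made concrete with a small sketch. The code below is an illustration only, not the paper's evaluation code: the function names are hypothetical, the labels and predictions are toy data, and it assumes that "worst-k accuracy" means the mean of the k lowest class-wise accuracies, which the abstract does not spell out. It does not implement \cmm, whose definition is not given here.

```python
from collections import defaultdict


def overall_accuracy(labels, preds):
    """Fraction of samples whose prediction matches the label."""
    return sum(y == p for y, p in zip(labels, preds)) / len(labels)


def classwise_accuracy(labels, preds):
    """Return {class: accuracy} for every class that appears in labels."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for y, p in zip(labels, preds):
        total[y] += 1
        if y == p:
            correct[y] += 1
    return {c: correct[c] / total[c] for c in total}


def worst_k_accuracy(labels, preds, k):
    """Mean of the k lowest class-wise accuracies.

    Assumption: this is how "accuracy on the worst-k categories" is read here.
    """
    accs = sorted(classwise_accuracy(labels, preds).values())
    if not 1 <= k <= len(accs):
        raise ValueError("k must be between 1 and the number of classes")
    return sum(accs[:k]) / k


# Toy data (not from the paper): classes 0 and 1 are always right,
# class 2 is always confused with another class.
labels = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
preds = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]

print(overall_accuracy(labels, preds))        # high overall accuracy
print(classwise_accuracy(labels, preds))      # class 2 is never predicted correctly
print(worst_k_accuracy(labels, preds, k=1))   # worst-1 accuracy
```

In this toy case a model that looks acceptable on average fails one category completely, which is the pattern the ImageNet figures above report at larger scale.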