Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment

Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification. Despite this success, most traditional VLM-based methods assume partial source supervision or an ideal vocabulary, assumptions that rarely hold in open-world scenarios. In this paper, we target a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotations and only a broad vocabulary. To address this challenge, we propose the Self Structural Semantic Alignment (S^3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning. S^3A adopts a Cluster-Vote-Prompt-Realign (CVPR) algorithm, which iteratively groups unlabeled data to derive structural semantics for pseudo-supervision. The CVPR process consists of four steps: iteratively clustering the images, voting within each cluster to identify initial class candidates from the vocabulary, generating discriminative prompts with large language models to distinguish confusing candidates, and realigning images with the vocabulary to obtain the structural semantic alignment. Finally, we self-train the CLIP image encoder with both individual and structural semantic alignment through a teacher-student learning strategy. Comprehensive experiments across generic and fine-grained benchmarks show that S^3A offers substantial improvements over existing VLM-based approaches, improving accuracy over CLIP by more than 15% on average. Our code, models, and prompts are publicly released at https://github.com/sheng-eatamath/S3A.
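
To make the cluster, vote, and realign steps described above more concrete, the following is a minimal sketch and not the authors' implementation. It assumes image and vocabulary embeddings are already available as L2-normalized arrays (synthetic stand-ins for CLIP features are used here), substitutes a simple cosine k-means for whatever clustering the released code uses, and omits the LLM prompt step entirely, so realignment reuses the plain vocabulary embeddings. All function names (`kmeans`, `vote_candidates`, `realign`) and parameters (`top_m`, `iters`) are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def kmeans(x, k, iters=20, seed=0):
    """Cosine k-means on unit vectors (a stand-in clustering step)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmax(x @ centers.T, axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:
                centers[j] = l2_normalize(members.mean(axis=0))
    return labels


def vote_candidates(img, txt, labels, k, top_m=3):
    """Each image votes for its nearest vocabulary entry; keep the top_m per cluster."""
    nearest = np.argmax(img @ txt.T, axis=1)
    candidates = {}
    for j in range(k):
        votes = np.bincount(nearest[labels == j], minlength=txt.shape[0])
        order = np.argsort(-votes, kind="stable")[:top_m]
        candidates[j] = [int(i) for i in order if votes[i] > 0]
    return candidates


def realign(img, txt, labels, candidates):
    """Assign each cluster the candidate closest to its mean image embedding.

    Assumption: the paper scores candidates with LLM-generated prompts;
    here the original vocabulary embeddings are reused instead.
    """
    pseudo = np.full(len(img), -1)
    for j, cand in candidates.items():
        members = labels == j
        if not cand or not members.any():
            continue
        scores = img[members].mean(axis=0) @ txt[cand].T
        pseudo[members] = cand[int(np.argmax(scores))]
    return pseudo


# Synthetic demo: a 6-word vocabulary, images drawn from 3 of those words.
rng = np.random.default_rng(0)
vocab_emb = l2_normalize(np.eye(6, 16))
true_classes = np.repeat([0, 2, 4], 20)
image_emb = l2_normalize(vocab_emb[true_classes] + 0.05 * rng.standard_normal((60, 16)))

num_clusters = 3
cluster_labels = kmeans(image_emb, num_clusters)
candidates = vote_candidates(image_emb, vocab_emb, cluster_labels, num_clusters)
pseudo_labels = realign(image_emb, vocab_emb, cluster_labels, candidates)
```

In this sketch, `pseudo_labels` plays the role of the cluster-level (structural) pseudo-supervision, while the per-image `argmax` inside `vote_candidates` corresponds to individual alignment. A teacher-student loop would then train the image encoder on both signals, which is beyond the scope of this example.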
