Pretrained vision-language models such as CLIP tackle the zero-shot open-vocabulary challenge in image classification, and they benefit from incorporating class-specific knowledge from large language models (LLMs) such as ChatGPT. However, biases in CLIP lead to similar descriptions for distinct but related classes. This motivates our novel image classification framework based on hierarchical comparisons: LLMs recursively group classes into hierarchies, and images are classified by comparing image-text embeddings at each hierarchy level. The result is an intuitive, effective, and explainable approach.
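The hierarchical comparison idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a simple greedy descent that picks the best-matching child group at each level, and it replaces real CLIP embeddings and LLM-generated groupings with hand-written toy vectors. The names `Node`, `cosine`, and `classify` are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """One group (or leaf class) in the hierarchy, with a text embedding."""
    label: str
    text_emb: List[float]
    children: List["Node"] = field(default_factory=list)


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(image_emb: List[float], root: Node) -> Tuple[str, List[str]]:
    """Greedy descent (an assumption): at each level, move to the child
    whose text embedding is most similar to the image embedding."""
    path = []
    node = root
    while node.children:
        node = max(node.children, key=lambda c: cosine(image_emb, c.text_emb))
        path.append(node.label)
    return node.label, path


# Toy hierarchy standing in for LLM-generated groups (made-up vectors).
root = Node("root", [0.0, 0.0, 0.0], [
    Node("animals", [1.0, 0.0, 0.0], [
        Node("cat", [1.0, 0.0, 0.5]),
        Node("dog", [1.0, 0.0, -0.5]),
    ]),
    Node("vehicles", [0.0, 1.0, 0.0], [
        Node("car", [0.0, 1.0, 0.5]),
        Node("truck", [0.0, 1.0, -0.5]),
    ]),
])

# Toy image embedding; in practice this would come from an image encoder.
image_emb = [0.9, 0.1, -0.4]
label, path = classify(image_emb, root)
print(label, path)
```

The returned path records which group won at each level, which is one way such a scheme can expose the reasoning behind a prediction. A greedy descent cannot recover from a wrong choice at a coarse level, so a fuller treatment would likely combine scores across levels.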