CLIP-based classifiers rely on the prompt containing a {class name} that is known to the text encoder. As a result, CLIP performs poorly on new classes or classes whose names rarely appear on the Internet (e.g., scientific names of birds). For fine-grained classification, we propose PEEB - an explainable and editable classifier to (1) express the class name as a set of pre-defined text descriptors that describe the visual parts of that class; and (2) match the embeddings of the detected parts to their textual descriptors in each class to compute a logit score for classification. In a zero-shot setting where the class names are unknown, PEEB outperforms CLIP by a large margin (~10x in accuracy). Compared to part-based classifiers, PEEB is not only the state-of-the-art in the supervised-learning setting (88.80% accuracy) but also the first to enable users to edit the class definitions to form a new classifier without retraining. Compared to concept bottleneck models, PEEB is also the state-of-the-art in both zero-shot and supervised-learning settings.
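The matching step in (2) amounts to scoring each detected part embedding against the corresponding part descriptor embedding of every class and summing those part-level scores into one logit per class. Below is a minimal sketch of that scoring step, assuming the part and descriptor embeddings have already been computed; the function name `peeb_logits` and the tensor shapes are illustrative, not the paper's actual implementation.

```python
import torch


def peeb_logits(part_embeds: torch.Tensor, descriptor_embeds: torch.Tensor) -> torch.Tensor:
    """Match detected part embeddings to per-class part descriptors.

    part_embeds:       [B, P, D] image-side embeddings of the P detected parts
    descriptor_embeds: [C, P, D] text-side embeddings of the P descriptors per class
    returns:           [B, C]    one logit score per class
    """
    # Normalize so each part-descriptor match is a cosine similarity (assumption).
    part_embeds = torch.nn.functional.normalize(part_embeds, dim=-1)
    descriptor_embeds = torch.nn.functional.normalize(descriptor_embeds, dim=-1)

    # Similarity between part p of image b and the descriptor of part p in class c,
    # then sum over parts to obtain a single logit per class.
    sims = torch.einsum("bpd,cpd->bcp", part_embeds, descriptor_embeds)
    return sims.sum(dim=-1)


# Usage with illustrative sizes: 12 parts, 512-dim embeddings, 200 classes.
logits = peeb_logits(torch.randn(4, 12, 512), torch.randn(200, 12, 512))
pred = logits.argmax(dim=-1)
```

Because the logits depend only on the descriptor embeddings, editing a class definition (swapping in new descriptors) changes the classifier without any retraining, which is the editability property claimed above.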