The visual classification performance of vision-language models such as CLIP can benefit from additional semantic knowledge, e.g. from large language models (LLMs) such as GPT-3. In particular, extending class names with LLM-generated class descriptors, e.g. ``waffle, \textit{which has a round shape}'', or averaging retrieval scores over multiple such descriptors, has been shown to improve generalization performance. In this work, we study this behavior in detail and propose \texttt{Waffle}CLIP, a framework for zero-shot visual classification that achieves similar performance gains on a large number of visual classification tasks by simply replacing LLM-generated descriptors with random character and word descriptors, \textbf{without} querying external models. We extend these results with an extensive experimental study of the impact and shortcomings of the additional semantics introduced via LLM-generated descriptors, and show how semantic context is better leveraged by automatically querying LLMs for high-level concepts, which jointly resolves potential class name ambiguities. Code is available at https://github.com/ExplainableML/WaffleCLIP.
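The core mechanism described in the abstract, extending each class name with several descriptors and averaging the resulting image-text similarity scores, can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration rather than the released implementation: the prompt template is modeled on the abstract's ``waffle, which has a round shape'' example, the helper names (`random_char_descriptor`, `random_word_descriptor`, `build_prompts`, `class_scores`, `classify`) are hypothetical, the same random descriptors are reused for every class as a simplifying choice, and `toy_embed` is a deterministic stand-in for a text encoder such as CLIP's, not a real API.

```python
import random
import string
import zlib

import numpy as np


def random_char_descriptor(rng, num_words=2, word_len=5):
    """Build a descriptor out of random lowercase character 'words'."""
    return " ".join(
        "".join(rng.choice(string.ascii_lowercase) for _ in range(word_len))
        for _ in range(num_words)
    )


def random_word_descriptor(rng, vocab, num_words=2):
    """Build a descriptor out of words drawn at random from `vocab`
    (the vocabulary is a hypothetical input, not specified by the abstract)."""
    return " ".join(rng.choice(vocab) for _ in range(num_words))


def build_prompts(classname, descriptors):
    """Extend a class name with each descriptor (assumed template)."""
    return [f"{classname}, which has {d}" for d in descriptors]


def toy_embed(text, dim=32):
    """Stand-in text encoder: a pseudo-random vector seeded by the text.
    A real pipeline would call a vision-language model's text encoder here."""
    seed = zlib.crc32(text.encode("utf-8"))
    return np.random.default_rng(seed).standard_normal(dim)


def class_scores(image_emb, prompt_embs_per_class):
    """Average the cosine similarity between the image embedding and
    every prompt embedding of a class, for each class in turn."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    scores = []
    for embs in prompt_embs_per_class:
        embs = np.asarray(embs, dtype=float)
        embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
        scores.append(float((embs @ image_emb).mean()))
    return np.array(scores)


def classify(image_emb, classnames, descriptors, embed=toy_embed):
    """Return the class whose descriptor-extended prompts score highest
    on average, together with the per-class averaged scores."""
    per_class = [
        np.stack([embed(p) for p in build_prompts(c, descriptors)])
        for c in classnames
    ]
    scores = class_scores(image_emb, per_class)
    return classnames[int(np.argmax(scores))], scores
```

With the toy encoder the predicted label carries no meaning; the sketch only shows the data flow. Passing a real image embedding and a real text encoder as `embed` would be the natural substitution, and swapping the random descriptors for LLM-generated ones changes only the `descriptors` argument.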