Vision-language models enable open-world classification of objects without the need for any retraining. While this zero-shot paradigm marks a significant advance, even today's best models exhibit skewed performance when objects are dissimilar from their typical depiction. Real-world objects such as pears appear in a variety of forms -- from diced to whole, on a table or in a bowl -- yet standard VLM classifiers map all instances of a class to a \textit{single vector based on the class label}. We argue that to represent this rich diversity within a class, zero-shot classification should move beyond a single vector. We propose a method to encode and account for diversity within a class using inferred attributes, still in the zero-shot setting without retraining. We find our method consistently outperforms standard zero-shot classification over a large suite of datasets encompassing hierarchies, diverse object states, and real-world geographic diversity, as well as finer-grained datasets where intra-class diversity may be less prevalent. Importantly, our method is inherently interpretable, offering faithful explanations for each inference to facilitate model debugging and enhance transparency. We also find our method scales efficiently to a large number of attributes to account for diversity -- leading to more accurate predictions for atypical instances. Finally, we characterize a principled trade-off between overall and worst-class accuracy, which can be tuned via a hyperparameter of our method. We hope this work spurs further research into the promise of zero-shot classification beyond a single class vector for capturing diversity in the world, and building transparent AI systems without compromising performance.
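The contrast between a single class vector and several attribute vectors per class can be illustrated with a toy sketch. This is not the method proposed in this work: the 2-D embeddings are invented stand-ins for VLM text and image embeddings, the attribute names are hypothetical, and the top-k averaging rule with its parameter `k` is an assumed scoring rule chosen only to show how a hyperparameter can change which instances are classified correctly.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def single_vector_predict(image, class_vectors):
    """Standard zero-shot rule: pick the class whose one vector is most similar."""
    return max(class_vectors, key=lambda label: cosine(image, class_vectors[label]))


def multi_vector_predict(image, attribute_vectors, k=1):
    """Assumed rule: score each class by the mean of its top-k attribute similarities.

    Returns the predicted label and the attributes that produced its score,
    which serve as a simple explanation of the prediction.
    """
    best_label, best_score, best_evidence = None, -math.inf, []
    for label, attributes in attribute_vectors.items():
        sims = sorted(
            ((cosine(image, vec), name) for name, vec in attributes.items()),
            reverse=True,
        )
        top = sims[:k]
        score = sum(sim for sim, _ in top) / len(top)
        if score > best_score:
            best_label, best_score = label, score
            best_evidence = [name for _, name in top]
    return best_label, best_evidence


# Toy embeddings (invented for illustration, not produced by any real model).
class_vectors = {"pear": [1.0, 0.0], "apple": [0.6, 0.8]}
attribute_vectors = {
    "pear": {"whole pear": [1.0, 0.0], "diced pear": [0.0, 1.0]},
    "apple": {"whole apple": [0.6, 0.8], "sliced apple": [0.8, 0.6]},
}
image = [0.1, 1.0]  # stands in for an atypical instance: a diced pear

print(single_vector_predict(image, class_vectors))
print(multi_vector_predict(image, attribute_vectors, k=1))
print(multi_vector_predict(image, attribute_vectors, k=2))
```

In this toy setup the single-vector rule labels the diced pear as an apple, while the attribute-based rule with `k=1` recovers "pear" and names "diced pear" as its evidence. Raising `k` to 2 averages in the poorly matching "whole pear" attribute and flips the prediction back to "apple", which shows how such a hyperparameter can shift the balance between typical and atypical instances.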