Improving Zero-shot Generalization and Robustness of Multi-modal Models

Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks, and their zero-shot generalization ability is particularly exciting. While the top-5 zero-shot accuracies of these models are very high, the top-1 accuracies are much lower (a gap of over 25% in some cases). We investigate the reasons for this performance gap and find that many of the failure cases are caused by ambiguity in the text prompts. First, we develop a simple and efficient zero-shot post-hoc method to identify images whose top-1 prediction is likely to be incorrect, by measuring the consistency of the predictions with respect to multiple prompts and image transformations. We show that our procedure better predicts mistakes, outperforming the popular max logit baseline on selective prediction tasks. Next, we propose a simple and efficient way to improve accuracy on such uncertain images by making use of the WordNet hierarchy; specifically, we augment the original class by incorporating its parent and children from the semantic label hierarchy, and plug the augmentation into the text prompts. We conduct experiments on both CLIP and LiT models with five different ImageNet-based datasets. For CLIP, our method improves the top-1 accuracy by 17.13% on the uncertain subset and 3.6% on the entire ImageNet validation set. We also show that our method improves across ImageNet shifted datasets, four other datasets, and other model architectures such as LiT. The proposed method is hyperparameter-free, requires no additional model training, and can be easily scaled to other large multi-modal architectures. Code is available at https://github.com/gyhandy/Hierarchy-CLIP.
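The two steps described above can be illustrated with a minimal sketch. It is not the released implementation: the function names, the prompt template, and the "which is a kind of" wording are assumptions made for illustration, and the score vectors stand in for image-text similarities that a model such as CLIP would produce. The first helper flags an image as uncertain when its top-1 class changes across prompts or image transformations; the second builds prompts that add a class's parent and children from a label hierarchy.

```python
from typing import List, Optional, Sequence


def top1(scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring class."""
    return max(range(len(scores)), key=lambda i: scores[i])


def is_uncertain(score_sets: Sequence[Sequence[float]]) -> bool:
    """Flag an image as uncertain when its top-1 class is not the same
    under every prompt or image transformation.

    Each entry of score_sets is one vector of per-class scores, obtained
    with a different prompt template or a different image transformation.
    """
    predictions = {top1(scores) for scores in score_sets}
    return len(predictions) > 1


def hierarchy_prompts(
    label: str,
    parent: Optional[str],
    children: List[str],
    template: str = "a photo of a {}.",  # hypothetical template
) -> List[str]:
    """Build text prompts for a class, augmented with its parent and
    children from a semantic label hierarchy (wording is an assumption)."""
    names = [label]
    if parent:
        names.append(f"{label}, which is a kind of {parent}")
    names.extend(f"{child}, which is a kind of {label}" for child in children)
    return [template.format(name) for name in names]


# Scores for one image under two prompts that disagree on the top-1 class.
flagged = is_uncertain([[0.1, 0.7, 0.2], [0.5, 0.4, 0.1]])
prompts = hierarchy_prompts("dog", "animal", ["puppy"])
```

In a real pipeline, the augmented prompts would be encoded by the text tower and applied only to the images flagged as uncertain, which keeps the procedure free of additional training.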
