CLIP (Contrastive Language-Image Pre-training) uses contrastive learning on noisy image-text pairs and excels at recognizing a wide array of candidate categories, yet its focus on broad associations limits its precision in distinguishing subtle differences among fine-grained items. Conversely, Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories, thanks to the substantial knowledge they acquire from pre-training on web-level corpora. However, the performance of MLLMs declines as the number of categories increases, primarily because of the growing complexity and the limited size of the context window. To combine the strengths of both approaches and enhance few-shot/zero-shot recognition on datasets with extensive, fine-grained vocabularies, this paper introduces RAR, a Retrieving And Ranking augmented method for MLLMs. We first build a multi-modal retriever based on CLIP to create and store explicit memory for different categories beyond the immediate context window. During inference, RAR retrieves the top-k similar results from the memory and uses MLLMs to rank them and make the final prediction. Our approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base, boosting accuracy across a range of vision-language recognition tasks. Notably, it demonstrates a significant improvement in performance on 5 fine-grained visual recognition benchmarks, 11 few-shot image recognition datasets, and 2 object detection datasets under the zero-shot recognition setting.
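The retrieve-then-rank flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `CategoryMemory` class, the `retrieve_and_rank` function, the hand-written 3-dimensional embeddings, and the `toy_ranker` stand-in for the MLLM are all assumptions made for illustration. In practice the embeddings would come from a CLIP encoder and the ranker would prompt an MLLM with the retrieved category names.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class CategoryMemory:
    """Hypothetical external memory: category names paired with embeddings."""

    def __init__(self) -> None:
        self.entries: List[Tuple[str, Vector]] = []

    def add(self, name: str, embedding: Vector) -> None:
        self.entries.append((name, embedding))

    def retrieve(self, query: Vector, k: int) -> List[str]:
        """Return the k category names most similar to the query, best first."""
        scored = sorted(
            self.entries, key=lambda e: cosine(query, e[1]), reverse=True
        )
        return [name for name, _ in scored[:k]]


def retrieve_and_rank(
    query: Vector,
    memory: CategoryMemory,
    ranker: Callable[[List[str]], List[str]],
    k: int = 3,
) -> str:
    """Retrieve top-k candidates, let the ranker reorder them, return the top one."""
    candidates = memory.retrieve(query, k)
    # Keep only labels that were actually retrieved, in the ranker's order.
    ranked = [c for c in ranker(candidates) if c in candidates]
    # Fall back to retrieval order if the ranker returns nothing usable.
    return (ranked or candidates)[0]


# Toy embeddings (assumed values, not CLIP outputs).
memory = CategoryMemory()
memory.add("sparrow", [1.0, 0.0, 0.0])
memory.add("finch", [0.9, 0.1, 0.0])
memory.add("car", [0.0, 1.0, 0.0])
memory.add("airplane", [0.0, 0.0, 1.0])

query = [0.92, 0.12, 0.0]


def toy_ranker(candidates: List[str]) -> List[str]:
    """Stand-in for an MLLM: moves 'sparrow' to the front if it was retrieved."""
    return sorted(candidates, key=lambda c: c != "sparrow")


prediction = retrieve_and_rank(query, memory, toy_ranker, k=2)
print(prediction)
```

In this toy setup the retriever alone places "finch" ahead of "sparrow", and the ranking step overrides that order among the two retrieved candidates. This mirrors the division of labor in the abstract: a coarse similarity search narrows a large vocabulary to k candidates, and a second model makes the fine-grained choice within that short list.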