UniFGVC: Universal Training-Free Few-Shot Fine-Grained Vision Classification via Attribute-Aware Multimodal Retrieval

Few-shot fine-grained visual classification (FGVC) aims to use limited data to enable models to discriminate between subtly distinct categories. Most recent work fine-tunes pre-trained vision-language models to improve performance, but suffers from overfitting and weak generalization. To address this, we introduce UniFGVC, a universal training-free framework that reformulates few-shot FGVC as multimodal retrieval. First, we propose the Category-Discriminative Visual Captioner (CDV-Captioner), which exploits the open-world knowledge of multimodal large language models (MLLMs) to generate a structured text description capturing the fine-grained attributes that distinguish closely related classes. CDV-Captioner uses chain-of-thought prompting and visually similar reference images to reduce hallucination and make the generated captions more discriminative. With it, we convert each image into an image-description pair, which enables a more comprehensive feature representation, and we construct multimodal category templates from the few-shot samples for the subsequent retrieval pipeline. Then, off-the-shelf vision and text encoders embed the query and template pairs, and FGVC is accomplished by retrieving the nearest template in the joint space. UniFGVC is compatible with diverse MLLMs and encoders, offering reliable generalization and adaptability across few-shot FGVC scenarios. Extensive experiments on 12 FGVC benchmarks demonstrate its consistent superiority over prior few-shot CLIP-based methods and even several fully supervised MLLM-based approaches.
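
The retrieval step described above can be illustrated with a minimal sketch. It assumes the image and description embeddings have already been produced by some vision and text encoders, and it is not the authors' implementation. The abstract does not specify how the two modalities are combined, so the fusion rule here (a weighted concatenation of L2-normalized embeddings, scored by cosine similarity) is an assumption, as are the function names `fuse` and `classify` and the weight `alpha`.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit L2 norm."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def fuse(image_emb, text_emb, alpha=0.5):
    """Build a joint embedding from an image embedding and a description embedding.

    Assumption: weighted concatenation of the two normalized embeddings.
    The source only says retrieval happens "in the joint space".
    """
    img = l2_normalize(np.asarray(image_emb, dtype=np.float64))
    txt = l2_normalize(np.asarray(text_emb, dtype=np.float64))
    joint = np.concatenate([alpha * img, (1.0 - alpha) * txt], axis=-1)
    return l2_normalize(joint)


def classify(query_img, query_txt, tmpl_img, tmpl_txt, tmpl_labels, alpha=0.5):
    """Return the label of the nearest template and its cosine similarity.

    query_img, query_txt: 1-D embeddings of the query image and its description.
    tmpl_img, tmpl_txt:   2-D arrays, one row per few-shot template.
    tmpl_labels:          category label for each template row.
    """
    q = fuse(query_img, query_txt, alpha)   # shape (d_img + d_txt,)
    t = fuse(tmpl_img, tmpl_txt, alpha)     # shape (n_templates, d_img + d_txt)
    sims = t @ q                            # cosine similarity, since rows are unit norm
    best = int(np.argmax(sims))
    return tmpl_labels[best], float(sims[best])
```

Because no parameters are learned, adding a category in this sketch only requires appending its few-shot template rows and labels; swapping encoders changes only how the input embeddings are computed.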
