Few-shot, fine-grained classification in computer vision poses significant challenges due to the need to differentiate subtle class distinctions with limited data. This paper presents a novel method that enhances the Contrastive Language-Image Pre-Training (CLIP) model through adaptive prompt tuning guided by real-time visual inputs. Unlike existing techniques such as Context Optimization (CoOp) and Visual Prompt Tuning (VPT), which are constrained by static prompts or reliance on visual tokens, the proposed approach uses a cross-attention mechanism to dynamically refine the text prompts for the image at hand. This enables image-specific alignment of textual features with image patches extracted from the Vision Transformer, making the model more effective on datasets with high intra-class variance and small inter-class differences. The method is evaluated on several datasets, including CUBirds, Oxford Flowers, and FGVC Aircraft, and shows significant performance gains over static prompt-tuning approaches. To ensure these gains translate into trustworthy predictions, we integrate Monte Carlo Dropout to improve the reliability of the model's predictions and uncertainty estimates. This provides valuable insight into the model's predictive confidence, helping to identify when predictions can be trusted and when additional verification is necessary. The resulting dynamic approach offers a robust solution that advances the state of the art in few-shot fine-grained classification.
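To make the mechanism concrete, the following is a minimal PyTorch sketch of cross-attention-based prompt refinement: learnable context tokens act as queries that attend to ViT patch tokens (keys/values), yielding image-specific prompt tokens. All module names, dimensions, and the single attention block are illustrative assumptions; the abstract does not specify the exact architecture.

```python
# A minimal sketch, assuming a CLIP-style setup with a ViT image encoder.
# Names, dimensions, and the single attention block are hypothetical.
import torch
import torch.nn as nn

class PromptRefiner(nn.Module):
    """Refines learnable text-prompt tokens with cross-attention over
    ViT image-patch tokens, so each image gets its own prompt."""

    def __init__(self, n_prompts: int = 16, dim: int = 512, n_heads: int = 8):
        super().__init__()
        # Static learnable context vectors, as in CoOp-style prompt tuning.
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        # Prompts are queries; image patch tokens are keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, n_patches, dim) from the ViT image encoder.
        b = patch_tokens.size(0)
        q = self.prompts.unsqueeze(0).expand(b, -1, -1)  # (b, n_prompts, dim)
        refined, _ = self.cross_attn(q, patch_tokens, patch_tokens)
        # Residual connection keeps the static prompts as a starting point.
        return self.norm(q + refined)  # image-specific prompt tokens

# Hypothetical usage: prepend the refined tokens to class-name embeddings
# before the CLIP text encoder.
refiner = PromptRefiner()
patches = torch.randn(4, 196, 512)   # e.g. ViT-B/16 patch tokens
dynamic_prompts = refiner(patches)   # (4, 16, 512)
```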
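Likewise, the uncertainty-estimation step can be sketched as standard Monte Carlo Dropout at inference time, assuming a PyTorch classifier: dropout layers stay active while the rest of the model remains in eval mode, several stochastic forward passes are averaged, and predictive entropy serves as the per-sample uncertainty signal. The function name and pass count below are hypothetical.

```python
# A minimal sketch of Monte Carlo Dropout inference; `model`, `t`, and the
# entropy-based uncertainty readout are illustrative assumptions.
import torch
import torch.nn.functional as F

@torch.no_grad()
def mc_dropout_predict(model: torch.nn.Module, images: torch.Tensor, t: int = 20):
    model.eval()
    # Re-enable dropout layers only, keeping layers like batch norm in eval mode.
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()
    # Average class probabilities over t stochastic forward passes.
    probs = torch.stack([F.softmax(model(images), dim=-1) for _ in range(t)])
    mean_probs = probs.mean(dim=0)  # (batch, n_classes)
    # Predictive entropy: low values suggest the prediction can be trusted;
    # high values flag samples that may need additional verification.
    entropy = -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(dim=-1)
    return mean_probs, entropy
```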