Distilling knowledge from large Vision-Language Models (VLMs) into lightweight networks is crucial yet challenging in Fine-Grained Visual Classification (FGVC), as existing approaches rely on fixed prompts and global feature alignment. To address this, we propose PAND (Prompt-Aware Neighborhood Distillation), a two-stage framework that decouples semantic calibration from structural transfer. First, we introduce Prompt-Aware Semantic Calibration to generate adaptive semantic anchors. Second, we propose a neighborhood-aware structural distillation strategy that constrains the student's local decision structure. PAND consistently outperforms state-of-the-art methods on four FGVC benchmarks. Notably, our ResNet-18 student achieves 76.09% accuracy on CUB-200, surpassing the strong baseline VL2Lite by 3.4%. Code is available at https://github.com/LLLVTA/PAND.