Distilling knowledge from large Vision-Language Models (VLMs) into lightweight networks is crucial yet challenging in Fine-Grained Visual Classification (FGVC), because existing approaches rely on fixed prompts and global alignment. To address this, we propose PAND (Prompt-Aware Neighborhood Distillation), a two-stage framework that decouples semantic calibration from structural transfer. First, Prompt-Aware Semantic Calibration generates adaptive semantic anchors. Second, a neighborhood-aware structural distillation strategy constrains the student's local decision structure. PAND consistently outperforms state-of-the-art methods on four FGVC benchmarks. Notably, our ResNet-18 student achieves 76.09% accuracy on CUB-200, surpassing the strong baseline VL2Lite by 3.4%. Code is available at https://github.com/LLLVTA/PAND.