Few-shot adaptation of vision-language models remains limited by how negative class signals are handled at inference. Existing methods apply uniform negative suppression across all queries, ignoring that the most damaging confusions are query-specific and shift with support-set geometry. We introduce SCAN (Selective Confusion-Aware Negatives), a framework that addresses this gap through three targeted contributions. First, at inference, query-adaptive negative routing restricts suppression to the top-K most confusable classes for each query and requires no additional parameters. Second, generic negative text templates are replaced with LLM-bootstrapped contrastive prompts that describe the attributes discriminating between confusable class pairs, sharpening the textual decision boundary where it matters most. Third, a parameter-free adaptive fusion weight, estimated from support-set Fisher discriminability, removes the need to manually tune the vision-language trade-off. Evaluated across 11 standard benchmarks, SCAN consistently outperforms prior prompt-based and adapter-based methods by an average of 4.61% at 16-shot, with gains of up to 7.70% on fine-grained datasets, where inter-class confusion is most severe. SCAN also generalizes well under distribution shift, improving by 2.95% on average across four ImageNet OOD variants, and remains robust under heavy label noise: its accuracy under 50% label corruption still exceeds the accuracy that the strongest competing method achieves on clean labels.
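To make the two parameter-free components more concrete, the following is a minimal sketch and not the SCAN implementation. The abstract does not give the scoring rules, so several choices here are assumptions: the function names `route_negatives` and `fisher_fusion_weight` are hypothetical, "most confusable" is taken to mean the K classes with the highest positive similarity for the query, the negative evidence is subtracted with a hypothetical scale `beta`, and the Fisher ratio r (between-class over within-class scatter) is mapped to a weight as r / (1 + r).

```python
from typing import List, Sequence


def route_negatives(
    pos_logits: Sequence[float],
    neg_logits: Sequence[float],
    k: int,
    beta: float = 1.0,
) -> List[float]:
    """Subtract negative-prompt evidence only for the k top-scoring classes.

    Assumption: the k classes with the highest positive logits for this
    query are treated as its most confusable classes.
    """
    n = len(pos_logits)
    if len(neg_logits) != n:
        raise ValueError("pos_logits and neg_logits must have the same length")
    k = max(0, min(k, n))
    # Class indices sorted by positive score, highest first; keep the first k.
    top = sorted(range(n), key=lambda c: pos_logits[c], reverse=True)[:k]
    routed = list(pos_logits)
    for c in top:
        # All other classes keep their positive logit untouched.
        routed[c] -= beta * neg_logits[c]
    return routed


def fisher_fusion_weight(
    support: Sequence[Sequence[Sequence[float]]],
    eps: float = 1e-12,
) -> float:
    """Map the support-set Fisher ratio to a fusion weight in [0, 1).

    `support` holds one list of feature vectors per class. The mapping
    r / (1 + r) is an assumption made for this sketch.
    """
    if not support or any(len(cls) == 0 for cls in support):
        raise ValueError("every class needs at least one support vector")
    all_vecs = [x for cls in support for x in cls]
    dim = len(all_vecs[0])
    global_mean = [sum(x[d] for x in all_vecs) / len(all_vecs) for d in range(dim)]

    between = 0.0  # scatter of class means around the global mean
    within = 0.0   # scatter of samples around their own class mean
    for cls in support:
        mean = [sum(x[d] for x in cls) / len(cls) for d in range(dim)]
        between += len(cls) * sum((m - g) ** 2 for m, g in zip(mean, global_mean))
        within += sum(sum((v - m) ** 2 for v, m in zip(x, mean)) for x in cls)

    ratio = between / (within + eps)
    return ratio / (1.0 + ratio)
```

Under these assumptions, `route_negatives` leaves low-scoring classes untouched, whereas uniform suppression would penalize every class. `fisher_fusion_weight` grows toward 1 as the support features of different classes separate, which could be read as placing more trust in the visual branch when the support set is highly discriminable; the direction of that trade-off is likewise an assumption.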