Existing vision-language models exhibit strong generalization across a variety of visual domains and tasks. However, such models mainly perform zero-shot recognition in a closed-set manner and thus, by design, struggle to handle open-domain visual concepts. Recent finetuning methods, such as prompt learning, not only study the discrimination between in-distribution (ID) and out-of-distribution (OOD) samples but also show some improvements in both ID and OOD accuracy. In this paper, we first demonstrate that vision-language models, when finetuned for long enough without proper regularization, tend to overfit the known classes in the given dataset, with degraded performance on unknown classes. We then propose a novel approach, OGEN, to address this pitfall, with the main focus on improving the OOD GENeralization of finetuned models. Specifically, a class-conditional feature generator is introduced to synthesize OOD features using only the class name of any unknown class. When optimized jointly, such synthesized features provide useful knowledge about unknowns and help regularize the decision boundary between ID and OOD data. Equally important is our adaptive self-distillation mechanism, which regularizes the feature generation model during joint optimization by adaptively transferring knowledge between model states to further prevent overfitting. Experiments validate that our method yields convincing gains in OOD generalization performance in different settings.
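To make the two ingredients above more concrete, the following is a minimal illustrative sketch, not the OGEN implementation. The abstract does not specify the generator architecture or how knowledge is transferred between model states, so both parts are stand-ins chosen for simplicity: the hypothetical `synthesize_unknown_feature` mixes the image features of the known classes whose text embeddings are closest to the unknown class name, and the hypothetical `ema_update` keeps a slowly moving average of parameters that could serve as a distillation target. All names, shapes, and hyperparameters (`k`, `temperature`, `momentum`) are assumptions.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def synthesize_unknown_feature(unknown_text, known_text, known_img,
                               k=2, temperature=0.1):
    """Hypothetical stand-in for a class-conditional feature generator.

    unknown_text: (d,) text embedding of the unknown class name.
    known_text:   (n, d) text embeddings of the known class names.
    known_img:    (n, d) one image feature (e.g. a prototype) per known class.
    Returns a (d,) unit-norm feature for the unknown class, built as a
    softmax-weighted mixture of the k most similar known classes.
    """
    sims = known_text @ unknown_text            # similarity to each known class
    top = np.argsort(sims)[-k:]                 # indices of the k nearest classes
    weights = np.exp(sims[top] / temperature)
    weights /= weights.sum()
    return l2_normalize(weights @ known_img[top])


def ema_update(teacher, student, momentum=0.9):
    """Assumed form of transferring knowledge between model states:
    an exponential moving average of parameters across training steps."""
    return momentum * teacher + (1.0 - momentum) * student


# Toy usage with random embeddings in place of a real vision-language model.
rng = np.random.default_rng(0)
known_text = l2_normalize(rng.normal(size=(5, 8)))
known_img = l2_normalize(rng.normal(size=(5, 8)))
unknown_text = l2_normalize(rng.normal(size=8))

ood_feature = synthesize_unknown_feature(unknown_text, known_text, known_img)
teacher_params = ema_update(np.ones(3), np.zeros(3))
```

In a joint optimization of the kind the abstract describes, a feature like `ood_feature` would be added to the training batch as an example of its unknown class, while the averaged parameters would supply targets for a distillation loss. The weighting between those terms is not given in the abstract and is left open here.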