Foundation models (e.g., CLIP or DINOv2) have shown impressive learning and transfer capabilities across a wide range of visual tasks, by training on a large corpus of data and adapting to specific downstream tasks. It is notable, however, that foundation models have not been fully explored for universal domain adaptation (UniDA), which learns models from labeled data in a source domain and unlabeled data in a target domain, such that the learned models adapt successfully to the target data. In this paper, we conduct a comprehensive empirical study of state-of-the-art UniDA methods using foundation models. We first demonstrate that, while foundation models greatly improve the performance of baseline methods that train on the source data alone, existing UniDA methods generally fail to improve over this baseline. This suggests that new research efforts are needed for UniDA with foundation models. To this end, we propose a very simple method of target data distillation on the CLIP model, which achieves consistent improvement over the baseline across all the UniDA benchmarks. Our studies are conducted under a newly proposed evaluation metric, the universal classification rate (UCR), which is threshold- and ratio-free and addresses the threshold sensitivity of the existing H-score metric.
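To make the threshold sensitivity of the H-score concrete, the following is a minimal sketch, not the evaluation code used in this paper. It assumes the common definition of the H-score as the harmonic mean of the accuracy on known (shared) classes and the accuracy on the unknown class, and it simplifies by using instance-level rather than per-class accuracy. The names `UNKNOWN`, `h_score`, `h_score_at_threshold`, and the toy data are hypothetical and chosen for illustration only. The sketch does not implement UCR.

```python
UNKNOWN = -1  # hypothetical label for target samples outside the source label set


def h_score(known_acc, unknown_acc):
    """Harmonic mean of known-class accuracy and unknown-class accuracy."""
    if known_acc + unknown_acc == 0:
        return 0.0
    return 2 * known_acc * unknown_acc / (known_acc + unknown_acc)


def h_score_at_threshold(confidences, predictions, labels, threshold):
    """Reject predictions whose confidence is below `threshold` as UNKNOWN,
    then compute the H-score (instance-level accuracies, an assumption)."""
    known_correct = known_total = 0
    unknown_correct = unknown_total = 0
    for conf, pred, label in zip(confidences, predictions, labels):
        final = pred if conf >= threshold else UNKNOWN
        if label == UNKNOWN:
            unknown_total += 1
            unknown_correct += int(final == UNKNOWN)
        else:
            known_total += 1
            known_correct += int(final == label)
    known_acc = known_correct / known_total if known_total else 0.0
    unknown_acc = unknown_correct / unknown_total if unknown_total else 0.0
    return h_score(known_acc, unknown_acc)


# Toy target set: three known-class samples and three unknown-class samples.
confidences = [0.9, 0.8, 0.6, 0.7, 0.4, 0.3]
predictions = [0, 1, 0, 1, 0, 1]
labels = [0, 1, 0, UNKNOWN, UNKNOWN, UNKNOWN]

# The same model outputs yield different H-scores as the threshold moves.
scores = {
    t: h_score_at_threshold(confidences, predictions, labels, t)
    for t in (0.2, 0.5, 0.65, 0.95)
}
for t, s in scores.items():
    print(f"threshold={t:.2f}  H-score={s:.3f}")
```

In this toy setting the model outputs are fixed and only the rejection threshold changes, yet the H-score ranges from zero (everything accepted or everything rejected) to its best value at an intermediate threshold. This is the kind of dependence that a threshold-free metric is meant to avoid.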