Traditional domain adaptation assumes that the source and target domains share the same vocabulary, which limits transfer flexibility and efficiency when the target domain uses a different vocabulary. Inspired by recent vision-language models (VLMs) that enable open-vocabulary visual recognition by reasoning over both images and text, we study open-vocabulary domain adaptation (OVDA), a new unsupervised domain adaptation framework that positions a pre-trained VLM as the source model and transfers it to arbitrary unlabelled target domains. To this end, we design a Prompt Ensemble Self-training (PEST) technique that exploits the synergy between vision and language to mitigate domain discrepancies in the image and text distributions simultaneously. Specifically, PEST exploits the complementary nature of multiple prompts within and across the vision and language modalities, which enables joint use of vision and language information and effective learning of image-text correspondences in unlabelled target domains. Additionally, PEST captures temporal information via a temporal prompt ensemble, which helps memorize previously learnt target information. Extensive experiments show that PEST consistently outperforms the state of the art across 10 image recognition tasks.
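The abstract does not specify how the prompt ensemble or the temporal ensemble is computed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that per-class embeddings from several prompts are averaged into class prototypes, that a temporal ensemble is kept as an exponential moving average of those prototypes across training steps, and that pseudo-labels for unlabelled target images come from cosine similarity to the prototypes. The function names, the `momentum` parameter, and the array shapes are all hypothetical, and random arrays stand in for real VLM features.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def ensemble_prompts(prompt_embeddings):
    """Average several prompt embeddings per class.

    prompt_embeddings: array of shape (num_prompts, num_classes, dim).
    Returns unit-norm class prototypes of shape (num_classes, dim).
    (Assumption: a plain mean; the paper may weight prompts differently.)
    """
    return l2_normalize(l2_normalize(prompt_embeddings).mean(axis=0))


def temporal_update(memory, current, momentum=0.99):
    """Blend current prototypes into a running memory (assumed EMA form)."""
    return l2_normalize(momentum * memory + (1.0 - momentum) * current)


def pseudo_labels(image_features, class_prototypes):
    """Assign each image the class whose prototype is most similar."""
    sims = l2_normalize(image_features) @ class_prototypes.T
    return sims.argmax(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for VLM outputs: 4 prompts, 3 classes, 8-dim embeddings.
    prompts = rng.normal(size=(4, 3, 8))
    images = rng.normal(size=(5, 8))

    prototypes = ensemble_prompts(prompts)
    memory = temporal_update(prototypes, prototypes)
    print(pseudo_labels(images, memory))
```

In a self-training loop under these assumptions, the pseudo-labels would supervise the next update on the target domain, and the memory would be refreshed with the newly computed prototypes after each step.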