Recent advances in deep learning rely on the assumption that training and test data are independent and identically distributed (i.i.d.), which hinders their application in real-world scenarios with domain shift. To address this issue, cross-domain learning aims to extract domain-invariant knowledge that reduces the domain shift between training and testing data. However, in visual cross-domain learning, traditional methods concentrate solely on the image modality and neglect the text modality as a means of alleviating domain shift. In this work, we propose Large Language models as Visual cross-dOmain learners (LLaVO). LLaVO uses vision-language models to convert images into detailed textual descriptions. A large language model is then finetuned on the textual descriptions of the source/target domain using a designed instruction template. Extensive experiments on various cross-domain tasks under the domain generalization and unsupervised domain adaptation settings demonstrate the effectiveness of the proposed method.
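As a rough illustration of the instruction-template step, the sketch below wraps an image description in a prompt that a language model could be finetuned on. The abstract does not give the actual template, so the wording of `TEMPLATE`, the names `Example` and `build_example`, and the sample descriptions are all hypothetical. The descriptions are assumed to have already been produced by a vision-language model.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical template wording: the abstract does not specify the real one.
TEMPLATE = (
    "Image description: {description}\n"
    "Candidate categories: {categories}\n"
    "Question: Which category does the image belong to?\n"
    "Answer:"
)


@dataclass
class Example:
    prompt: str
    completion: Optional[str]  # None when no label is available


def build_example(
    description: str,
    class_names: Sequence[str],
    label: Optional[str] = None,
) -> Example:
    """Format one textual description as an instruction-style example.

    `description` is assumed to come from a vision-language model.
    `label` is given for labeled source-domain samples and omitted
    for unlabeled target-domain samples.
    """
    if label is not None and label not in class_names:
        raise ValueError(f"unknown label: {label!r}")
    prompt = TEMPLATE.format(
        description=description.strip(),
        categories=", ".join(class_names),
    )
    completion = None if label is None else " " + label
    return Example(prompt=prompt, completion=completion)


classes = ["dog", "elephant", "guitar"]

# Labeled source-domain sample (e.g. a sketch).
source = build_example(
    "A pencil sketch of a dog sitting on grass.", classes, label="dog"
)

# Unlabeled target-domain sample (e.g. a photo).
target = build_example("A photo of an acoustic guitar on a stand.", classes)

print(source.prompt + source.completion)
```

Because both domains are reduced to text in the same format, the same prompt builder serves source and target samples. How unlabeled target descriptions enter finetuning is not stated in the abstract and is left out of this sketch.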