Prompt learning has emerged as an effective and data-efficient technique for large Vision-Language Models (VLMs). However, domain prompt learning, that is, adapting VLMs to specialized domains such as remote sensing and medical imaging, remains underexplored. Large-scale domain-specific foundation models can help tackle this challenge, but because they concentrate on the vision level alone, it is difficult to use them to prompt both the vision and language modalities. To overcome this, we propose to use quaternion networks to leverage domain-specific knowledge from domain-specific foundation models, transferring the robust recognition ability of VLMs from generalized to specialized domains. Specifically, the proposed method uses domain-specific vision features from domain-specific foundation models to guide the transformation of generalized contextual embeddings from the language branch into a specialized space within the quaternion networks. Moreover, we present a hierarchical approach that generates vision prompt features by analyzing the intermodal relationships between hierarchical language prompt features and domain-specific vision features. In this way, quaternion networks can effectively mine the intermodal relationships in the specific domain, facilitating domain-specific vision-language contrastive learning. Extensive experiments on domain-specific datasets show that our proposed method achieves new state-of-the-art results in prompt learning.
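The quaternion networks mentioned above build on the Hamilton product, which mixes the four components of each quaternion and so lets features stored in different components interact. The following is a minimal sketch of that operation and of a quaternion-valued linear map, not the authors' implementation. The names `hamilton_product`, `quaternion_linear`, and `pack_features` are hypothetical, and placing a language feature in the real component and a domain-specific vision feature in the first imaginary component is an assumption made only for illustration.

```python
from typing import List, Tuple

# A quaternion r + x*i + y*j + z*k stored as (r, x, y, z).
Quaternion = Tuple[float, float, float, float]


def hamilton_product(p: Quaternion, q: Quaternion) -> Quaternion:
    """Return the Hamilton product p * q (non-commutative)."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    )


def pack_features(language: List[float], vision: List[float]) -> List[Quaternion]:
    """Assumed packing (illustrative only): language feature in the real
    component, domain-specific vision feature in the i component."""
    return [(l, v, 0.0, 0.0) for l, v in zip(language, vision)]


def quaternion_linear(
    weights: List[List[Quaternion]], inputs: List[Quaternion]
) -> List[Quaternion]:
    """Quaternion analogue of a bias-free linear layer:
    output[m] = sum over n of weights[m][n] * inputs[n] (Hamilton product)."""
    outputs = []
    for row in weights:
        acc = (0.0, 0.0, 0.0, 0.0)
        for w, x in zip(row, inputs):
            prod = hamilton_product(w, x)
            acc = tuple(s + t for s, t in zip(acc, prod))
        outputs.append(acc)
    return outputs


if __name__ == "__main__":
    packed = pack_features([0.5, -1.0], [2.0, 0.25])
    # One output unit with two hypothetical quaternion weights.
    weights = [[(1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0)]]
    print(quaternion_linear(weights, packed))
```

Because the Hamilton product ties the four components together through shared weights, a single quaternion weight relates the language and vision components of its input, which is the kind of intermodal mixing the abstract attributes to quaternion networks.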