Large-scale Vision-Language Models (VLMs) have demonstrated exceptional performance in natural vision tasks, motivating researchers across domains to explore domain-specific VLMs. However, building powerful domain-specific VLMs demands vast amounts of annotated data, substantial electrical energy, and computing resources that are accessible primarily to industry, which hinders VLM research in academia. To address this challenge and foster sustainable and equitable VLM research, we present the Generalized Domain Prompt Learning (GDPL) framework. GDPL transfers the robust recognition capabilities of VLMs from natural vision to specialized domains without requiring extensive data or resources. By leveraging small-scale domain-specific foundation models and minimal prompt samples, GDPL injects domain knowledge into the language branch through quaternion networks, which uncover cross-modal relationships between domain-specific vision features and natural vision-based contextual embeddings. Simultaneously, GDPL guides the vision branch into specific domains through hierarchical propagation of generated vision prompt features, grounded in well-matched vision-language relations. Furthermore, to fully harness the domain adaptation potential of VLMs, we introduce a novel low-rank adaptation approach. Extensive experiments across diverse domains, including remote sensing, medical imaging, geology, Synthetic Aperture Radar, and fluid dynamics, validate the efficacy of GDPL, demonstrating its ability to achieve state-of-the-art domain recognition performance in a prompt learning paradigm. Our framework paves the way for sustainable and inclusive VLM research, transcending the barriers between academia and industry.