Recent research on Vision-Language Models (VLMs) has significantly advanced cross-modal reasoning. However, existing methods either degrade under domain shift or require substantial computational resources for fine-tuning in new domains. To address this issue, we develop a new adaptation method for large vision-language models, called \textit{Training-free Dual Hyperbolic Adapters} (T-DHA). We model the vision-language relationships between semantic concepts, which typically form a hierarchical tree, in hyperbolic space rather than in the traditional Euclidean space. The volume of a hyperbolic space grows exponentially with radius, whereas the volume of a Euclidean space grows only polynomially. We find that this property makes the Poincaré ball model particularly effective for embedding hierarchical data structures, yielding significantly improved representation and discrimination power. Coupled with negative learning, it provides more accurate and robust classification with fewer feature dimensions. Extensive experiments on various datasets demonstrate that T-DHA significantly outperforms existing state-of-the-art methods in few-shot image recognition and domain generalization tasks.
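
To make the geometric claim concrete, the following is a minimal sketch of the standard geodesic distance in the unit Poincaré ball (curvature -1). It is an illustration only, not the T-DHA implementation, which the abstract does not describe; the function name `poincare_distance` and the sample points are assumptions chosen for the example.

```python
import math


def poincare_distance(x, y):
    """Geodesic distance between two points strictly inside the unit
    Poincare ball (curvature -1). Hypothetical helper for illustration."""
    sq_norm_x = sum(a * a for a in x)
    sq_norm_y = sum(b * b for b in y)
    if sq_norm_x >= 1.0 or sq_norm_y >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    # d(x, y) = arcosh(1 + 2 * ||x - y||^2 / ((1 - ||x||^2) * (1 - ||y||^2)))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_x) * (1.0 - sq_norm_y))
    return math.acosh(arg)


# Two pairs of points with the same angular layout, one pair near the
# origin and one pair near the boundary of the ball.
near_origin = poincare_distance([0.09, 0.0], [0.0, 0.09])
near_boundary = poincare_distance([0.9, 0.0], [0.0, 0.9])
```

The pair near the boundary is far more distant in hyperbolic terms than its Euclidean gap suggests. This is the extra room that lets sibling nodes deep in a hierarchy stay well separated.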



