Hyperbolic embeddings have proven effective at capturing measures of uncertainty and hierarchical relationships across various deep-learning tasks, including image segmentation and active learning. However, their application in modern vision-language models (VLMs) has been limited. A notable exception is MERU, which leverages the hierarchical properties of hyperbolic space in the CLIP ViT-large model, consisting of hundreds of millions of parameters. In our work, we address the challenges of scaling multi-modal hyperbolic models by orders of magnitude in terms of parameters (billions) and training complexity using the BLIP-2 architecture. Although hyperbolic embeddings offer potential insights into uncertainty not present in Euclidean embeddings, our analysis reveals that scaling these models is particularly difficult. We propose a novel training strategy for a hyperbolic version of BLIP-2 that achieves performance comparable to its Euclidean counterpart while maintaining stability throughout training and providing a meaningful indication of uncertainty with each embedding.