Multi-modal large language models (MLLMs) have emerged as a transformative approach to aligning visual and textual understanding. They typically demand extremely high computational resources (e.g., thousands of GPUs) during training to achieve cross-modal alignment at multiple granularity levels. We argue that a key source of this inefficiency lies in the vision encoders they are widely equipped with, e.g., CLIP and SAM, which lack multi-granularity alignment with language. To address this issue, in this paper we leverage hyperbolic space, which inherently models hierarchical structure and thus provides a principled framework for bridging the granularity gap between the visual and textual modalities at an arbitrary granularity level. Concretely, we propose an efficient training paradigm for MLLMs, dubbed HyperET, which optimizes visual representations to align with their textual counterparts at an arbitrary granularity level through dynamic hyperbolic radius adjustment in hyperbolic space. HyperET employs learnable matrices with M\"{o}bius multiplication, implemented via three effective configurations: diagonal scaling matrices, block-diagonal matrices, and banded matrices, yielding a flexible yet efficient parametrization strategy. Comprehensive experiments across multiple MLLM benchmarks demonstrate that HyperET consistently and clearly improves existing MLLMs in both pre-training and fine-tuning, with less than 1\% additional parameters.
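To make the core operation concrete: the abstract's learnable matrices act on hyperbolic embeddings through M\"{o}bius matrix-vector multiplication. A minimal sketch of the standard formulation on the Poincar\'e ball (following the usual definition from hyperbolic neural network literature, not the paper's exact implementation; the diagonal matrix below is an illustrative stand-in for one of the three configurations):

```python
import numpy as np

def mobius_matvec(M, x, c=1.0, eps=1e-7):
    """Mobius matrix-vector multiplication on the Poincare ball of curvature -c:
    M (x)_c x = (1/sqrt(c)) * tanh((||Mx||/||x||) * artanh(sqrt(c)*||x||)) * Mx/||Mx||.
    Rescaling the norm of Mx this way keeps the result inside the unit ball,
    which is how the hyperbolic radius of a representation can be adjusted."""
    sqrt_c = np.sqrt(c)
    x_norm = np.linalg.norm(x)
    Mx = M @ x
    Mx_norm = np.linalg.norm(Mx)
    if x_norm < eps or Mx_norm < eps:
        return np.zeros_like(Mx)
    # Clip the artanh argument so the input stays strictly inside the open ball.
    arg = np.clip(sqrt_c * x_norm, 0.0, 1.0 - eps)
    scale = np.tanh((Mx_norm / x_norm) * np.arctanh(arg)) / (sqrt_c * Mx_norm)
    return scale * Mx

# Illustrative diagonal scaling matrix (one of the three configurations);
# the values here are hypothetical, chosen only to demonstrate the operation.
x = np.array([0.3, -0.2, 0.1])   # a point inside the unit Poincare ball
D = np.diag([1.5, 0.8, 1.2])     # learnable diagonal parameters in practice
y = mobius_matvec(D, x)
assert np.linalg.norm(y) < 1.0   # the result remains inside the ball
```

Because tanh is bounded by 1, the output norm never reaches the ball's boundary, so a learnable matrix can expand or contract the hyperbolic radius (and hence the granularity level a representation occupies) without leaving the manifold; a diagonal matrix does this with only $d$ extra parameters per layer, which is consistent with the sub-1\% parameter overhead reported.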