Multi-modal large language models (MLLMs) have emerged as a transformative approach to aligning visual and textual understanding. They typically demand extremely high computational resources (e.g., thousands of GPUs) during training to achieve cross-modal alignment at multiple granularity levels. We argue that a key source of this inefficiency lies in the vision encoders they are widely equipped with, e.g., CLIP and SAM, which lack multi-granularity alignment with language. To address this issue, in this paper we leverage hyperbolic space, which inherently models hierarchical structure and thus provides a principled framework for bridging the granularity gap between the visual and textual modalities at an arbitrary granularity level. Concretely, we propose an efficient training paradigm for MLLMs, dubbed HyperET, which optimizes visual representations to align with their textual counterparts at an arbitrary granularity level through dynamic hyperbolic radius adjustment in hyperbolic space. HyperET employs learnable matrices with M\"{o}bius multiplication, implemented via three effective configurations: diagonal scaling matrices, block-diagonal matrices, and banded matrices, yielding a flexible yet efficient parametrization strategy. Comprehensive experiments across multiple MLLM benchmarks demonstrate that HyperET consistently and clearly improves existing MLLMs in both pre-training and fine-tuning, with less than 1\% additional parameters.
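To make the core operation concrete: the abstract's learnable matrices act on hyperbolic embeddings through M\"{o}bius matrix-vector multiplication. A minimal sketch of the standard formulation on the Poincar\'e ball (following the usual definition from hyperbolic neural network literature, not the paper's exact implementation; the diagonal matrix below is an illustrative stand-in for one of the three configurations):

```python
import numpy as np

def mobius_matvec(M, x, c=1.0, eps=1e-7):
    """Mobius matrix-vector multiplication on the Poincare ball of curvature -c:
    M (x)_c x = (1/sqrt(c)) * tanh((||Mx||/||x||) * artanh(sqrt(c)*||x||)) * Mx/||Mx||.
    Rescaling the norm of Mx this way keeps the result inside the unit ball,
    which is how the hyperbolic radius of a representation can be adjusted."""
    sqrt_c = np.sqrt(c)
    x_norm = np.linalg.norm(x)
    Mx = M @ x
    Mx_norm = np.linalg.norm(Mx)
    if x_norm < eps or Mx_norm < eps:
        return np.zeros_like(Mx)
    # Clip the artanh argument so the input stays strictly inside the open ball.
    arg = np.clip(sqrt_c * x_norm, 0.0, 1.0 - eps)
    scale = np.tanh((Mx_norm / x_norm) * np.arctanh(arg)) / (sqrt_c * Mx_norm)
    return scale * Mx

# Illustrative diagonal scaling matrix (one of the three configurations);
# the values here are hypothetical, chosen only to demonstrate the operation.
x = np.array([0.3, -0.2, 0.1])   # a point inside the unit Poincare ball
D = np.diag([1.5, 0.8, 1.2])     # learnable diagonal parameters in practice
y = mobius_matvec(D, x)
assert np.linalg.norm(y) < 1.0   # the result remains inside the ball
```

Because tanh is bounded by 1, the output norm never reaches the ball's boundary, so a learnable matrix can expand or contract the hyperbolic radius (and hence the granularity level a representation occupies) without leaving the manifold; a diagonal matrix does this with only $d$ extra parameters per layer, which is consistent with the sub-1\% parameter overhead reported.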