Visual instruction tuning aims to enable large language models to comprehend the visual world, and a pivotal challenge lies in establishing an effective vision-to-language projection. However, existing methods often face an intractable trade-off between accuracy and efficiency. In this paper, we present LLaVA-Meteor, a novel approach designed to break this deadlock through a Top-Down Compression paradigm that strategically compresses visual tokens without discarding core information. Specifically, we construct a trainable Flash Global Fusion module based on efficient selective state-space operators; it aligns the feature space while allowing each token to perceive the holistic visual context and the instruction preference at low cost. Furthermore, a local-to-single scanning scheme is employed to capture local dependencies, thereby strengthening the model's capability in vision modeling. To alleviate computational overhead, we introduce a Visual-Native Selection mechanism in which a visual expert and a native expert independently assess token significance, after which their scores are aggregated to retain the most critical subset. Extensive experiments show that our approach reduces visual tokens by 75--95% while achieving comparable or superior performance across 12 benchmarks, significantly improving efficiency.
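To make the Visual-Native Selection idea concrete, the sketch below shows one way two independent scorers ("visual" and "native" experts) could rate each visual token, have their scores aggregated, and keep only the top-scoring subset before the tokens reach the LLM. This is a minimal illustration under assumptions: the linear scoring heads, the mean aggregation rule, the `TokenSelector` class name, and the keep ratio are all hypothetical and not taken from the paper.

```python
# Minimal sketch of expert-based token selection (assumed design, not the paper's exact method).
import torch
import torch.nn as nn


class TokenSelector(nn.Module):
    def __init__(self, dim: int, keep_ratio: float = 0.25):
        super().__init__()
        self.visual_scorer = nn.Linear(dim, 1)  # "visual expert" (assumed: a linear scoring head)
        self.native_scorer = nn.Linear(dim, 1)  # "native expert" (assumed: a linear scoring head)
        self.keep_ratio = keep_ratio            # e.g. 0.25 keeps 25% of tokens (a 75% reduction)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: [batch, num_tokens, dim]
        s_vis = self.visual_scorer(tokens).squeeze(-1)  # [B, N] significance per the visual expert
        s_nat = self.native_scorer(tokens).squeeze(-1)  # [B, N] significance per the native expert
        score = 0.5 * (s_vis + s_nat)                   # aggregate the two experts (assumed: mean)
        k = max(1, int(tokens.size(1) * self.keep_ratio))
        idx = score.topk(k, dim=1).indices              # indices of the most significant tokens
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        return torch.gather(tokens, 1, idx)             # retained subset: [B, k, dim]


# Usage: compress 576 visual tokens down to 144 before projection into the LLM.
selector = TokenSelector(dim=1024, keep_ratio=0.25)
kept = selector(torch.randn(2, 576, 1024))  # -> shape [2, 144, 1024]
```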