Recent advances indicate that scaling up Multimodal Large Language Models (MLLMs) effectively improves performance on downstream multimodal tasks. The prevailing MLLM paradigm, \emph{e.g.}, LLaVA, transforms visual features into text-like tokens using a \emph{static} vision-language mapper, thereby enabling \emph{static} LLMs to comprehend visual information through visual instruction tuning. Although promising, the \emph{static} tuning strategy\footnote{Static tuning refers to a trained model whose parameters remain fixed.} that shares the same parameters may constrain performance across different downstream multimodal tasks. In light of this, we introduce HyperLLaVA, which adaptively tunes the projector and LLM parameters in conjunction with a dynamic visual expert and a dynamic language expert, respectively. These experts are derived from HyperNetworks, which generate adaptive parameter shifts through visual and language guidance, enabling dynamic projector and LLM modeling in two-stage training. Our experiments demonstrate that our solution significantly surpasses LLaVA on existing MLLM benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench.\footnote{Our project is available at https://github.com/DCDmllm/HyperLLaVA.}
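To make the idea of guidance-conditioned parameter shifts concrete, the following is a minimal sketch and not the HyperLLaVA implementation. It assumes a single linear projector and a hypothetical hypernetwork that maps a guidance vector to a rank-1 weight shift; the class name, the rank-1 form, and all dimensions are illustrative assumptions that the abstract does not specify.

```python
import random


def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix standing in for learned weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class HyperProjector:
    """Static linear projector plus a guidance-conditioned shift (hypothetical)."""

    def __init__(self, d_in, d_out, d_guide, seed=0):
        rng = random.Random(seed)
        self.W = rand_matrix(d_out, d_in, rng)     # static, shared projector weights
        self.A = rand_matrix(d_out, d_guide, rng)  # hypernetwork head producing u
        self.B = rand_matrix(d_in, d_guide, rng)   # hypernetwork head producing v

    def shift(self, guidance):
        """Assumed hypernetwork: map guidance to a rank-1 weight shift u v^T."""
        u = matvec(self.A, guidance)
        v = matvec(self.B, guidance)
        return [[ui * vj for vj in v] for ui in u]

    def forward(self, x, guidance=None):
        """Project x with the static weights, or with guidance-shifted weights."""
        if guidance is None:
            return matvec(self.W, x)
        dW = self.shift(guidance)
        W_dyn = [[w + d for w, d in zip(w_row, d_row)]
                 for w_row, d_row in zip(self.W, dW)]
        return matvec(W_dyn, x)


proj = HyperProjector(d_in=4, d_out=3, d_guide=2)
x = [1.0, 0.5, -0.5, 2.0]
static_out = proj.forward(x)                  # same weights for every input
out_a = proj.forward(x, guidance=[1.0, 0.0])  # weights adapted to guidance a
out_b = proj.forward(x, guidance=[0.0, 1.0])  # weights adapted to guidance b
```

In this sketch the same input is projected differently under different guidance vectors, whereas a static mapper applies one weight matrix to everything. A zero guidance vector yields a zero shift and recovers the static projector.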