Visual instruction tuning adapts pre-trained Multimodal Large Language Models (MLLMs) to follow human instructions in real-world applications. However, the rapid growth of visual instruction datasets introduces substantial redundancy, driving up computational costs. Existing data selection methods aim to prune this redundancy, but they predominantly rely on computationally demanding techniques such as proxy-based inference or training-based metrics. Consequently, the cost of the selection process itself often exacerbates the very efficiency bottleneck it is intended to resolve, posing a significant challenge to the scalable and effective tuning of MLLMs. To address this challenge, we first identify a critical yet previously overlooked factor: the anisotropy inherent in visual feature distributions. We find that this anisotropy induces a \textit{Global Semantic Drift}, and that overlooking this phenomenon is a key limitation of current data selection methods. Motivated by this insight, we devise \textbf{PRISM}, the first training-free framework for efficient visual instruction selection. PRISM models intrinsic visual semantics via implicit re-centering, surgically removing the corrupting influence of global background features. Empirically, PRISM reduces the end-to-end time for data selection and model tuning to just 30\% of that of conventional pipelines. Remarkably, it achieves this efficiency while simultaneously improving performance: models tuned on its selected subset surpass models fine-tuned on the full dataset across eight multimodal and three language understanding benchmarks, culminating in a 101.7\% relative improvement over the baseline. The code is available at \href{https://github.com/bibisbar/PRISM}{this repository}.
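To make the idea of implicit re-centering concrete, the following is a minimal sketch, assuming only that re-centering amounts to subtracting the global mean of the visual features before comparing them; the greedy diversity criterion, function names, and parameters here are illustrative assumptions, not PRISM's actual algorithm.

\begin{verbatim}
import numpy as np

def recenter(features):
    # Remove the shared global component that dominates anisotropic
    # visual feature spaces, then re-normalize so cosine similarity
    # is well defined on the centered vectors.
    centered = features - features.mean(axis=0, keepdims=True)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    return centered / np.clip(norms, 1e-12, None)

def greedy_select(features, k, seed=0):
    # Hypothetical farthest-point-style selection on re-centered
    # features: repeatedly add the sample least similar to the
    # samples chosen so far.
    f = recenter(features)
    chosen = [seed]
    max_sim = f @ f[seed]  # similarity of each sample to the chosen set
    for _ in range(k - 1):
        nxt = int(np.argmin(max_sim))
        chosen.append(nxt)
        max_sim = np.maximum(max_sim, f @ f[nxt])
    return np.asarray(chosen)
\end{verbatim}

Without re-centering, a shared mean direction inflates all pairwise cosine similarities in anisotropic feature spaces, masking per-sample differences; subtracting it lets the selection criterion respond to intrinsic semantic variation instead.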