Recent advances in Multi-modal Large Language Models have demonstrated that high-resolution image input is crucial for model capabilities, especially for fine-grained tasks. However, high-resolution images lead to a quadratic increase in the number of visual tokens fed into LLMs, resulting in significant computational costs. Current works develop visual token compression methods to improve efficiency, often at the expense of performance. We argue that removing visual redundancy can improve both efficiency and performance simultaneously. We build a coarse-to-fine visual token compression method, with a vision-guided sampler that compresses redundant regions of low information density and a text-guided sampler that selects visual tokens strongly correlated with the user instructions. With these two modules, the proposed FocusLLaVA achieves improvements in both efficiency and performance. We validate the effectiveness of our approach on a wide range of evaluation datasets.
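For concreteness, the sketch below illustrates the coarse-to-fine idea described above under simplified assumptions; it is not FocusLLaVA's actual implementation. The "vision-guided" stage here merges low-variance 2x2 token windows as a stand-in for compressing regions with low information density, and the "text-guided" stage keeps the visual tokens most similar to a pooled instruction embedding. All function names, thresholds, and shapes are illustrative.

```python
# Minimal sketch (assumptions, not the paper's exact method) of a two-stage,
# coarse-to-fine visual token compression pass.
import numpy as np

def vision_guided_sample(tokens: np.ndarray, grid: int, var_thresh: float) -> np.ndarray:
    """tokens: (grid*grid, d) visual tokens arranged on a square grid.
    Merge each 2x2 window into one token when its feature variance is low."""
    d = tokens.shape[1]
    feat = tokens.reshape(grid, grid, d)
    kept = []
    for i in range(0, grid, 2):
        for j in range(0, grid, 2):
            window = feat[i:i+2, j:j+2].reshape(-1, d)
            if window.var() < var_thresh:        # redundant region: average-pool to 1 token
                kept.append(window.mean(axis=0))
            else:                                # informative region: keep all 4 tokens
                kept.extend(window)
    return np.stack(kept)

def text_guided_sample(tokens: np.ndarray, text_emb: np.ndarray, top_k: int) -> np.ndarray:
    """Keep the top_k visual tokens with highest cosine similarity to the text embedding."""
    t = text_emb / np.linalg.norm(text_emb)
    v = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    scores = v @ t
    idx = np.argsort(-scores)[:top_k]
    return tokens[np.sort(idx)]                  # preserve the original token order

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual = rng.normal(size=(24 * 24, 64))      # e.g. 576 tokens from a vision encoder
    text = rng.normal(size=64)                   # pooled instruction embedding (assumed)
    coarse = vision_guided_sample(visual, grid=24, var_thresh=1.05)
    fine = text_guided_sample(coarse, text, top_k=144)
    print(visual.shape, "->", coarse.shape, "->", fine.shape)
```

The point of the sketch is the ordering: a cheap, text-agnostic coarse pass first removes spatial redundancy, after which a text-conditioned pass keeps only the tokens relevant to the instruction, so the LLM receives far fewer visual tokens.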