Vision capabilities in vision large language models (VLLMs) have consistently lagged behind their linguistic capabilities. In particular, numerous benchmark studies have demonstrated that VLLMs struggle on tasks that require fine-grained visual information or spatial reasoning. However, we do not yet understand exactly why VLLMs struggle with these tasks so much more than with others. Some works have proposed visual redundancy as an explanation: high-level visual information is spread uniformly across many tokens, while specific, fine-grained visual information is discarded. In this work, we investigate this premise in greater detail, seeking to understand exactly how the model processes various types of visual information and which types are discarded. To do so, we introduce a simple synthetic benchmark dataset specifically constructed to probe various visual features, along with a set of metrics for measuring visual redundancy, allowing us to better understand the nuances of their relationship. We then fine-tune VLLMs on a number of complex visual tasks to understand how redundancy and compression change with the complexity of the training data. We find a connection between task complexity and visual compression, implying that a sufficient proportion of high-complexity visual data is crucial for changing how VLLMs distribute their visual representations and, consequently, for improving their performance on complex visual tasks. We hope this work provides valuable insights for training the next generation of VLLMs.