Large Vision Language Models (LVLMs) have achieved remarkable progress, yet they often suffer from language bias, producing answers without relying on visual evidence. While prior work attempts to mitigate this issue through decoding strategies, architectural modifications, or curated instruction data, it typically lacks a quantitative measure of how much individual training samples or tokens actually benefit from the image. In this work, we introduce Visual Information Gain (VIG), a perplexity-based metric that quantifies the reduction in prediction uncertainty provided by the visual input. VIG enables fine-grained analysis at both the sample and token levels, effectively highlighting visually grounded elements such as colors, spatial relations, and attributes. Leveraging this metric, we propose a VIG-guided selective training scheme that prioritizes high-VIG samples and tokens. By training exclusively on visually informative samples and tokens, this approach improves visual grounding and mitigates language bias, achieving superior performance with significantly reduced supervision.
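The abstract does not spell out the formula, but one natural instantiation of a "perplexity-based" gain is the per-token difference in log-likelihood with and without the image (equivalently, the log of the text-only-to-with-image perplexity ratio). The sketch below illustrates that reading only; the HuggingFace-style model interface, the `pixel_values=None` text-only pass, and the `answer_mask` argument are assumptions for illustration, not the paper's implementation, and real LVLMs may additionally require image placeholder tokens to be stripped for the text-only pass.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def answer_log_likelihoods(model, input_ids, answer_mask, pixel_values=None):
    # Per-token log-likelihood of each answer token under the model,
    # optionally conditioned on the image via `pixel_values` (hypothetical
    # HuggingFace-style interface; None = text-only condition).
    logits = model(input_ids=input_ids, pixel_values=pixel_values).logits
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)    # position t predicts token t+1
    token_ll = log_probs.gather(-1, input_ids[:, 1:, None]).squeeze(-1)
    return token_ll * answer_mask[:, 1:].float()         # zero out prompt tokens

@torch.no_grad()
def visual_information_gain(model, input_ids, answer_mask, pixel_values):
    # Token-level VIG: gain in log-likelihood (i.e. reduction in log-perplexity)
    # when the image is provided versus withheld. Positive values indicate the
    # image reduced the model's uncertainty about that token.
    ll_with_image = answer_log_likelihoods(model, input_ids, answer_mask, pixel_values)
    ll_text_only = answer_log_likelihoods(model, input_ids, answer_mask, None)
    token_vig = ll_with_image - ll_text_only
    # Sample-level VIG: mean over answer tokens.
    sample_vig = token_vig.sum(-1) / answer_mask[:, 1:].float().sum(-1)
    return token_vig, sample_vig
```

Under this reading, high `sample_vig` values would flag training examples worth prioritizing, while high `token_vig` values flag individual answer tokens (e.g. color or spatial words) to upweight during selective training.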