Large Vision-Language Models (LVLMs) often omit or misrepresent critical visual content when generating image captions. Minimizing this information loss pushes LVLMs to attend to image details and produce more precise descriptions. However, measuring information loss during modality conversion is inherently difficult because of the modality gap between visual content and text output. In this paper, we argue that the quality of an image caption is positively correlated with the similarity among the images retrieved by a text search that uses the caption as the query. Building on this insight, we propose Cross-modal Identity Mapping (CIM), a reinforcement learning framework that improves image captioning without requiring additional annotations. Specifically, CIM quantifies information loss from two perspectives: Gallery Representation Consistency and Query-gallery Image Relevance. Supervised by these two metrics, the LVLM minimizes information loss and aims to achieve an identity mapping from images to captions. Experimental results show that our method achieves superior image captioning performance, even compared with Supervised Fine-Tuning. In particular, on the COCO-LN500 benchmark, CIM achieves a 20% improvement in relation reasoning with Qwen2.5-VL-7B.
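To make the two perspectives concrete, the following is a minimal sketch of how a retrieval-based reward of this kind could be computed. It is an illustration under stated assumptions, not the formulation used in CIM: the abstract does not define the metrics, so the use of cosine similarity, top-k retrieval, mean pairwise similarity for Gallery Representation Consistency, mean query-to-gallery similarity for Query-gallery Image Relevance, and the `weight` mixing parameter are all hypothetical choices, as are the function names. Embeddings are assumed to come from some shared image-text encoder and are represented here as plain lists of floats.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_top_k(caption_emb, gallery_embs, k):
    """Text-to-image search: the k gallery embeddings most similar to the caption."""
    ranked = sorted(gallery_embs, key=lambda g: cosine(caption_emb, g), reverse=True)
    return ranked[:k]


def gallery_consistency(retrieved):
    """Assumed proxy for Gallery Representation Consistency:
    mean pairwise cosine similarity among the retrieved images."""
    pairs = [
        (i, j)
        for i in range(len(retrieved))
        for j in range(i + 1, len(retrieved))
    ]
    if not pairs:
        return 1.0  # zero or one image is trivially consistent with itself
    return sum(cosine(retrieved[i], retrieved[j]) for i, j in pairs) / len(pairs)


def query_relevance(query_img_emb, retrieved):
    """Assumed proxy for Query-gallery Image Relevance:
    mean cosine similarity between the captioned image and the retrieved images."""
    if not retrieved:
        return 0.0
    return sum(cosine(query_img_emb, g) for g in retrieved) / len(retrieved)


def retrieval_reward(query_img_emb, caption_emb, gallery_embs, k=2, weight=0.5):
    """Hypothetical scalar reward: a weighted mix of the two proxies above."""
    retrieved = retrieve_top_k(caption_emb, gallery_embs, k)
    consistency = gallery_consistency(retrieved)
    relevance = query_relevance(query_img_emb, retrieved)
    return weight * consistency + (1.0 - weight) * relevance


# Toy 2-D example: two clusters of gallery images.
gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
query_image = [1.0, 0.0]
precise_caption = [1.0, 0.0]  # embedding close to the query image
vague_caption = [1.0, 1.0]    # embedding equally close to both clusters

reward_precise = retrieval_reward(query_image, precise_caption, gallery)
reward_vague = retrieval_reward(query_image, vague_caption, gallery)
```

In this toy setup the precise caption retrieves two images from the same cluster as the query image, so both proxies are high, while the vague caption retrieves images from different clusters and scores lower on both. A reward of this shape needs only the image, the generated caption, and an unlabeled gallery, which is consistent with the claim that no additional annotations are required.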