Image-to-text generation aims to describe images using natural language. Recently, zero-shot image captioning based on pre-trained vision-language models (VLMs) and large language models (LLMs) has made significant progress. However, we observe and empirically demonstrate that these methods are susceptible to the modality bias induced by LLMs and tend to generate descriptions containing objects (entities) that do not actually exist in the image but frequently appear during training (i.e., object hallucination). In this paper, we propose ViECap, a transferable decoding model that leverages entity-aware decoding to generate descriptions in both seen and unseen scenarios. ViECap incorporates entity-aware hard prompts to guide the LLM's attention toward the visual entities present in the image, enabling coherent caption generation across diverse scenes. With entity-aware hard prompts, ViECap maintains its performance when transferring from in-domain to out-of-domain scenarios. Extensive experiments demonstrate that ViECap sets a new state of the art in cross-domain (transferable) captioning and performs competitively in in-domain captioning compared with previous VLM-based zero-shot methods. Our code is available at: https://github.com/FeiElysia/ViECap