Image captioning based on large language models (LLMs) can describe objects not explicitly observed in the training data; yet novel objects occur frequently, so open-world comprehension requires keeping object knowledge up to date. Instead of relying on large amounts of data and scaling up network parameters, we introduce a highly effective retrieval-augmented image captioning method that prompts LLMs with object names retrieved from an External Visual-name memory (EVCap). We build an ever-changing object knowledge memory from objects' visuals and names, enabling us to (i) update the memory at minimal cost and (ii) effortlessly augment LLMs with retrieved object names using a lightweight and fast-to-train model. Our model, trained only on the COCO dataset, can be adapted to out-of-domain data without additional fine-tuning or retraining. Comprehensive experiments on various benchmarks and on synthetic commonsense-violating data demonstrate that EVCap, with only 3.97M trainable parameters, outperforms other methods of comparable model size. Notably, it achieves performance competitive with specialist state-of-the-art models that have far more parameters. Our code is available at https://jiaxuan-li.github.io/EVCap.
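The retrieval idea summarized above can be illustrated with a minimal sketch. It is not the released EVCap implementation: the class `VisualNameMemory`, the function `build_prompt`, the prompt wording, and the toy 3-dimensional vectors standing in for image features are all assumptions made for illustration. The sketch only shows how a memory of (visual embedding, object name) pairs could be queried by cosine similarity, and how adding a new entry extends the known object vocabulary without any retraining.

```python
import numpy as np


class VisualNameMemory:
    """Hypothetical external memory of (visual embedding, object name) pairs."""

    def __init__(self, dim):
        self.keys = np.empty((0, dim), dtype=np.float32)  # unit-norm embeddings
        self.names = []                                    # one name per key row

    def add(self, embedding, name):
        # Updating the memory is just appending a row; no model is retrained.
        v = np.asarray(embedding, dtype=np.float32)
        v = v / np.linalg.norm(v)
        self.keys = np.vstack([self.keys, v[None, :]])
        self.names.append(name)

    def retrieve(self, query, k=2):
        # Cosine similarity between the query feature and every stored key.
        q = np.asarray(query, dtype=np.float32)
        q = q / np.linalg.norm(q)
        sims = self.keys @ q
        top = np.argsort(-sims)[:k]
        # Keep each retrieved name once, in order of decreasing similarity.
        result = []
        for i in top:
            if self.names[i] not in result:
                result.append(self.names[i])
        return result


def build_prompt(names):
    # Assumed prompt template; the actual wording used by EVCap may differ.
    return "Objects possibly in the image: " + ", ".join(names) + ". Describe the image."


memory = VisualNameMemory(dim=3)
memory.add([1.0, 0.0, 0.0], "cat")
memory.add([0.0, 1.0, 0.0], "dog")

# Toy vector standing in for the feature of an unseen image.
image_feature = [0.9, 0.1, 0.0]
print(build_prompt(memory.retrieve(image_feature, k=2)))

# A novel object is supported by adding one entry to the memory.
memory.add([0.0, 0.0, 1.0], "axolotl")
print(memory.retrieve([0.0, 0.1, 0.9], k=1))
```

In a real system the embeddings would come from an image encoder and the prompt would be passed to an LLM; both are left out here to keep the sketch self-contained.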