In the current paradigm of image captioning, deep learning models are trained to generate text from image embeddings of latent features. We challenge the assumption that these latent features ought to be high-dimensional vectors that require model fine-tuning to handle. Here we propose Label Boosted Retrieval Augmented Generation (LaB-RAG), a text-based approach to image captioning that leverages image descriptors in the form of categorical labels to boost standard retrieval augmented generation (RAG) with pretrained large language models (LLMs). We study our method in the context of radiology report generation (RRG), where the task is to generate a clinician's report detailing their observations from a set of radiological images, such as X-rays. We argue that simple linear classifiers over extracted image embeddings can effectively transform X-rays into text space as radiology-specific labels. In combination with standard RAG, we show that these derived text labels can be used with general-domain LLMs to generate radiology reports. Without ever training our generative language model or image feature encoder models, and without ever directly "showing" the LLM an X-ray, we demonstrate that LaB-RAG achieves better results across natural language and radiology language metrics than other retrieval-based RRG methods, while attaining results competitive with fine-tuned vision-language RRG models. We further present results of experiments with various components of LaB-RAG to better understand our method. Finally, we critique the use of a popular RRG metric, arguing that it is possible to artificially inflate its results without true data leakage.
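To make the pipeline described above concrete, the following is a minimal sketch of one possible reading of it, not the authors' implementation. It assumes one linear classifier per label over a precomputed image embedding, a retrieval step that ranks stored examples first by label overlap and then by embedding similarity, and a plain-text prompt built from the retrieved reports and the predicted labels. The label set, the names `Example`, `predict_labels`, `retrieve`, and `build_prompt`, the ranking rule, and the prompt wording are all hypothetical. The call to the LLM itself is omitted.

```python
from __future__ import annotations

import math
from dataclasses import dataclass

# Hypothetical label set; the abstract does not name the labels used.
LABELS = ["Cardiomegaly", "Edema", "Pleural Effusion"]


@dataclass
class Example:
    """A stored study: frozen image embedding, its labels, and its report."""
    embedding: list[float]
    labels: frozenset[str]
    report: str


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def predict_labels(embedding, weights, biases) -> frozenset[str]:
    """One linear classifier per label: positive when w.x + b > 0."""
    return frozenset(
        label for label in LABELS
        if dot(weights[label], embedding) + biases[label] > 0
    )


def retrieve(query_emb, query_labels, corpus, k=2) -> list[Example]:
    """Rank by label overlap (Jaccard) first, embedding similarity second.

    This ranking rule is an assumption made for illustration.
    """
    def score(ex: Example):
        union = query_labels | ex.labels
        overlap = len(query_labels & ex.labels) / len(union) if union else 1.0
        return (overlap, cosine(query_emb, ex.embedding))

    return sorted(corpus, key=score, reverse=True)[:k]


def build_prompt(query_labels, retrieved) -> str:
    """Assemble a text-only prompt; the LLM never sees the image."""
    lines = ["Write a radiology report consistent with the predicted labels.", ""]
    for i, ex in enumerate(retrieved, 1):
        lines.append(f"Example {i} labels: {', '.join(sorted(ex.labels)) or 'none'}")
        lines.append(f"Example {i} report: {ex.report}")
    lines.append(f"Predicted labels: {', '.join(sorted(query_labels)) or 'none'}")
    lines.append("Report:")
    return "\n".join(lines)


# Toy data: 2-dimensional embeddings and hand-picked classifier weights.
weights = {"Cardiomegaly": [1.0, 0.0], "Edema": [0.0, 1.0],
           "Pleural Effusion": [-1.0, -1.0]}
biases = {label: -0.5 for label in LABELS}
corpus = [
    Example([1.0, 0.0], frozenset({"Cardiomegaly"}), "The heart is enlarged."),
    Example([0.8, 0.2], frozenset({"Edema"}), "Mild pulmonary edema."),
    Example([0.0, 1.0], frozenset(), "No acute findings."),
]

query = [0.9, 0.1]
labels = predict_labels(query, weights, biases)
prompt = build_prompt(labels, retrieve(query, labels, corpus))
print(prompt)
```

In this sketch only the small linear classifiers would need fitting; the image encoder that produces the embeddings and the language model that consumes the prompt are both treated as frozen, matching the abstract's description.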