Large Language Models (LLMs) have consistently shown strong generalization across a wide range of language tasks. Nonetheless, harnessing their full potential for Radiology Report Generation (R2Gen) remains challenging because of the modality gap between LLMs, which operate on text, and the R2Gen task, which takes images as input. To bridge this gap, we propose R2GenGPT, which aligns visual features with the word embedding space of LLMs using an efficient visual alignment module. This allows the otherwise frozen LLM to integrate and process image information, improving R2Gen performance. R2GenGPT offers the following benefits. First, it attains state-of-the-art (SOTA) performance by training only the lightweight visual alignment module while freezing all parameters of the LLM. Second, it is highly efficient to train, since it updates very few parameters and converges rapidly. With delta tuning, our model trains only 5M parameters (just 0.07% of the total parameter count) to achieve performance close to SOTA levels. Our code is available at https://github.com/wang-zhanyu/R2GenGPT.
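As a quick sanity check on the parameter figures quoted in the abstract, the following minimal sketch inverts the stated ratio to estimate the total parameter count it implies. Only the two input numbers (5M trainable parameters, 0.07% of the total) come from the abstract; the helper names `trainable_fraction` and `implied_total` are hypothetical and introduced here purely for illustration, and the result is an approximation because both inputs are rounded.

```python
def trainable_fraction(trainable: float, total: float) -> float:
    """Return the share of parameters that receive gradient updates."""
    return trainable / total


def implied_total(trainable: float, fraction: float) -> float:
    """Invert the ratio: total parameter count implied by a trainable count and its share."""
    return trainable / fraction


# Figures quoted in the abstract (both are rounded values).
TRAINABLE = 5e6          # 5M trainable parameters
FRACTION = 0.07 / 100    # 0.07% of the total parameter count

total = implied_total(TRAINABLE, FRACTION)
print(f"implied total parameters: {total / 1e9:.2f}B")
print(f"check: {trainable_fraction(TRAINABLE, total) * 100:.2f}% trainable")
```

Under these rounded inputs the implied total is on the order of 7 billion parameters, so almost all of the model's weights stay frozen during training.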