Medical report generation requires automatically producing coherent and precise descriptions of medical images. However, the scarcity of labelled medical image-report pairs makes it difficult to train large-scale neural networks that can exploit recent advances in artificial intelligence, such as large language models. This study builds on BLIP-2, a state-of-the-art vision-language pre-training and fine-tuning approach, to adapt general-purpose large-scale foundation models to the medical domain. By integrating adapter tuning with a medical knowledge enhancement loss, our model significantly improves the accuracy and coherence of generated reports. On the ImageCLEFmedical 2023 dataset, our model achieves the best average results among several state-of-the-art methods. Significant improvements in ROUGE and CIDEr underscore the method's efficacy and indicate that vision-language foundation models can be adapted rapidly to the medical domain despite data scarcity.
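The abstract reports gains in ROUGE without stating which variant is used. As an illustration only, the following minimal sketch computes a ROUGE-L F1 score from the longest common subsequence (LCS) of a generated caption and a reference. The function names, the lowercased whitespace tokenization, and the unweighted F1 are assumptions made for this example, not the evaluation protocol of the study or of ImageCLEFmedical 2023.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between two strings.

    Assumption: tokens are lowercased whitespace-separated words;
    official scorers may tokenize and weight differently.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    generated = "chest x-ray showing left pleural effusion"
    reference = "chest x-ray shows a left pleural effusion"
    print(round(rouge_l_f1(generated, reference), 4))
```

In this example the two captions share a five-token subsequence, so precision is 5/6 and recall is 5/7. Reported results should come from an established scorer so that numbers are comparable across methods.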