Vision-language models have emerged as a powerful tool for previously challenging multi-modal classification problems in the medical domain. This development has spurred the exploration of automated image description generation for multi-modal clinical scans, particularly for radiology report generation. Existing research has focused on clinical descriptions for specific modalities or body regions, leaving a gap for a model that provides whole-body, multi-modal descriptions. In this paper, we address this gap by automating the generation of the standardized body station(s) and the list of organ(s) across the whole body in multi-modal MR and CT radiological images. Leveraging the versatility of Contrastive Language-Image Pre-training (CLIP), we refine and augment the existing approach through multiple experiments, including baseline model fine-tuning, adding station(s) as a superset for better correlation between organs, and image and language augmentations. Our proposed approach demonstrates a 47.6% performance improvement over the baseline PubMedCLIP.
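The sketch below illustrates the general idea behind the fine-tuning step described above: pairing each scan with a text prompt in which the body station acts as a superset label grouping its organ list, then optimizing CLIP's symmetric image-text contrastive loss. It is a minimal illustration, not the authors' code; the checkpoint name, prompt template, and toy batch are all assumptions (a PubMedCLIP checkpoint would stand in for the generic CLIP weights used here).

```python
# Minimal sketch of CLIP fine-tuning with station + organ prompts.
# Assumptions: generic CLIP weights as a stand-in for PubMedCLIP,
# a hypothetical prompt template, and placeholder images.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-base-patch32"  # stand-in; swap in a PubMedCLIP checkpoint
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

def build_prompt(modality: str, station: str, organs: list[str]) -> str:
    # The station is prepended as a superset label for its organ list.
    return f"{modality} scan, station: {station}, organs: {', '.join(organs)}"

# Hypothetical mini-batch of (image, prompt) pairs; blank images stand in for scans.
images = [Image.new("RGB", (224, 224)) for _ in range(2)]
texts = [
    build_prompt("MR", "thorax", ["heart", "lungs"]),
    build_prompt("CT", "abdomen", ["liver", "kidneys", "spleen"]),
]

inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

model.train()
# return_loss=True makes CLIPModel compute the symmetric image-text
# contrastive loss over the in-batch pairs.
outputs = model(**inputs, return_loss=True)
outputs.loss.backward()
optimizer.step()
print(f"contrastive loss: {outputs.loss.item():.4f}")
```

In this framing, image and language augmentations would perturb `images` and vary the prompt wording per step, while the matched in-batch pairs supply the positives for the contrastive objective.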