Radiology report generation (RRG) is a challenging task, as it requires a thorough understanding of medical images, integration of multiple temporal inputs, and accurate report generation. Effective interpretation of medical images, such as chest X-rays (CXRs), demands sophisticated visual-language reasoning to map visual findings to structured reports. Recent studies have shown that multimodal large language models (MLLMs) can acquire multimodal capabilities by aligning with pre-trained vision encoders. However, current approaches predominantly focus on single-image analysis or utilise rule-based symbolic processing to handle multiple images, thereby overlooking the essential temporal information derived from comparing current images with prior ones. To overcome this critical limitation, we introduce Libra, a temporal-aware MLLM tailored for CXR report generation using temporal images. Libra integrates a radiology-specific image encoder with an MLLM and utilises a novel Temporal Alignment Connector to capture and synthesise temporal information across images from different time points with unprecedented precision. Extensive experiments show that Libra achieves new state-of-the-art performance among MLLMs of the same parameter scale for RRG tasks on the MIMIC-CXR dataset. Specifically, Libra improves the RadCliQ metric by 12.9% and makes substantial gains across all lexical metrics compared to previous models.