We present a radiology-specific multimodal model for the task of generating radiological reports from chest X-rays (CXRs). Our work builds on the idea that large language models can be equipped with multimodal capabilities through alignment with pre-trained vision encoders. On natural images, this has been shown to allow multimodal models to gain image understanding and description capabilities. Our proposed model (MAIRA-1) leverages a CXR-specific image encoder in conjunction with a fine-tuned large language model based on Vicuna-7B, and text-based data augmentation, to produce reports with state-of-the-art quality. In particular, MAIRA-1 significantly improves on the radiologist-aligned RadCliQ metric and across all lexical metrics considered. Manual review of model outputs demonstrates promising fluency and accuracy of generated reports while uncovering failure modes not captured by existing evaluation practices. More information and resources can be found on the project website: https://aka.ms/maira.