Several medical Multimodal Large Language Models (MLLMs) have been developed to address tasks involving visual images with textual instructions across various medical modalities, achieving impressive results. However, most current medical generalist models are region-agnostic, treating the entire image as a holistic representation, and struggle to identify which specific regions they are focusing on when generating a sentence. To mimic the behavior of doctors, who typically begin by reviewing the entire image before concentrating on specific regions for a thorough evaluation, we aim to enhance the capability of medical MLLMs in understanding anatomical regions within entire medical scans. To this end, we first formulate Region-Centric tasks and construct a large-scale dataset, MedRegInstruct, to incorporate regional information into training. Combining our collected dataset with other medical multimodal corpora for training, we propose a Region-Aware medical MLLM, MedRegA, the first bilingual generalist medical AI system to simultaneously handle image-level and region-level medical vision-language tasks across a broad range of modalities. MedRegA not only enables three region-centric tasks, but also achieves the best performance on visual question answering, report generation, and medical image classification over 8 modalities, showcasing significant versatility. Experiments demonstrate that our model delivers strong performance across various medical vision-language tasks in bilingual settings and can recognize and detect structures in multimodal medical scans, boosting the interpretability and user interactivity of medical MLLMs. Our project page is https://medrega.github.io.