In recent years, Multimodal Large Language Models (MLLMs) have achieved notable advancements, demonstrating the feasibility of developing an intelligent biomedical assistant. However, current biomedical MLLMs predominantly focus on image-level understanding and restrict interactions to textual commands, thus limiting their capabilities and their flexibility in use. In this paper, we introduce a novel end-to-end multimodal large language model for the biomedical domain, named MedPLIB, which possesses pixel-level understanding. Notably, it supports visual question answering (VQA), arbitrary pixel-level prompts (points, bounding boxes, and free-form shapes), and pixel-level grounding. We propose a novel Mixture-of-Experts (MoE) multi-stage training strategy, which splits training into separate phases for a visual-language expert model and a pixel-grounding expert model, followed by fine-tuning with MoE. This strategy effectively coordinates multitask learning while keeping the computational cost at inference equivalent to that of a single expert model. To advance the research of biomedical MLLMs, we introduce the Medical Complex Vision Question Answering Dataset (MeCoVQA), which covers 8 imaging modalities for complex medical question answering and image region understanding. Experimental results indicate that MedPLIB achieves state-of-the-art results across multiple medical visual-language tasks. More importantly, in zero-shot evaluations on the pixel grounding task, MedPLIB outperforms the best small and large models by margins of 19.7 and 15.6 points, respectively, on the mDice metric. The code, data, and model checkpoints will be made publicly available at https://github.com/ShawnHuang497/MedPLIB.
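To make the routing idea concrete, the following PyTorch sketch illustrates a top-1 MoE layer built from two separately pretrained experts with a router tuned afterward, which is the general mechanism by which per-token inference cost stays at a single expert. All names, dimensions, and the freezing scheme (ExpertFFN, Top1MoE, dim=64, hidden=256) are our illustrative assumptions, not MedPLIB's actual implementation.

```python
import torch
import torch.nn as nn


class ExpertFFN(nn.Module):
    """A plain feed-forward expert block (dimensions are illustrative)."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class Top1MoE(nn.Module):
    """Top-1 MoE layer: each token is dispatched to exactly one expert,
    so per-token inference FLOPs match a single-expert model."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.vl_expert = ExpertFFN(dim, hidden)  # stage 1: vision-language expert
        self.px_expert = ExpertFFN(dim, hidden)  # stage 2: pixel-grounding expert
        self.router = nn.Linear(dim, 2)          # gate learned in the MoE stage

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        flat = x.reshape(-1, x.size(-1))           # (N, D) token stream
        probs = self.router(flat).softmax(dim=-1)  # (N, 2) gate probabilities
        idx = probs.argmax(dim=-1)                 # (N,) hard top-1 routing
        out = torch.empty_like(flat)
        for i, expert in enumerate((self.vl_expert, self.px_expert)):
            sel = idx == i
            if sel.any():
                # Only the selected expert runs on its tokens; scaling by the
                # gate probability keeps the router trainable end to end.
                out[sel] = expert(flat[sel]) * probs[sel, i : i + 1]
        return out.reshape_as(x)


# Final-stage-style fine-tuning setup (an assumption for illustration):
# freeze both pretrained experts and learn the router on mixed multitask data.
moe = Top1MoE(dim=64, hidden=256)
for p in (*moe.vl_expert.parameters(), *moe.px_expert.parameters()):
    p.requires_grad_(False)
y = moe(torch.randn(2, 16, 64))  # -> shape (2, 16, 64)
```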