Multimodal Large Language Models (MLLMs) inherit the strong text understanding capabilities of LLMs and extend them to multimodal scenarios, achieving excellent results on general-domain multimodal tasks. In the medical domain, however, the substantial training costs and the need for extensive medical data hinder the development of medical MLLMs. Moreover, because MLLMs produce answers in free-text form, tasks such as visual grounding, which require output in a prescribed format, remain difficult for them. To date, no medical MLLM work has addressed medical visual grounding. For this task, which involves localizing regions in medical images from short text descriptions, we propose Parameter-efficient Fine-tuning of medical multimodal large language models for Medical Visual Grounding (PFMVG). To validate the model, we evaluate it on a public benchmark dataset for medical visual grounding, where it achieves competitive results and significantly outperforms GPT-4V. Our code will be open sourced after peer review.