Large Language Models (LLMs) have become markedly more proficient at comprehending complex healthcare and biomedical topics. However, few models support languages other than English, and few can interpret multi-modal input; both capabilities are crucial for global healthcare accessibility. In response, this study introduces Qilin-Med-VL, the first Chinese large vision-language model designed to integrate the analysis of textual and visual data. Qilin-Med-VL combines a pre-trained Vision Transformer (ViT) with a foundational LLM and undergoes a two-stage curriculum training process comprising feature alignment and instruction tuning. This method enhances the model's ability to generate medical captions and answer complex medical queries. We also release ChiMed-VL, a dataset of more than 1M image-text pairs, curated to support detailed and comprehensive interpretation of medical data across various image types.
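The two-stage curriculum described above can be illustrated with a minimal, dependency-free sketch. The abstract names the ViT, the LLM, and the two stages, but it does not say which parameters are updated in each stage. The `projector` component and the freezing schedule below are therefore assumptions, modelled on a common recipe for vision-language models, and all class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """One block of the model; `trainable` stands in for gradient updates."""
    name: str
    trainable: bool = False


@dataclass
class VisionLanguageModel:
    # Pre-trained image encoder (ViT) and foundational LLM, as named in the abstract.
    vision_encoder: Component = field(
        default_factory=lambda: Component("vision_encoder"))
    # ASSUMPTION: an adapter that maps ViT features into the LLM's input space.
    # The abstract does not describe how the two models are connected.
    projector: Component = field(
        default_factory=lambda: Component("projector"))
    llm: Component = field(default_factory=lambda: Component("llm"))

    def components(self):
        return [self.vision_encoder, self.projector, self.llm]


# ASSUMPTION: which components are updated in each stage. The abstract only
# names the stages; this schedule is illustrative, not the authors' recipe.
STAGE_TRAINABLE = {
    "feature_alignment": {"projector"},
    "instruction_tuning": {"projector", "llm"},
}


def configure_stage(model, stage):
    """Freeze or unfreeze components for `stage`; return the trainable names."""
    trainable = STAGE_TRAINABLE[stage]
    for component in model.components():
        component.trainable = component.name in trainable
    return [c.name for c in model.components() if c.trainable]


model = VisionLanguageModel()
for stage in ("feature_alignment", "instruction_tuning"):
    print(stage, "->", configure_stage(model, stage))
```

Under these assumptions, the first stage only teaches the adapter to align image features with the language model on image-caption pairs, while the second stage also updates the LLM on instruction-style question-answer data. The vision encoder stays frozen in both stages of this sketch.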