The scarcity of high-quality multimodal biomedical data limits the ability to effectively fine-tune pretrained Large Language Models (LLMs) for specialized biomedical tasks. To address this challenge, we introduce MINT (Multimodal Integrated kNowledge Transfer), a framework that aligns unimodal large decoder models with domain-specific decision patterns from multimodal biomedical data through preference optimization. While MINT supports different optimization techniques, we primarily implement it with the Odds Ratio Preference Optimization (ORPO) framework as its backbone. This strategy enables the aligned LLMs to perform predictive tasks from text-only or image-only inputs while retaining knowledge learned from multimodal data. MINT leverages an upstream multimodal machine learning (MML) model, trained on high-quality multimodal data, to transfer domain-specific insights to downstream text-only or image-only LLMs. We demonstrate its effectiveness through two key applications: (1) rare genetic disease prediction from text, where MINT uses a multimodal encoder model, trained on facial photos and clinical notes, to generate a preference dataset for aligning a lightweight Llama 3.2-3B-Instruct model. Despite relying on text input alone, the MINT-derived model outperforms models trained with supervised fine-tuning (SFT), retrieval-augmented generation (RAG), or Direct Preference Optimization (DPO), and even outperforms Llama 3.1-405B-Instruct. (2) Tissue type classification from cell nucleus images, where MINT uses a vision-language foundation model as the preference generator; this model encodes knowledge learned from both text and histopathological images and is used to align downstream image-only models. The resulting MINT-derived model significantly improves the performance of Llama 3.2-Vision-11B-Instruct on tissue type classification. In summary, MINT provides an effective strategy for aligning unimodal LLMs with high-quality multimodal expertise through preference optimization.
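The core mechanics described above can be illustrated with a minimal sketch: the upstream multimodal teacher scores candidate answers to build (prompt, chosen, rejected) preference pairs, and the downstream model is then aligned with ORPO, whose relative-ratio term penalizes the policy when the odds of the rejected answer approach those of the chosen one. The function and parameter names (`build_preference_pair`, `teacher_score`) are illustrative assumptions, not the paper's API; the odds-ratio term follows the standard ORPO formulation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def log_odds(p):
    # log odds of a sequence under the policy: log(p / (1 - p))
    return math.log(p) - math.log(1.0 - p)

def orpo_odds_ratio_loss(p_chosen, p_rejected):
    # ORPO relative-ratio term:
    #   L_OR = -log sigmoid(log odds(y_chosen) - log odds(y_rejected))
    # (added to the usual SFT negative log-likelihood, weighted by lambda)
    return -math.log(sigmoid(log_odds(p_chosen) - log_odds(p_rejected)))

def build_preference_pair(prompt, candidates, teacher_score):
    # Hypothetical helper: rank candidate answers by the multimodal
    # teacher's scalar score; the best candidate becomes "chosen",
    # the worst becomes "rejected".
    ranked = sorted(candidates, key=teacher_score, reverse=True)
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}

# Toy usage: three candidate diagnoses scored by an (assumed) teacher.
scores = {"diagnosis A": 0.2, "diagnosis B": 0.9, "diagnosis C": 0.5}
pair = build_preference_pair(
    "Clinical note: ...", list(scores), lambda c: scores[c]
)
```

A well-separated pair (high probability on the chosen answer, low on the rejected one) yields a much smaller odds-ratio loss than an indifferent policy, which is what drives the alignment signal during training.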