Multimodal large language models (MLLMs) introduce an emerging paradigm for medical imaging: by interpreting scans through the lens of extensive clinical knowledge, they offer a transformative approach to disease classification. This study presents a critical comparison between two fundamentally different AI architectures, the specialized open-source model MedGemma and the proprietary large multimodal model GPT-4, on the task of diagnosing six diseases. The MedGemma-4b-it model, fine-tuned with Low-Rank Adaptation (LoRA), demonstrated superior diagnostic capability, achieving a mean test accuracy of 80.37% versus 69.58% for the untuned GPT-4. MedGemma also exhibited notably higher sensitivity in high-stakes clinical tasks such as cancer and pneumonia detection. Quantitative analysis via confusion matrices and classification reports provides comprehensive insight into model performance across all six disease categories. These results emphasize that domain-specific fine-tuning is essential for minimizing hallucinations in clinical deployment, positioning MedGemma as a capable tool for complex, evidence-based medical reasoning.
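As a minimal sketch of the LoRA setup the abstract describes, the fragment below shows how a low-rank adapter is typically attached to an instruction-tuned checkpoint with the Hugging Face `peft` library. The rank, scaling, dropout, and `target_modules` values here are illustrative assumptions, not the study's reported configuration.

```python
# Minimal LoRA fine-tuning sketch, assuming the Hugging Face `transformers`
# and `peft` libraries; hyperparameters below are assumptions for illustration.
from transformers import AutoModelForImageTextToText
from peft import LoraConfig, get_peft_model

# Load the base multimodal checkpoint (MedGemma-4b-it; requires model access).
model = AutoModelForImageTextToText.from_pretrained("google/medgemma-4b-it")

lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update (assumed)
    lora_alpha=32,                         # scaling factor for the update (assumed)
    lora_dropout=0.05,                     # dropout on adapter layers (assumed)
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt (assumed)
    task_type="CAUSAL_LM",
)

# Wrap the base model so only the small adapter matrices are trainable.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Freezing the base weights and training only the adapters is what makes LoRA practical for a 4B-parameter model on modest hardware, which is the usual motivation for this choice.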
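The confusion matrices and per-class classification reports cited above can be produced with scikit-learn, as in the sketch below. The toy labels and the two named classes are placeholders; the study's actual six disease categories and test predictions are not given here.

```python
# Evaluation sketch, assuming scikit-learn; labels below are toy placeholders.
from sklearn.metrics import classification_report, confusion_matrix

y_true = ["cancer", "pneumonia", "cancer", "pneumonia"]  # toy ground-truth labels
y_pred = ["cancer", "cancer", "cancer", "pneumonia"]     # toy model predictions

# Row = true class, column = predicted class.
print(confusion_matrix(y_true, y_pred, labels=["cancer", "pneumonia"]))

# Per-class precision, recall (sensitivity), F1, and support.
print(classification_report(y_true, y_pred, digits=4))
```

The recall column of the classification report is the sensitivity figure the abstract highlights for cancer and pneumonia detection.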