The emergence of large multimodal models (LMMs) has unlocked remarkable potential in AI, particularly in pathology. However, the lack of specialized, high-quality benchmarks has impeded their development and precise evaluation. To address this, we introduce PathMMU, the largest and highest-quality expert-validated pathology benchmark for LMMs. It comprises 33,428 multimodal multiple-choice questions and 24,067 images from various sources, with each question accompanied by an explanation for the correct answer. The construction of PathMMU leverages GPT-4V's advanced capabilities, using over 30,000 image-caption pairs to enrich captions and generate corresponding Q&As in a cascading process. To maximize PathMMU's authority, we invite seven pathologists to scrutinize each question in the validation and test sets under strict standards, while also establishing an expert-level performance benchmark for PathMMU. We conduct extensive evaluations, including zero-shot assessments of 14 open-source and 4 closed-source LMMs and tests of their robustness to image corruption. We also fine-tune representative LMMs to assess their adaptability to PathMMU. The empirical findings indicate that advanced LMMs struggle with the challenging PathMMU benchmark: the top-performing LMM, GPT-4V, achieves a zero-shot performance of only 49.8%, significantly lower than the 71.8% demonstrated by human pathologists. After fine-tuning, significantly smaller open-source LMMs can outperform GPT-4V but still fall short of the expertise shown by pathologists. We hope that PathMMU will offer valuable insights and foster the development of more specialized, next-generation LMMs for pathology.