The emergence of large multimodal models (LMMs) has unlocked remarkable potential in AI, particularly in pathology. However, the lack of specialized, high-quality benchmarks has impeded their development and precise evaluation. To address this, we introduce PathMMU, the largest and highest-quality expert-validated pathology benchmark for LMMs. It comprises 33,573 multimodal multiple-choice questions and 21,599 images from various sources, and each question is accompanied by an explanation of the correct answer. The construction of PathMMU capitalizes on the robust capabilities of GPT-4V, using approximately 30,000 gathered image-caption pairs to generate Q\&As. Importantly, to maximize PathMMU's authority, we invite six pathologists to scrutinize each question in PathMMU's validation and test sets under strict standards, while also establishing an expert-level performance baseline for PathMMU. We conduct extensive evaluations, including zero-shot assessments of 14 open-source and three closed-source LMMs and of their robustness to image corruption. We also fine-tune representative LMMs to assess their adaptability to PathMMU. The empirical findings indicate that advanced LMMs struggle with the challenging PathMMU benchmark: the top-performing LMM, GPT-4V, achieves a zero-shot performance of only 51.7\%, significantly lower than the 71.4\% demonstrated by human pathologists. After fine-tuning, even open-source LMMs can surpass GPT-4V with a performance of over 60\%, but they still fall short of the expertise shown by pathologists. We hope that PathMMU will offer valuable insights and foster the development of more specialized, next-generation LMMs for pathology.