As large language models (LLMs) are increasingly applied in the medical domain, evaluating their performance on benchmark datasets has become crucial. This paper presents a comprehensive survey of the benchmark datasets used for medical LLM tasks. The datasets span text, image, and multimodal benchmarks and cover different aspects of medical knowledge, including electronic health records (EHRs), doctor-patient dialogues, medical question answering, and medical image captioning. The survey categorizes the datasets by modality and discusses their significance, data structure, and impact on the development of LLMs for clinical tasks such as diagnosis, report generation, and predictive decision support. Key benchmarks include MIMIC-III, MIMIC-IV, BioASQ, PubMedQA, and CheXpert, which have facilitated advances in tasks such as medical report generation, clinical summarization, and synthetic data generation. The paper summarizes the challenges and opportunities in leveraging these benchmarks to advance multimodal medical intelligence, emphasizing the need for datasets with greater language diversity, structured omics data, and innovative approaches to synthesis. This work also provides a foundation for future research on applying LLMs in medicine, contributing to the evolving field of medical artificial intelligence.