Curated datasets for healthcare are often limited because they require annotation by human experts. In this paper, we present MedEval, a multi-level, multi-task, and multi-domain medical benchmark to facilitate the development of language models for healthcare. MedEval is comprehensive: it consists of data from several healthcare systems and spans 35 human body regions across 8 examination modalities. With 22,779 collected sentences and 21,228 reports, we provide expert annotations at multiple levels, which allows fine-grained use of the data and supports a wide range of tasks. Moreover, we systematically evaluate 10 generic and domain-specific language models under zero-shot and fine-tuning settings, from domain-adapted baselines in healthcare to general-purpose state-of-the-art large language models (e.g., ChatGPT). Our evaluations reveal that the effectiveness of the two categories of language models varies across tasks, and they highlight the importance of instruction tuning for few-shot usage of large language models. Our investigation paves the way toward benchmarking language models for healthcare and provides valuable insights into the strengths and limitations of adopting large language models in medical domains, informing their practical applications and future advancements.