The adoption of large language models (LLMs) to assist clinicians has attracted remarkable attention. Existing works mainly adopt closed-ended question-answering (QA) tasks with answer options for evaluation. However, many clinical decisions require answering open-ended questions without pre-set options. To better understand LLMs in the clinic, we construct a benchmark, ClinicBench. We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks. Furthermore, we construct six novel datasets and complex clinical tasks that are close to real-world practice, i.e., referral QA, treatment recommendation, hospitalization (long document) summarization, patient education, pharmacology QA, and drug interaction for emerging drugs. We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings. Finally, we invite medical experts to evaluate the clinical usefulness of LLMs.