The adoption of large language models (LLMs) to assist clinicians has attracted remarkable attention. Existing works mainly evaluate LLMs on closed-ended question-answering (QA) tasks that provide answer options. However, many clinical decisions involve answering open-ended questions without pre-set options. To better understand LLMs in the clinic, we construct a benchmark, ClinicBench. We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks. Furthermore, we construct six novel datasets and complex clinical tasks that are close to real-world practice, i.e., referral QA, treatment recommendation, hospitalization (long-document) summarization, patient education, pharmacology QA, and drug interaction for emerging drugs. We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings. Finally, we invite medical experts to evaluate the clinical usefulness of LLMs.