With their growing capabilities, generative large language models (LLMs) are increasingly being investigated for complex medical tasks. However, their effectiveness in real-world clinical applications remains underexplored. To address this, we present CliniBench, the first benchmark that enables a direct comparison of well-studied encoder-based classifiers and generative LLMs for discharge diagnosis prediction from admission notes in the MIMIC-IV dataset. Our extensive study compares 12 generative LLMs and 3 encoder-based classifiers and demonstrates that encoder-based classifiers consistently outperform generative models in diagnosis prediction. We also assess several retrieval augmentation strategies for in-context learning from similar patients and find that they provide notable performance improvements for generative LLMs.