Clinical problem-solving requires processing both semantic medical knowledge, such as illness scripts, and numerical medical knowledge of diagnostic tests for evidence-based decision-making. Although large language models (LLMs) show promising results in many aspects of language-based clinical practice, their ability to generate non-language, evidence-based answers to clinical questions is inherently limited by tokenization. We therefore evaluated LLMs' performance on two question types, numeric (correlating findings) and semantic (differentiating entities), examining differences within and between LLMs across medical aspects and comparing their performance to that of humans. To generate straightforward multiple-choice questions and answers (QAs) based on evidence-based medicine (EBM), we used a comprehensive medical knowledge graph (encompassing data from more than 50,000 peer-reviewed articles) and created the "EBMQA" dataset. EBMQA contains 105,000 QAs labeled with medical and non-medical topics and classified as numerical or semantic questions. We benchmarked two state-of-the-art LLMs, Chat-GPT4 and Claude3-Opus, on more than 24,500 QAs from this dataset. We evaluated the LLMs' accuracy by question type (semantic and numerical) and by sub-labeled topic. For validation, six medical experts were tested on 100 numerical EBMQA questions. Both LLMs performed better on semantic than on numerical QAs, with Claude3 surpassing GPT4 on numerical QAs. However, both LLMs showed gaps between and within models across different medical aspects and remained inferior to humans. Their medical advice should therefore be treated with caution.
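The per-type and per-topic accuracy breakdown described in the abstract can be illustrated with a minimal sketch. The record fields (`type`, `topic`, `answer`, `predicted`) and the helper `accuracy_by` are hypothetical names chosen for this example, and the records are invented toy data, not EBMQA items or the authors' evaluation code.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Return {group: fraction of correct answers}, grouping records by `key`."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        group = r[key]
        totals[group] += 1
        # A prediction counts as correct only on an exact match with the key answer.
        if r["predicted"] == r["answer"]:
            correct[group] += 1
    return {g: correct[g] / totals[g] for g in totals}


# Toy records, invented for illustration only (hypothetical schema).
records = [
    {"type": "semantic", "topic": "cardiology", "answer": "A", "predicted": "A"},
    {"type": "semantic", "topic": "neurology", "answer": "B", "predicted": "B"},
    {"type": "numeric", "topic": "cardiology", "answer": "C", "predicted": "A"},
    {"type": "numeric", "topic": "neurology", "answer": "D", "predicted": "D"},
]

by_type = accuracy_by(records, "type")    # accuracy per question type
by_topic = accuracy_by(records, "topic")  # accuracy per sub-labeled topic
```

Running the same grouping once per model and once for the human experts would give the within-model and between-model comparisons the abstract refers to, under the assumption that all respondents answer the same question set.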