Generalist Large Language Models (LLMs), such as GPT-4, have shown considerable promise in various domains, including medical diagnosis. Rare diseases, which affect approximately 300 million people worldwide, often have unsatisfactory clinical diagnosis rates, primarily because experienced physicians are scarce and differentiating among many rare diseases is complex. In this context, recent news such as "ChatGPT correctly diagnosed a 4-year-old's rare disease after 17 doctors failed" underscores the potential, yet underexplored, role of LLMs in the clinical diagnosis of rare diseases. To bridge this research gap, we introduce RareBench, a pioneering benchmark designed to systematically evaluate the capabilities of LLMs on four critical dimensions within the realm of rare diseases. We have also compiled the largest open-source dataset on rare disease patients, establishing a benchmark for future studies in this domain. To facilitate differential diagnosis of rare diseases, we develop a dynamic few-shot prompt methodology that leverages a comprehensive rare disease knowledge graph synthesized from multiple knowledge bases and significantly enhances LLMs' diagnostic performance. Moreover, we present an exhaustive comparative study of GPT-4's diagnostic capabilities against those of specialist physicians. Our experimental findings underscore the promising potential of integrating LLMs into the clinical diagnostic process for rare diseases and open up possibilities for future advances in this field.