Large language models (LLMs) have shown great potential for healthcare applications. However, existing evaluation benchmarks provide minimal coverage of Alzheimer's Disease and Related Dementias (ADRD). To address this gap, we introduce ADRD-Bench, the first ADRD-specific benchmark dataset designed for rigorous evaluation of LLMs. ADRD-Bench comprises two components: 1) ADRD Unified QA, a set of 1,352 questions consolidated from seven established medical benchmarks, providing a unified assessment of clinical knowledge; and 2) ADRD Caregiving QA, a novel set of 149 questions derived from the Aging Brain Care (ABC) program, a widely used, evidence-based brain health management program. Developed under the guidance of a program with national expertise in comprehensive ADRD care, this new set is designed to mitigate the lack of practical caregiving context in existing benchmarks. We evaluated 33 state-of-the-art LLMs on the proposed ADRD-Bench. The accuracy of open-weight general models ranged from 0.63 to 0.93 (mean: 0.78; std: 0.09); open-weight medical models ranged from 0.48 to 0.93 (mean: 0.82; std: 0.13); and closed-source general models ranged from 0.83 to 0.91 (mean: 0.89; std: 0.03). Although top-tier models achieved high accuracies (>0.9), case studies revealed that inconsistencies in reasoning quality and stability limit their reliability, highlighting a critical need for domain-specific improvements that ground LLMs' knowledge and reasoning in daily caregiving data. The entire dataset is available at https://github.com/IIRL-ND/ADRD-Bench.