As large language models (LLMs) become primary sources of health information for millions, their accuracy in women's health remains critically unexamined. We introduce the Women's Health Benchmark (WHB), the first benchmark evaluating LLM performance specifically in women's health. The benchmark comprises 96 rigorously validated test cases covering five medical specialties (obstetrics and gynecology, emergency medicine, primary care, oncology, and neurology), three query types (patient query, clinician query, and evidence/policy query), and eight error types (dosage/medication errors, missing critical information, outdated guidelines/treatment recommendations, incorrect treatment advice, incorrect factual information, missing/incorrect differential diagnosis, missed urgency, and inappropriate recommendations). Evaluating 13 state-of-the-art LLMs reveals alarming gaps: current models fail on approximately 60\% of the benchmark, with performance varying dramatically across specialties and error types. Notably, models universally struggle with "missed urgency" indicators, whereas newer models such as GPT-5 show significant improvements in avoiding inappropriate recommendations. Our findings underscore that AI chatbots are not yet fully capable of providing reliable advice in women's health.