A Women's Health Benchmark for Large Language Models

Victoria-Elisabeth Gruber, Razvan Marinescu, Diego Fajardo, Amin H. Nassar, Christopher Arkfeld, Alexandria Ludlow, Shama Patel, Mehrnoosh Samaei, Valerie Klug, Anna Huber, Marcel Gühner, Albert Botta i Orfila, Irene Lagoja, Kimya Tarr, Haleigh Larson, Mary Beth Howard

From arXiv, 15 pages, 6 figures, 2 tables

As large language models (LLMs) become primary sources of health information for millions, their accuracy in women's health remains critically unexamined. We introduce the Women's Health Benchmark (WHB), the first benchmark evaluating LLM performance specifically in women's health. Our benchmark comprises 96 rigorously validated model stumps covering five medical specialties (obstetrics and gynecology, emergency medicine, primary care, oncology, and neurology), three query types (patient query, clinician query, and evidence/policy query), and eight error types (dosage/medication errors, missing critical information, outdated guidelines/treatment recommendations, incorrect treatment advice, incorrect factual information, missing/incorrect differential diagnosis, missed urgency, and inappropriate recommendations). We evaluated 13 state-of-the-art LLMs and found alarming gaps: current models show failure rates of approximately 60% on the benchmark, with performance varying dramatically across specialties and error types. Notably, models universally struggle with "missed urgency" indicators, while newer models such as GPT-5 show significant improvements in avoiding inappropriate recommendations. Our findings underscore that AI chatbots are not yet fully capable of providing reliable advice in women's health.
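To make the reported breakdown concrete, here is a minimal sketch of how per-specialty or per-error-type failure rates could be tabulated from graded model responses. The category names come from the abstract; the `Result` record, the `failure_rates` helper, the model name, and the toy rows are hypothetical illustrations, not the authors' data format or grading procedure.

```python
from collections import defaultdict
from dataclasses import dataclass

# Category labels taken from the abstract.
SPECIALTIES = (
    "obstetrics and gynecology", "emergency medicine",
    "primary care", "oncology", "neurology",
)
QUERY_TYPES = ("patient query", "clinician query", "evidence/policy query")
ERROR_TYPES = (
    "dosage/medication errors", "missing critical information",
    "outdated guidelines/treatment recommendations",
    "incorrect treatment advice", "incorrect factual information",
    "missing/incorrect differential diagnosis",
    "missed urgency", "inappropriate recommendations",
)


@dataclass(frozen=True)
class Result:
    """One graded model response (hypothetical record layout)."""
    model: str
    specialty: str
    query_type: str
    error_type: str
    failed: bool


def failure_rates(results, key):
    """Return the fraction of failed responses per value of `key`."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for r in results:
        group = getattr(r, key)
        totals[group] += 1
        failures[group] += int(r.failed)
    return {group: failures[group] / totals[group] for group in totals}


# Made-up rows for illustration only; these are not results from the paper.
toy_results = [
    Result("model-a", "oncology", "patient query", "missed urgency", True),
    Result("model-a", "oncology", "clinician query", "dosage/medication errors", False),
    Result("model-a", "neurology", "patient query", "missed urgency", True),
]

by_specialty = failure_rates(toy_results, "specialty")
by_error_type = failure_rates(toy_results, "error_type")
```

Grouping by a single attribute keeps the sketch short; the same helper can be called with `"query_type"` or `"model"` to slice the results along the other dimensions the abstract mentions.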

