Large language models (LLMs) hold promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. We present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions and conduct a large-scale empirical case study with the Med-PaLM 2 LLM. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases, and EquityMedQA, a collection of seven datasets enriched for adversarial queries. Both our human assessment framework and dataset design process are grounded in an iterative participatory approach and review of Med-PaLM 2 answers. Through our empirical study, we find that our approach surfaces biases that may be missed via narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise. While our approach is not sufficient to holistically assess whether the deployment of an AI system promotes equitable health outcomes, we hope that it can be leveraged and built upon towards a shared goal of LLMs that promote accessible and equitable healthcare.