A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models

Stephen R. Pfohl, Heather Cole-Lewis, Rory Sayres, Darlene Neal, Mercy Asiedu, Awa Dieng, Nenad Tomasev, Qazi Mamunur Rashid, Shekoofeh Azizi, Negar Rostamzadeh, Liam G. McCoy, Leo Anthony Celi, Yun Liu, Mike Schaekermann, Alanna Walton, Alicia Parrish, Chirag Nagpal, Preeti Singh, Akeiylah Dewitt, Philip Mansfield, Sushant Prakash, Katherine Heller, Alan Karthikesalingam, Christopher Semturs, Joelle Barral, Greg Corrado, Yossi Matias, Jamila Smith-Loud, Ivor Horn, Karan Singhal

Large language models (LLMs) hold immense promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. In this work, we present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions and then conduct an empirical case study with Med-PaLM 2, resulting in the largest human evaluation study in this area to date. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases, and EquityMedQA, a collection of seven newly-released datasets comprising both manually-curated and LLM-generated questions enriched for adversarial queries. Both our human assessment framework and dataset design process are grounded in an iterative participatory approach and review of possible biases in Med-PaLM 2 answers to adversarial queries. Through our empirical study, we find that the use of a collection of datasets curated through a variety of methodologies, coupled with a thorough evaluation protocol that leverages multiple assessment rubric designs and diverse rater groups, surfaces biases that may be missed via narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise. We emphasize that while our framework can identify specific forms of bias, it is not sufficient to holistically assess whether the deployment of an AI system promotes equitable health outcomes. We hope the broader community leverages and builds on these tools and methods towards realizing a shared goal of LLMs that promote accessible and equitable healthcare for all.
