Large language models (LLMs) hold promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. We present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions and conduct a large-scale empirical case study with the Med-PaLM 2 LLM. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases, and EquityMedQA, a collection of seven datasets enriched for adversarial queries. Both our human assessment framework and dataset design process are grounded in an iterative participatory approach and review of Med-PaLM 2 answers. Through our empirical study, we find that our approach surfaces biases that may be missed via narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise. While our approach is not sufficient to holistically assess whether the deployment of an AI system promotes equitable health outcomes, we hope that it can be leveraged and built upon towards a shared goal of LLMs that promote accessible and equitable healthcare.