There have been growing concerns about high-stakes applications that rely on models trained with biased data and consequently produce biased predictions, often harming the most vulnerable. In particular, biased medical data could cause health-related applications and recommender systems to create outputs that jeopardize patient care and widen disparities in health outcomes. A recent framework titled Fairness via AI posits that, instead of attempting to correct model biases, researchers should focus on their root causes by using AI to debias data. Inspired by this framework, we tackle bias detection in medical curricula using NLP models, including LLMs, and evaluate them on a gold-standard dataset of 4,105 excerpts drawn from a large corpus and annotated for bias by medical experts. We build on previous work by coauthors that augments the set of negative samples with non-annotated text containing social identifier terms. However, some of these terms, especially those related to race and ethnicity, can carry different meanings in context (e.g., "white matter of spinal cord"). To address this issue, we propose using Word Sense Disambiguation (WSD) models to improve dataset quality by removing irrelevant sentences. We then evaluate fine-tuned variants of BERT models as well as GPT models with zero- and few-shot prompting. We find that LLMs, considered SOTA on many NLP tasks, are unsuitable for bias detection, while fine-tuned BERT models generally perform well across all evaluated metrics.
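To make the disambiguation step concrete, the sketch below shows one way such filtering could work. It is a minimal illustration, assuming NLTK's classic Lesk algorithm as a stand-in for the WSD models evaluated in the paper; the IDENTITY_TERMS set and the noun.person heuristic are hypothetical choices for this example, not the authors' implementation.

```python
# Minimal sketch: filter candidate negative samples so that only sentences
# where a social identifier term actually refers to people are kept.
# Assumes NLTK's Lesk algorithm as a simple stand-in WSD model.
import nltk
from nltk.wsd import lesk

nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

# Hypothetical subset of the social identifier terms used to mine negatives.
IDENTITY_TERMS = {"white", "black"}

def is_identity_use(sentence: str) -> bool:
    """Return True if an identifier term disambiguates to a person sense."""
    tokens = [t.strip(".,;:()") for t in sentence.lower().split()]
    for term in IDENTITY_TERMS & set(tokens):
        sense = lesk(tokens, term)  # WordNet synset selected by gloss overlap
        if sense is not None and sense.lexname() == "noun.person":
            return True  # e.g., "white" used as a racial identifier
    return False  # e.g., "white matter" resolves to a non-person sense

candidates = [
    "White matter of the spinal cord carries myelinated axons.",
    "White patients were prescribed analgesics more often than Black patients.",
]
# Keep only sentences where the term genuinely denotes a social identity.
negatives = [s for s in candidates if is_identity_use(s)]
```

In practice, a neural WSD model fine-tuned on sense-annotated data would likely outperform Lesk, which relies only on dictionary-gloss overlap; the point of the sketch is simply that sentences like "white matter of spinal cord" can be dropped automatically once the identifier term resolves to a non-person sense.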