Large language models (LLMs) have been shown to exhibit social bias, however, bias towards non-protected stigmatized identities remain understudied. Furthermore, what social features of stigmas are associated with bias in LLM outputs is unknown. From psychology literature, it has been shown that stigmas contain six shared social features: aesthetics, concealability, course, disruptiveness, origin, and peril. In this study, we investigate if human and LLM ratings of the features of stigmas, along with prompt style and type of stigma, have effect on bias towards stigmatized groups in LLM outputs. We measure bias against 93 stigmatized groups across three widely used LLMs (Granite 3.0-8B, Llama-3.1-8B, Mistral-7B) using SocialStigmaQA, a benchmark that includes 37 social scenarios about stigmatized identities; for example deciding wether to recommend them for an internship. We find that stigmas rated by humans to be highly perilous (e.g., being a gang member or having HIV) have the most biased outputs from SocialStigmaQA prompts (60% of outputs from all models) while sociodemographic stigmas (e.g. Asian-American or old age) have the least amount of biased outputs (11%). We test if the amount of biased outputs could be decreased by using guardrail models, models meant to identify harmful input, using each LLM's respective guardrail model (Granite Guardian 3.0, Llama Guard 3.0, Mistral Moderation API). We find that bias decreases significantly by 10.4%, 1.4%, and 7.8%, respectively. However, we show that features with significant effect on bias remain unchanged post-mitigation and that guardrail models often fail to recognize the intent of bias in prompts. This work has implications for using LLMs in scenarios involving stigmatized groups and we suggest future work towards improving guardrail models for bias mitigation.
翻译:大型语言模型(LLM)已被证明存在社会偏见,然而针对非受保护污名化身份的偏见仍未得到充分研究。此外,污名的哪些社会特征与LLM输出中的偏见相关尚不明确。心理学文献表明,污名包含六个共同的社会特征:审美性、可隐藏性、病程、破坏性、起源和危险性。本研究探讨人类和LLM对污名特征的评分,以及提示风格和污名类型,是否会影响LLM输出中对污名化群体的偏见。我们使用SocialStigmaQA基准测试(包含37个关于污名化身份的社会场景,例如决定是否推荐其参加实习),测量了三种广泛使用的LLM(Granite 3.0-8B、Llama-3.1-8B、Mistral-7B)对93个污名化群体的偏见。研究发现,被人类评为高危险性的污名(例如帮派成员或HIV感染者)在SocialStigmaQA提示中产生的偏见输出最多(占所有模型输出的60%),而社会人口学污名(例如亚裔美国人或高龄)产生的偏见输出最少(11%)。我们通过使用各LLM对应的护栏模型(Granite Guardian 3.0、Llama Guard 3.0、Mistral Moderation API)测试了偏见输出量是否能够减少。结果表明,偏见分别显著降低了10.4%、1.4%和7.8%。然而,我们发现对偏见有显著影响的特征在缓解后保持不变,且护栏模型经常无法识别提示中的偏见意图。这项工作对在涉及污名化群体的场景中使用LLM具有启示意义,我们建议未来研究应致力于改进用于偏见缓解的护栏模型。