Large Language Models (LLMs) have been shown to exhibit disproportionate biases against certain groups. However, unprovoked, targeted attacks by LLMs on at-risk populations remain underexplored. Our paper presents three novel contributions: (1) an explicit evaluation of LLM-generated attacks on highly vulnerable mental health groups; (2) a network-based framework for studying the propagation of relative biases; and (3) an assessment of the relative degree of stigmatization that emerges from these attacks. Our analysis of a recently released large-scale bias audit dataset reveals that mental health entities occupy central positions within attack narrative networks, as evidenced by a significantly higher mean closeness centrality (p = 4.06e-10) and dense clustering (Gini coefficient = 0.7). Drawing on an established stigmatization framework, our analysis indicates that labeling components increase for mental health disorder-related targets relative to initial targets in generation chains. Taken together, these insights shed light on the structural propensity of large language models to amplify harmful discourse and highlight the need for suitable mitigation approaches.
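To make the two reported network measurements concrete, the following is a minimal sketch, assuming a directed attack-narrative graph in which edges point from one attack target to the next target in a generation chain, and assuming a labeled set of mental health nodes; the toy graph, the label set, and the choice of a Mann-Whitney U test are illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch: comparing closeness centrality of mental health
# entities against other nodes in an attack-narrative network, and
# measuring how unevenly attacks concentrate using a Gini coefficient.
# The graph, node labels, and test choice are illustrative assumptions.
import networkx as nx
import numpy as np
from scipy.stats import mannwhitneyu

# Toy directed graph: an edge (u, v) means an attack on target u
# propagated to target v in a generation chain (assumed encoding).
G = nx.DiGraph()
G.add_edges_from([
    ("depression", "anxiety"), ("anxiety", "ptsd"),
    ("group_a", "depression"), ("group_b", "group_a"),
    ("ptsd", "depression"), ("group_c", "group_b"),
])
mental_health = {"depression", "anxiety", "ptsd"}  # assumed label set

closeness = nx.closeness_centrality(G)
mh_scores = [c for n, c in closeness.items() if n in mental_health]
other_scores = [c for n, c in closeness.items() if n not in mental_health]

# Nonparametric one-sided test: are mental health nodes more central?
stat, p = mannwhitneyu(mh_scores, other_scores, alternative="greater")
print(f"U = {stat:.2f}, p = {p:.3g}")

def gini(values):
    """Gini coefficient of a non-negative sample (0 = uniform, 1 = fully concentrated)."""
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    cum = np.cumsum(x)
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

# Concentration of in-degree, i.e., how densely attacks cluster on a few nodes.
print("Gini of in-degree:", gini([d for _, d in G.in_degree()]))
```

On a real audit dataset, the same comparison would run over the full extracted network rather than a toy graph; the nonparametric test is used here because centrality scores are typically not normally distributed.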