Hedging and non-affirmation are behaviors of large language models (LLMs) that stop short of clearly endorsing a specific statement. These behaviors can be desirable in subjective contexts, but they are undesirable in the context of human rights, which apply unambiguously to all groups. We present a systematic framework for measuring these behaviors in unconstrained LLM responses about various identity groups. We evaluate six large proprietary models and one open-weight LLM on 4,738 prompts spanning 205 national and stateless ethnic identities, and find that four of the seven display hedging and non-affirmation that depend significantly on the identity of the group. Factors such as conflict signals, sovereignty (whether the identity is stateless), and economic indicators (GDP) also influence model behavior, but their effect sizes are consistently smaller than that of identity itself. This systematic disparity is robust across methods of rephrasing the prompts. Because group identity is the strongest predictor of these behaviors, we use open-weight models to explore whether applying steering and orthogonalization techniques to these group identities can reduce the rates of hedging and non-affirmation. We find that group steering is the most effective debiasing approach across query types and is robust to downstream forgetting.
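For readers unfamiliar with the two intervention families named above, the following is a minimal sketch of what steering and orthogonalization generally mean when applied to a hidden-state vector. It is not the procedure used in this work: the difference-of-means construction of the group direction, the function names, and the toy two-dimensional activations are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale v to unit length; a zero vector has no direction."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [a / norm for a in v]


def mean_difference_direction(group_acts, baseline_acts):
    """Hypothetical 'group direction': the unit vector pointing from the
    mean baseline activation to the mean group activation (an assumption,
    not necessarily how the direction is obtained in the paper)."""
    dim = len(group_acts[0])
    mean_g = [sum(a[i] for a in group_acts) / len(group_acts) for i in range(dim)]
    mean_b = [sum(a[i] for a in baseline_acts) / len(baseline_acts) for i in range(dim)]
    return normalize([g - b for g, b in zip(mean_g, mean_b)])


def orthogonalize(hidden, direction):
    """Remove the component of `hidden` that lies along `direction`."""
    d = normalize(direction)
    coeff = dot(hidden, d)
    return [h - coeff * a for h, a in zip(hidden, d)]


def steer(hidden, direction, alpha):
    """Shift `hidden` by `alpha` units along `direction`
    (a negative alpha pushes away from the direction)."""
    d = normalize(direction)
    return [h + alpha * a for h, a in zip(hidden, d)]


if __name__ == "__main__":
    # Toy activations, invented purely for illustration.
    direction = mean_difference_direction(
        group_acts=[[2.0, 0.0], [4.0, 0.0]],
        baseline_acts=[[0.0, 0.0], [0.0, 0.0]],
    )
    hidden = [3.0, 4.0]
    print(orthogonalize(hidden, direction))
    print(steer(hidden, direction, alpha=-2.0))
```

In this sketch, orthogonalization removes the identity-related component entirely, while steering shifts it by a tunable amount. In practice either operation would be applied to a model's internal activations at chosen layers rather than to hand-written vectors.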