Hedging and Non-Affirmation: Quantifying LLM Alignment on Questions of Human Rights

Rafiya Javed, Cassandra Parent, Jackie Kay, David Yanni, Abdullah Zaini, Anushe Sheikh, Maribeth Rauh, Walter Gerych, Ramona Comanescu, Iason Gabriel, Marzyeh Ghassemi, Laura Weidinger

Hedging and non-affirmation are behaviors exhibited by large language models (LLMs) that limit the clear endorsement of specific statements. While these behaviors are desirable in subjective contexts, they are undesirable in the context of human rights, which apply unambiguously to all groups. We present a systematic framework to measure these behaviors in unconstrained LLM responses regarding various identity groups. We evaluate six large proprietary models and one open-weight LLM on 4738 prompts across 205 national and stateless ethnic identities and find that four of the seven models display hedging and non-affirmation that depend significantly on the identity of the group. While factors such as conflict signals, sovereignty (whether the identity is stateless), and economic indicators (GDP) also influence model behavior, their effect sizes are consistently weaker than that of identity itself. This systematic disparity is robust to rephrasing of the prompts. Since group identity is the strongest predictor of these behaviors, we use open-weight models to explore whether applying steering and orthogonalization techniques to these group identities can reduce the rates of hedging and non-affirmation. We find that group steering is the most effective debiasing approach across query types and is robust to downstream forgetting.

