While the biases of language models in production are extensively documented, the biases of their guardrails have been neglected. This paper studies how contextual information about the user influences the likelihood of an LLM to refuse to execute a request. By generating user biographies that offer ideological and demographic information, we find a number of biases in guardrail sensitivity on GPT-3.5. Younger, female, and Asian-American personas are more likely to trigger a refusal guardrail when requesting censored or illegal information. Guardrails are also sycophantic, refusing to comply with requests for a political position the user is likely to disagree with. We find that certain identity groups and seemingly innocuous information, e.g., sports fandom, can elicit changes in guardrail sensitivity similar to direct statements of political ideology. For each demographic category and even for American football team fandom, we find that ChatGPT appears to infer a likely political ideology and modify guardrail behavior accordingly.
翻译:尽管生产环境中语言模型的偏见已被广泛记录,但其护栏机制的偏见却长期被忽视。本文研究了用户上下文信息如何影响大语言模型拒绝执行请求的可能性。通过生成包含意识形态与人口统计信息的用户传记,我们在 GPT-3.5 的护栏敏感度中发现了多种偏见。当请求审查内容或非法信息时,年轻、女性和亚裔美国人身份更易触发拒绝护栏。护栏机制亦存在谄媚倾向,会拒绝提供用户可能反对的政治立场信息。研究发现,特定身份群体及看似无害的信息(如体育迷身份)可能引发与直接声明政治意识形态相似的护栏敏感度变化。针对每个人口统计类别乃至美式橄榄球队粉丝群体,ChatGPT 似乎都会推断其潜在政治立场并相应调整护栏行为。