As state-of-the-art Large Language Models (LLMs) have become ubiquitous, ensuring equitable performance across diverse demographics is critical. However, it remains unclear whether observed performance disparities arise from the explicitly stated identity itself or from the way identity is signaled. In real-world interactions, a user's identity is often conveyed implicitly through a complex combination of socio-linguistic factors. This study disentangles these signals by employing a factorial design with over 24,000 responses from two open-weight LLMs (Gemma-3-12B and Qwen-3-VL-8B), comparing prompts with explicitly announced user profiles against prompts carrying implicit dialect signals (e.g., AAVE, Singlish) across several sensitive domains. Our results uncover a paradox in LLM safety: users achieve ``better'' performance by sounding like a demographic than by stating they belong to it. Explicit identity prompts activate aggressive safety filters, increasing refusal rates and reducing semantic similarity to our reference texts for Black users. In contrast, implicit dialect cues trigger a powerful ``dialect jailbreak,'' reducing refusal probability to near zero while achieving higher semantic similarity to the reference texts than Standard American English prompts. However, this ``dialect jailbreak'' introduces a critical safety trade-off in content sanitization. We find that current safety alignment techniques are brittle and over-indexed on explicit keywords, creating a bifurcated user experience in which ``standard'' users receive cautious, sanitized information while dialect speakers navigate a rawer, less sanitized, and potentially more hostile information landscape. These findings highlight a fundamental tension in alignment between equity and linguistic diversity, and underscore the need for safety mechanisms that generalize beyond explicit cues.