As language models are integrated into a growing range of domains, how they respond to user pushback becomes a critical alignment property. Yet many existing evaluations treat compliance as unidirectional: they measure whether models resist pressure, but not whether they resist it selectively. We introduce Compliance Asymmetry (A = BCR/HCR), a bidirectional diagnostic that compares beneficial output change under helpful nudges with harmful output change under misleading nudges. Across 9 models and 972,000 nudge-condition responses, we find that this selectivity differs between factual and moral judgments: models follow helpful nudges more than harmful ones on factual questions (A = 1.58), but follow both directions at nearly identical rates on moral questions (A = 1.04). This pattern persists across model families, capability levels, and nudge types. We also find that chain-of-thought prompting amplifies helpful and harmful compliance together, while identity-based prompting suppresses both by nearly identical margins. These results identify direction-blind moral compliance as a distinct failure mode in current LLMs and suggest that alignment should target directionally calibrated updating rather than lower compliance alone.
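The ratio A = BCR/HCR can be illustrated with a minimal sketch. It assumes that BCR is the fraction of helpful-nudge trials in which the output changes for the better, and that HCR is the fraction of misleading-nudge trials in which it changes for the worse; the abstract does not spell out either abbreviation or the exact denominators. The `NudgeCounts` type, its field names, and the counts used below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class NudgeCounts:
    """Hypothetical tallies for one model on one question type."""
    helpful_trials: int      # trials that received a helpful nudge (assumed denominator)
    beneficial_changes: int  # of those, outputs that changed for the better
    misleading_trials: int   # trials that received a misleading nudge (assumed denominator)
    harmful_changes: int     # of those, outputs that changed for the worse


def compliance_asymmetry(c: NudgeCounts) -> float:
    """Return A = BCR / HCR under the assumed definitions above."""
    if c.helpful_trials <= 0 or c.misleading_trials <= 0:
        raise ValueError("both nudge conditions need at least one trial")
    bcr = c.beneficial_changes / c.helpful_trials
    hcr = c.harmful_changes / c.misleading_trials
    if hcr == 0:
        # No harmful compliance at all: the ratio is unbounded.
        return float("inf")
    return bcr / hcr


# Toy counts, not taken from the paper.
selective = NudgeCounts(helpful_trials=100, beneficial_changes=75,
                        misleading_trials=100, harmful_changes=50)
direction_blind = NudgeCounts(helpful_trials=100, beneficial_changes=50,
                              misleading_trials=100, harmful_changes=50)

print(compliance_asymmetry(selective))        # ratio above 1: helpful nudges followed more often
print(compliance_asymmetry(direction_blind))  # ratio of 1: both directions followed equally
```

Under this reading, A near 1 corresponds to the direction-blind behavior reported for moral questions, while A well above 1 corresponds to the selective behavior reported for factual questions.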