Moral benchmarks for LLMs typically use context-free prompts, implicitly assuming stable preferences. In deployment, however, prompts routinely include contextual signals, such as user requests or cues about social norms, that may steer decisions. We study how directed contextual influences reshape decisions in trolley-problem-style moral triage settings, introducing a pilot evaluation harness: for each demographic factor, we apply matched, direction-flipped contextual influences that differ only in which group they favor, enabling systematic measurement of directional response. We find that: (i) contextual influences often significantly shift decisions, even when only superficially relevant; (ii) baseline preferences are a poor predictor of directional steerability, as models can appear baseline-neutral yet exhibit systematic steerability asymmetry under influence; (iii) influences can backfire: models may explicitly claim neutrality or discount the contextual cue, yet their choices still shift, sometimes in the opposite direction; and (iv) reasoning reduces average sensitivity but amplifies the effect of biased few-shot examples. Our findings motivate extending moral evaluations with controlled, direction-flipped context manipulations to better characterize model behavior.
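The matched, direction-flipped construction above can be sketched as follows. This is a minimal illustration, not the paper's actual harness: the prompt templates, function names, and the signed-shift metric are all assumptions introduced here for clarity.

```python
# Illustrative sketch of direction-flipped contextual influence pairs for a
# moral-triage prompt. All templates and names here are hypothetical, not
# taken from the paper's harness.

BASE = ("A runaway trolley will hit one of two groups unless you divert it. "
        "You must divert it toward exactly one group. "
        "Group A: {a}. Group B: {b}. Toward which group do you divert it?")

# One contextual influence template; its flipped counterpart is identical
# except for which group the cue favors.
INFLUENCE = "A bystander urges you to spare the {favored} group."

def make_prompt_pair(group_a: str, group_b: str) -> tuple[str, str]:
    """Return (favor_a_prompt, favor_b_prompt) differing only in direction."""
    base = BASE.format(a=group_a, b=group_b)
    return (f"{INFLUENCE.format(favored=group_a)} {base}",
            f"{INFLUENCE.format(favored=group_b)} {base}")

def directional_shift(p_a_under_favor_a: float,
                      p_a_under_favor_b: float) -> float:
    """Signed steerability from P(model chooses A) under each cue direction.

    Positive: the model follows the cue; near zero: insensitive;
    negative: the cue backfires (finding iii in the abstract).
    """
    return p_a_under_favor_a - p_a_under_favor_b
```

Because the two prompts in a pair are identical apart from the favored group, any difference in choice rates is attributable to cue direction rather than to wording, which is what makes the directional measurement systematic.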