Moral Sensitivity in LLMs: A Tiered Evaluation of Contextual Bias via Behavioral Profiling and Mechanistic Interpretability

Large language models (LLMs) are increasingly deployed in settings that require nuanced ethical reasoning, yet existing bias evaluations treat model outputs as simply "biased" or "unbiased." This binary framing misses the gradual, context-sensitive way bias actually emerges. We address this gap in two stages: behavioral profiling and mechanistic validation. In the behavioral stage, we introduce the Moral Sensitivity Index (MSI), a metric that quantifies the probability of biased output across a graduated, seven-tier stress test ranging from abstract numerical problems to scenarios rooted in historical and socioeconomic injustice. Evaluating four leading models (Claude 3.5, Qwen 3.5, Llama 3, and Gemini 1.5), we identify distinct behavioral signatures shaped by alignment design: for instance, Gemini 1.5 reaches 72.7% MSI by Tier 5 under socioeconomic framing, while Claude exhibits sharp suppression consistent with identity-based safety training. We then verify these behavioral patterns mechanistically. We select criminal-bias scenarios, which produced the highest MSI scores across models, as probes and apply logit lens, attention analysis, activation patching, and semantic probing to a controlled set of six models spanning three capability tiers: small language models (SLMs), instruction-tuned base models, and reasoning-distilled variants. Circuit-level analysis reveals a U-curve of bias: SLMs exhibit strong criminal bias; scaling to instruction-tuned models eliminates it; reasoning distillation reintroduces bias to SLM-like levels despite identical parameter counts, suggesting distillation compresses reasoning traces in ways that reactivate shallow statistical associations. Critically, the socially loaded cues that drive high MSI scores activate the same bias-driving circuits identified mechanistically, providing cross-stage validation.
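As a rough illustration of the behavioral stage, the sketch below computes a per-tier bias rate from labeled model responses. It assumes that the MSI for a tier is simply the fraction of responses at that tier judged biased; the abstract does not give the exact formula, so this definition, the function name `msi_by_tier`, and the sample labels are all hypothetical.

```python
from collections import defaultdict


def msi_by_tier(records):
    """Return {tier: fraction of responses labeled biased}.

    `records` is an iterable of (tier, is_biased) pairs. Treating MSI as a
    plain per-tier proportion is an assumption made for this sketch, not the
    paper's stated definition.
    """
    counts = defaultdict(lambda: [0, 0])  # tier -> [biased count, total count]
    for tier, is_biased in records:
        counts[tier][0] += int(bool(is_biased))
        counts[tier][1] += 1
    return {tier: biased / total for tier, (biased, total) in sorted(counts.items())}


# Hypothetical labels invented for illustration only (not data from the paper).
records = [
    (1, False), (1, False), (1, False), (1, True),
    (5, True), (5, True), (5, False), (5, True),
]
profile = msi_by_tier(records)
```

Plotting such a profile across all seven tiers would give the kind of graduated behavioral signature the abstract describes, in place of a single biased/unbiased verdict.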
