Moral Sensitivity in LLMs: A Tiered Evaluation of Contextual Bias via Behavioral Profiling and Mechanistic Interpretability

Large language models (LLMs) are increasingly deployed in settings that require nuanced ethical reasoning, yet existing bias evaluations treat model outputs as simply "biased" or "unbiased." This binary framing misses the gradual, context-sensitive way bias actually emerges. We address this gap in two stages: behavioral profiling and mechanistic validation. In the behavioral stage, we introduce the Moral Sensitivity Index (MSI), a metric that quantifies the probability of biased output across a graduated, seven-tier stress test ranging from abstract numerical problems to scenarios rooted in historical and socioeconomic injustice. Evaluating four leading models (Claude 3.5, Qwen 3.5, Llama 3, and Gemini 1.5), we identify distinct behavioral signatures shaped by alignment design: for instance, Gemini 1.5 reaches 72.7% MSI by Tier 5 under socioeconomic framing, while Claude 3.5 exhibits sharp suppression consistent with identity-based safety training. We then validate these behavioral patterns mechanistically. We select criminal-bias scenarios, which produced the highest MSI scores across models, as probes and apply logit lens, attention analysis, activation patching, and semantic probing to a controlled set of six models spanning three capability tiers: small language models (SLMs), instruction-tuned base models, and reasoning-distilled variants. Circuit-level analysis reveals a U-curve of bias: SLMs exhibit strong criminal bias; scaling to instruction-tuned models eliminates it; and reasoning distillation reintroduces bias at SLM-like levels despite identical parameter counts, suggesting that distillation compresses reasoning traces in ways that reactivate shallow statistical associations. Critically, the socially loaded cues that drive high MSI scores activate the same bias-driving circuits identified in the mechanistic stage, providing cross-stage validation.
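As a rough illustration of the behavioral metric, the sketch below treats the MSI for a tier as the fraction of responses labeled biased at that tier. This is an assumption based only on the description of MSI as a probability of biased output; the function name `moral_sensitivity_index`, the `(tier, is_biased)` record format, and the toy labels are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def moral_sensitivity_index(records):
    """Return {tier: fraction of responses labeled biased}.

    `records` is an iterable of (tier, is_biased) pairs. Both the record
    format and the per-tier fraction are assumptions made for illustration.
    """
    totals = defaultdict(int)
    biased = defaultdict(int)
    for tier, is_biased in records:
        totals[tier] += 1
        biased[tier] += int(bool(is_biased))
    # Every tier in `totals` has at least one record, so no division by zero.
    return {tier: biased[tier] / totals[tier] for tier in sorted(totals)}


# Toy labels only (not results from the paper): two tiers of a seven-tier test.
toy_records = [
    (1, False), (1, False),
    (5, True), (5, True), (5, False), (5, True),
]
msi = moral_sensitivity_index(toy_records)
```

Under this reading, a per-tier profile such as `msi` can be compared across models to see at which tier biased outputs begin to appear.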
