In contested domains, instruction-tuned language models must balance user-alignment pressures against faithfulness to in-context evidence. To evaluate this tension, we introduce a controlled epistemic-conflict framework grounded in the U.S. National Climate Assessment. We conduct fine-grained ablations over evidence composition and uncertainty cues across 19 instruction-tuned models spanning 0.27B to 32B parameters. Under neutral prompts, richer evidence generally improves evidence-consistent accuracy and ordinal scoring performance. Under user pressure, however, evidence does not reliably prevent user-aligned reversals in this controlled fixed-evidence setting. We report three primary failure modes. First, we identify a negative partial-evidence interaction: adding epistemic nuance, specifically research gaps, is associated with increased susceptibility to sycophancy in families such as Llama-3 and Gemma-3. Second, robustness scales non-monotonically: within some families, certain low-to-mid scale models are especially sensitive to adversarial user pressure. Third, models differ in distributional concentration under conflict: some instruction-tuned models maintain sharply peaked ordinal distributions under pressure, while others are substantially more dispersed. In scale-matched Qwen comparisons, reasoning-distilled variants (DeepSeek-R1-Qwen) exhibit consistently higher dispersion than their instruction-tuned counterparts. These findings suggest that, in a controlled fixed-evidence setting, richer in-context evidence alone offers no guarantee against user pressure; robustness may require explicit training for epistemic integrity.
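As an illustration of what "distributional concentration" over an ordinal scale can mean, the following minimal sketch scores a distribution by its normalized Shannon entropy. This is an assumption made for illustration only: the abstract does not state which dispersion measure is used, and the function names and the two five-point example distributions below are hypothetical, not results from the study.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a discrete distribution, scaled to [0, 1].

    Values near 0 indicate a sharply peaked distribution; values near 1
    indicate mass spread almost uniformly over the ordinal levels.
    NOTE: an illustrative dispersion measure, not necessarily the paper's.
    """
    total = sum(probs)
    if len(probs) < 2 or total <= 0:
        raise ValueError("need at least two levels and positive total mass")
    p = [x / total for x in probs]  # renormalize defensively
    entropy = -sum(x * math.log(x) for x in p if x > 0)
    return entropy / math.log(len(p))


def ordinal_mean(probs):
    """Expected score when the levels are labeled 1..K."""
    total = sum(probs)
    return sum((i + 1) * x for i, x in enumerate(probs)) / total


# Hypothetical five-point ordinal distributions (made-up numbers).
peaked = [0.02, 0.03, 0.90, 0.03, 0.02]
dispersed = [0.15, 0.20, 0.30, 0.20, 0.15]

if __name__ == "__main__":
    for name, dist in [("peaked", peaked), ("dispersed", dispersed)]:
        print(name, round(ordinal_mean(dist), 3), round(normalized_entropy(dist), 3))
```

Both example distributions share the same expected score, so a mean-only summary would not distinguish them; the entropy term is what separates a peaked response from a dispersed one.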