Making Bias Non-Predictive: Training Robust LLM Reasoning via Reinforcement Learning

Large language models (LLMs) increasingly serve as reasoners and automated evaluators, yet they remain susceptible to cognitive biases, often altering their reasoning when faced with spurious prompt-level cues such as consensus claims or authority appeals. Existing mitigations via prompting or supervised fine-tuning fail to generalize, as they modify surface behavior without changing the optimization objective that makes bias cues attractive. We propose \textbf{Epistemic Independence Training (EIT)}, a reinforcement learning framework grounded in a key principle: to learn independence, bias cues must be made non-predictive of reward. EIT operationalizes this through a balanced conflict strategy where bias signals are equally likely to support correct and incorrect answers, combined with a reward design that penalizes bias-following without rewarding bias agreement. Experiments on Qwen3-4B demonstrate that EIT improves both accuracy and robustness under adversarial biases, while preserving performance when bias aligns with truth. Notably, models trained only on bandwagon bias generalize to unseen bias types such as authority and distraction, indicating that EIT induces transferable epistemic independence rather than bias-specific heuristics. EIT further generalizes across benchmarks (MedQA, HellaSwag), model families (Llama-3.2-3B), and scales (Qwen3-8B), and outperforms distribution-shift methods (GroupDRO, IRM) without requiring environment labels. Code and data are available at https://anonymous.4open.science/r/bias-mitigation-with-rl-BC47
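
To make the balanced conflict strategy and the reward design concrete, the following is a minimal Python sketch, not the released implementation. The names `Example`, `add_bandwagon_cue`, and `reward`, the cue wording, and the numeric values (reward 1.0, penalty 0.5) are illustrative assumptions; the abstract specifies only that the cue supports correct and incorrect answers equally often, and that following the bias is penalized while agreeing with it earns no bonus.

```python
import random
from dataclasses import dataclass


@dataclass
class Example:
    question: str
    options: dict[str, str]  # option label -> option text
    answer: str              # gold option label


def add_bandwagon_cue(ex: Example, rng: random.Random) -> tuple[str, str]:
    """Return (prompt, cued_label).

    Balanced conflict (assumed form): the cue backs the gold answer with
    probability 0.5 and a randomly chosen wrong option otherwise.
    """
    wrong = [label for label in ex.options if label != ex.answer]
    cued = ex.answer if rng.random() < 0.5 else rng.choice(wrong)
    option_text = "\n".join(f"{k}. {v}" for k, v in ex.options.items())
    # Hypothetical cue wording, chosen only for illustration.
    prompt = f"{ex.question}\n{option_text}\nMost people think the answer is {cued}."
    return prompt, cued


def reward(pred: str, gold: str, cued: str, penalty: float = 0.5) -> float:
    """Illustrative reward: values are assumptions, not taken from the paper."""
    if pred == gold:
        # Correct answers earn the same reward whether or not the cue agreed,
        # so agreement with the cue is never rewarded by itself.
        return 1.0
    if pred == cued:
        # Wrong answer that matches the cue: the model followed the bias.
        return -penalty
    # Wrong answer unrelated to the cue.
    return 0.0
```

In this sketch, a policy that copies the cue is correct only about half the time and is penalized on the other half, whereas a policy that answers from the question content is unaffected by where the cue points.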
