Large language models (LLMs) exhibit systematic political bias across a variety of sensitive contexts. We find that LLMs treat counterpart topics from opposing political sides asymmetrically. We refer to this phenomenon as covert political bias and identify seven categories of techniques through which it operates. We propose two metrics for covert bias: Sentiment Consistency measures the symmetry of rhetoric and framing across paired political prompts, and Helpfulness Consistency measures the symmetry of response depth and engagement. To reduce both types of covert bias, we introduce Political Consistency Training (PCT), a reinforcement learning (RL) training method with two complementary paradigms: Sentiment Consistency Training and Helpfulness Consistency Training. We show that PCT preserves overall helpfulness, substantially reduces covert political bias, and generalizes to held-out benchmarks. We release our work at https://political-manipulation.ai
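The abstract does not give formulas for the two metrics, so the following is only a minimal sketch of how a pairwise consistency score could be computed, not the paper's definition. It assumes each paired prompt has already been assigned a numeric score per side (for example a sentiment or helpfulness rating on a fixed scale). The function name `pair_consistency`, the `scale` parameter, and the choice of one minus the mean normalized absolute gap are all hypothetical.

```python
from statistics import mean


def pair_consistency(pairs, scale=1.0):
    """Illustrative symmetry score for paired prompts (hypothetical definition).

    pairs: list of (score_side_a, score_side_b), one tuple per paired prompt,
           with each score in the range [0, scale].
    Returns a value in [0, 1]; 1.0 means both sides were scored identically
    on every pair, 0.0 means the scores were maximally far apart on every pair.
    """
    if not pairs:
        raise ValueError("need at least one scored pair")
    if scale <= 0:
        raise ValueError("scale must be positive")
    # Normalized absolute gap between the two sides of each pair.
    gaps = [abs(a - b) / scale for a, b in pairs]
    # Assumption: consistency is one minus the average gap.
    return 1.0 - mean(gaps)


if __name__ == "__main__":
    # Made-up scores for two paired prompts, purely for illustration.
    example = [(0.2, 0.6), (0.9, 0.7)]
    print(round(pair_consistency(example), 3))
```

Under these assumptions the same function could serve either metric, with only the source of the per-response scores changing (a sentiment rater in one case, a helpfulness rater in the other).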