Ideological Bias in LLMs' Economic Causal Reasoning

Do large language models (LLMs) exhibit systematic ideological bias when reasoning about economic causal effects? LLMs are increasingly used in policy analysis and economic reporting, where directionally correct causal judgments are essential, so this question has direct practical stakes. We present a systematic evaluation that extends the EconCausal benchmark with ideology-contested cases: instances where intervention-oriented (pro-government) and market-oriented (pro-market) perspectives predict divergent causal signs. From 10,490 causal triplets (treatment-outcome pairs with empirically verified effect directions) derived from top-tier economics and finance journals, we identify 1,056 ideology-contested instances and evaluate 20 state-of-the-art LLMs on their ability to predict the empirically supported causal direction. We find that ideology-contested items are consistently harder than non-contested ones, and that for 18 of the 20 models, accuracy is systematically higher when the empirically verified causal sign aligns with intervention-oriented expectations than when it aligns with market-oriented ones. Moreover, when models err, their incorrect predictions disproportionately lean intervention-oriented, and one-shot in-context prompting does not eliminate this directional skew. These results show that LLMs are not only less accurate on ideologically contested economic questions but also systematically less reliable in one ideological direction than the other, underscoring the need for direction-aware evaluation in high-stakes economic and policy settings.
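
To make the notion of direction-aware evaluation concrete, the following is a minimal sketch of how the two quantities described above (accuracy split by which perspective the verified sign supports, and the direction in which errors lean) could be computed. It is not the evaluation code behind the reported results: the `Item` fields, the `direction_aware_report` function, the sign encoding, and the toy data are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    # All field names are hypothetical; the benchmark's real schema may differ.
    gold: str          # empirically verified causal sign, e.g. "+" or "-"
    pred: str          # sign predicted by the model
    intervention: str  # sign expected by the intervention-oriented view
    market: str        # sign expected by the market-oriented view


def direction_aware_report(items):
    """Split accuracy by the view the gold sign supports and count error leanings."""
    def accuracy(subset):
        if not subset:
            return float("nan")
        return sum(i.pred == i.gold for i in subset) / len(subset)

    gold_intervention = [i for i in items if i.gold == i.intervention]
    gold_market = [i for i in items if i.gold == i.market]
    errors = [i for i in items if i.pred != i.gold]
    return {
        "acc_gold_intervention": accuracy(gold_intervention),
        "acc_gold_market": accuracy(gold_market),
        "errors_leaning_intervention": sum(i.pred == i.intervention for i in errors),
        "errors_leaning_market": sum(i.pred == i.market for i in errors),
    }


# Toy items invented purely for illustration; they are not benchmark data.
toy = [
    Item(gold="+", pred="+", intervention="+", market="-"),
    Item(gold="+", pred="+", intervention="+", market="-"),
    Item(gold="-", pred="+", intervention="+", market="-"),
    Item(gold="-", pred="-", intervention="+", market="-"),
]
report = direction_aware_report(toy)
print(report)
```

In this toy setup, a gap between `acc_gold_intervention` and `acc_gold_market`, together with an imbalance between the two error counts, would correspond to the kind of directional skew described above.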