Causal inference is essential for decision-making but remains challenging for non-experts. While large language models (LLMs) show promise in this domain, their precise causal estimation capabilities are still limited, and the impact of post-training on these abilities is insufficiently explored. This paper examines the extent to which post-training can enhance LLMs' capacity for causal inference. We introduce CauGym, a comprehensive dataset comprising seven core causal tasks for training and five diverse test sets. Using this dataset, we systematically evaluate five post-training approaches: SFT, DPO, KTO, PPO, and GRPO. Across five in-domain and four existing benchmarks, our experiments demonstrate that appropriate post-training enables smaller LLMs to perform causal inference competitively, often surpassing much larger models. Our 14B-parameter model achieves 93.5% accuracy on the CaLM benchmark, compared to 55.4% for OpenAI o3. Furthermore, the post-trained LLMs exhibit strong generalization and robustness under real-world conditions such as distribution shifts and noisy data. Collectively, these findings provide the first systematic evidence that targeted post-training can produce reliable and robust LLM-based causal reasoners. Our data and GRPO model are available at https://github.com/OpenCausaLab/CauGym.