In practical statistical causal discovery (SCD), embedding domain expert knowledge as constraints into the algorithm is essential for building consistent and meaningful causal models, although systematically acquiring such background knowledge remains challenging. To overcome these challenges, this paper proposes a novel methodology for causal inference in which SCD methods and knowledge-based causal inference (KBCI) with a large language model (LLM) are synthesized through ``statistical causal prompting (SCP)'' for LLMs and prior-knowledge augmentation for SCD. Experiments reveal that GPT-4 can bring the output of LLM-KBCI, and the SCD result augmented with that prior knowledge, closer to the ground truth, and that the SCD result can be further improved if GPT-4 undergoes SCP. Furthermore, using an unpublished real-world dataset, we demonstrate that the background knowledge provided by the LLM can improve SCD on this dataset, even though the dataset has never been included in the LLM's training data. The proposed approach can thus address challenges such as dataset biases and limitations, illustrating the potential of LLMs to improve data-driven causal inference across diverse scientific domains.
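The SCP workflow described above can be sketched in heavily simplified form. This is a conceptual illustration only, not the paper's implementation: `mock_llm_kbci`, `scd_with_prior`, the toy variables, and the threshold-based edge selection are all hypothetical stand-ins (a real pipeline would query an actual LLM for KBCI and run a genuine SCD algorithm such as DirectLiNGAM with a prior-knowledge matrix).

```python
import numpy as np

def mock_llm_kbci(pairs, scd_summary=None):
    # Hypothetical stand-in for knowledge-based causal inference with an LLM.
    # Under SCP, the prompt would also include scd_summary (the initial SCD
    # result); here we simply return fixed expert-style confidence scores.
    knowledge = {("rain", "wet_ground"): 1.0, ("wet_ground", "rain"): 0.0}
    return {p: knowledge.get(p, 0.5) for p in pairs}

def scd_with_prior(data, prior, threshold=0.7):
    # Toy SCD step: keep a directed edge i -> j when the (hypothetical)
    # prior confidence from the LLM meets the threshold. A real SCD method
    # would instead fit the data subject to these constraints.
    variables = list(data.keys())
    return [(i, j) for i in variables for j in variables
            if i != j and prior.get((i, j), 0.0) >= threshold]

# Toy dataset with two variables (assumed names, for illustration only).
rng = np.random.default_rng(0)
data = {"rain": rng.random(100), "wet_ground": rng.random(100)}
pairs = [("rain", "wet_ground"), ("wet_ground", "rain")]

# Step 1: initial SCD pass with an uninformative prior.
initial = scd_with_prior(data, {p: 0.5 for p in pairs}, threshold=0.4)
# Step 2: SCP — the LLM is prompted together with the initial SCD result.
prior = mock_llm_kbci(pairs, scd_summary=initial)
# Step 3: SCD re-run, augmented with the LLM-derived prior knowledge.
final = scd_with_prior(data, prior)
```

The key design point sketched here is the feedback loop: the statistical result is fed into the LLM prompt (SCP), and the LLM's judgment is fed back into SCD as prior knowledge, so that in the example only the edge supported by background knowledge survives.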